Gemma 3 KV cache quantization issue #1423

Open
Newtrial99 opened this issue Mar 15, 2025 · 3 comments

@Newtrial99 commented Mar 15, 2025

KV cache quantization significantly slows down Gemma 3, even with smaller models and shorter context lengths where it should provide a benefit. The slowdown is observed across different quantization levels and context lengths. Other, non-Gemma 3 models (e.g. a 12B model) do not exhibit this issue with KV cache quantization and work as intended.

Initial Observation (12B Model, iq4_xs):

On an RTX 4060 (8 GB VRAM), I use the iq4_xs quantization of a 12B LLM [due to the VRAM limitation].

At an 8192 context length, with context shifting and Flash Attention enabled, performance is normal [within the speed range expected of a 4060].

However, enabling 8-bit KV cache quantization at a 16384 context length reduces the speed to approximately 1/10th of normal.

Even reducing the context length to 4096 with 8-bit KV cache quantization is significantly slower than an 8192 context length without KV cache quantization.
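
For context, here is roughly the configuration being described, as a minimal sketch. It assumes llama-cpp-python rather than the reporter's actual frontend, and the GGUF filename is a placeholder, not the exact model used:

```python
# Sketch of the configuration under discussion, not the reporter's exact setup.
# Assumes llama-cpp-python; the GGUF filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="12b-model-iq4_xs.gguf",  # hypothetical iq4_xs 12B model file
    n_ctx=16384,         # context length at which the ~10x slowdown was seen
    n_gpu_layers=-1,     # -1 = offload all layers; on 8 GB a partial offload is more realistic
    flash_attn=True,     # Flash Attention enabled, as in the report
    type_k=8,            # 8 = GGML_TYPE_Q8_0 -> 8-bit quantized K cache
    type_v=8,            # 8-bit quantized V cache
)
```

If the frontend here is KoboldCpp, the same behaviour should be reachable through its Flash Attention and KV cache quantization settings; the sketch only shows which knobs are being flipped, since the slowdown reportedly appears as soon as the 8-bit cache type is selected.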

Testing with Gemma 3 (4B, q4_k_m):

To rule out VRAM limitations, I tested the 4B Gemma 3 model using q4_k_m weight quantization.

Without KV cache quantization, and with context shifting, performance is fast even well beyond a 16384 context length.

However, enabling KV cache quantization, even at a much smaller 4096 context length, causes a dramatic slowdown.
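
To put a number on the slowdown, the same short generation can be timed with the default f16 KV cache and with the q8_0 cache. Again a rough sketch assuming llama-cpp-python, with a placeholder model path and prompt:

```python
# Timing sketch: compare decode speed with the default f16 KV cache vs. q8_0.
# Assumes llama-cpp-python; the model path and prompt are placeholders.
import time
from llama_cpp import Llama

PROMPT = "Explain how a lighthouse works."

def tokens_per_second(kv_type):
    llm = Llama(
        model_path="gemma-3-4b-it-q4_k_m.gguf",  # hypothetical filename
        n_ctx=4096,
        n_gpu_layers=-1,
        flash_attn=True,
        type_k=kv_type,   # None keeps the default f16 cache
        type_v=kv_type,
        verbose=False,
    )
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=256)
    elapsed = time.perf_counter() - start
    tps = out["usage"]["completion_tokens"] / elapsed
    del llm  # release VRAM before the next run
    return tps

print(f"f16  KV cache: {tokens_per_second(None):.1f} tok/s")
print(f"q8_0 KV cache: {tokens_per_second(8):.1f} tok/s")  # 8 = GGML_TYPE_Q8_0
```

If the behaviour in this report reproduces, the q8_0 figure should come out far lower than the f16 figure for Gemma 3, while a comparable non-Gemma model should show only a small difference between the two.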

Conclusion:

The issue appears to be specific to Gemma 3 (or potentially its implementation). Other 12B LLMs quantized with iq4_xs do not experience this performance degradation with KV cache quantization. Even the smaller Gemma 3 (4B) model slows down drastically when KV cache quantization is enabled, despite having ample VRAM and a small context length.

Thanks a lot for what you do.

@Newtrial99 (Author)

Is this mainly due to the upstream llama.cpp problem? [I saw similar issues mentioned there.]

@icsy7867 commented Mar 16, 2025

I haven't tried anything that isn't a llama.cpp-based inference engine, but I would think llama.cpp will see the fix before anything downstream does.

I'm excited about Gemma 3, but the 27B-parameter model (which seems perfect for 24 GB cards) is going to work best when squeezing as much of the context into the GPU as possible.

So, for now, I'm eagerly keeping an eye on llama.cpp.

For reference:
ggml-org#12352 (comment)

@VL4DST3R commented Mar 16, 2025

Yeah, I've also noticed this ever since I started using gemma-3-27b-it-Q4_K_M (24 GB card here), although it doesn't seem to be consistently slow. From what I gather, it uses a lot more CPU than other models loaded fully into VRAM, which, as the user above mentioned, looks very similar to llama.cpp issue #12352.

In my case I believe it was particularly noticeable because I have it set up so that when I start Kobold, it switches my Windows power profile to one that prevents the CPU from boosting past its base frequency (to keep the fans quiet when leaving my PC to just host Kobold). This obviously slows the CPU down a bit, and now, when I start generating, there is sometimes a noticeable delay before tokens actually appear.
