Gemma 3 KV cache quantization issue #1423

Open
Newtrial99 opened this issue Mar 15, 2025 · 3 comments

@Newtrial99 commented Mar 15, 2025

KV cache quantization significantly slows down Gemma 3, even with smaller models and shorter context lengths where it should provide a benefit. The slowdown is observed across different quantization levels and context lengths. Other, non-Gemma 3 models (e.g. a 12B model) do not exhibit this issue with KV cache quantization and work as intended.

Initial Observation (12B Model, iq4_xs):

On an RTX 4060 (8 GB VRAM), I use the iq4_xs quantization of a 12B LLM [due to the VRAM limitation].

At an 8192 context length, with context shifting and Flash Attention enabled, performance is normal [within the speed range expected of a 4060].

However, enabling 8-bit KV cache quantization at a 16384 context length reduces the speed to approximately 1/10th of normal.

Even reducing the context length to 4096 with 8-bit KV cache quantization is significantly slower than an 8192 context length without KV cache quantization.
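
For context, here is roughly the configuration being described, as a minimal sketch. It assumes llama-cpp-python rather than the reporter's actual frontend, and the GGUF filename is a placeholder, not the exact model used:

```python
# Sketch of the configuration under discussion, not the reporter's exact setup.
# Assumes llama-cpp-python; the GGUF filename is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="12b-model-iq4_xs.gguf",  # hypothetical iq4_xs 12B model file
    n_ctx=16384,         # context length at which the ~10x slowdown was seen
    n_gpu_layers=-1,     # -1 = offload all layers; on 8 GB a partial offload is more realistic
    flash_attn=True,     # Flash Attention enabled, as in the report
    type_k=8,            # 8 = GGML_TYPE_Q8_0 -> 8-bit quantized K cache
    type_v=8,            # 8-bit quantized V cache
)
```

If the frontend here is KoboldCpp, the same behaviour should be reachable through its Flash Attention and KV cache quantization settings; the sketch only shows which knobs are being flipped, since the slowdown reportedly appears as soon as the 8-bit cache type is selected.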

Testing with Gemma 3 (4B, q4_k_m):

To rule out VRAM limitations, I tested the 4B Gemma 3 model using q4_k_m weight quantization.

Without KV cache quantization, and with context shifting, performance is fast even well beyond a 16384 context length.

However, enabling KV cache quantization, even at a much smaller 4096 context length, causes a dramatic slowdown.
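
To put a number on the slowdown, the same short generation can be timed with the default f16 KV cache and with the q8_0 cache. Again a rough sketch assuming llama-cpp-python, with a placeholder model path and prompt:

```python
# Timing sketch: compare decode speed with the default f16 KV cache vs. q8_0.
# Assumes llama-cpp-python; the model path and prompt are placeholders.
import time
from llama_cpp import Llama

PROMPT = "Explain how a lighthouse works."

def tokens_per_second(kv_type):
    llm = Llama(
        model_path="gemma-3-4b-it-q4_k_m.gguf",  # hypothetical filename
        n_ctx=4096,
        n_gpu_layers=-1,
        flash_attn=True,
        type_k=kv_type,   # None keeps the default f16 cache
        type_v=kv_type,
        verbose=False,
    )
    start = time.perf_counter()
    out = llm(PROMPT, max_tokens=256)
    elapsed = time.perf_counter() - start
    tps = out["usage"]["completion_tokens"] / elapsed
    del llm  # release VRAM before the next run
    return tps

print(f"f16  KV cache: {tokens_per_second(None):.1f} tok/s")
print(f"q8_0 KV cache: {tokens_per_second(8):.1f} tok/s")  # 8 = GGML_TYPE_Q8_0
```

If the behaviour in this report reproduces, the q8_0 figure should come out far lower than the f16 figure for Gemma 3, while a comparable non-Gemma model should show only a small difference between the two.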

Conclusion:

The issue appears to be specific to Gemma 3 (or potentially its implementation). Other 12B LLMs quantized with iq4_xs do not experience this performance degradation with KV cache quantization. Even the smaller Gemma 3 (4B) model slows down drastically when KV cache quantization is enabled, despite having ample VRAM and a small context length.

Thanks a lot for what you do.

@Newtrial99 (Author)

Is this mainly due to the upstream llama.cpp problem? [I saw similar issues mentioned there.]

@icsy7867 commented Mar 16, 2025

I haven't tried anything that isn't a llama.cpp-based inference engine, but I would think llama.cpp will see the fix before anything downstream does.

I'm excited about Gemma 3, but the 27B-parameter model (which seems perfect for 24 GB cards) is going to work best when squeezing as much of the context into the GPU as possible.

So, for now, I'm eagerly keeping an eye on llama.cpp.

For reference:
ggml-org#12352 (comment)

@VL4DST3R commented Mar 16, 2025

Yeah, I've also noticed this ever since I started using gemma-3-27b-it-Q4_K_M (24 GB card here), although it doesn't seem to be consistently slow. From what I gather, it uses a lot more CPU than other models loaded fully into VRAM, which, as the user above mentioned, looks very similar to llama.cpp issue #12352.

In my case I believe it was particularly noticeable because I have it set up so that when I start Kobold, it switches my Windows power profile to one that prevents the CPU from boosting past its base frequency (to keep the fans quiet when leaving my PC to just host Kobold). This obviously slows the CPU down a bit, and now, when I start generating, there is sometimes a noticeable delay before tokens actually appear.
