kobold.cpp-kvadratic_experimental_v1.43.b1255_KVQ8

Nexesenex released this 18 Sep 18:41 (commit 4f88fac) · 85 commits to kvadratic_experimental since this release

Third release of mine 👍

  • Experimental version of LostRuins' KoboldCPP, built on LostRuins@2dc9668
  • Any context size can be selected (no upper limit); whether it actually works beyond 16384 tokens remains to be tested
  • 96 banned tokens and end-of-sequence tokens instead of 10 (to be tested).
  • KV-Q8_0 cache by Johannes Gaessler enabled: the KV cache takes roughly half the VRAM it did before (FP16), offering almost double the context (minus the growth of the BLAS batch buffer as the context grows).
    -> So you can halve the BBS (BLAS batch size) to get exactly double the context with KV-Q8_0 compared to KV-FP16, but prompt processing will be slower, lol.
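The "almost double" figure can be checked with back-of-the-envelope arithmetic. Below is a minimal Python sketch, assuming the ggml Q8_0 layout (blocks of 32 int8 values plus one fp16 scale, i.e. 8.5 bits per value instead of 16) and a hypothetical LLaMA-7B-like model (32 layers, embedding size 4096). The helpers `kv_cache_bytes` and `max_ctx` are illustrative names, not part of KoboldCPP or llama.cpp, and the estimate counts only the K and V tensors, ignoring the BLAS batch buffer and other scratch memory.

```python
# Rough KV cache size estimate: FP16 vs Q8_0.
# ASSUMPTIONS: illustrative (LLaMA-7B-like) model dimensions; only the K and V
# tensors are counted, not the BLAS batch buffer or other scratch buffers.

# ggml Q8_0 stores 32 values as 32 int8 weights plus one fp16 scale = 34 bytes.
Q8_0_BLOCK_VALUES = 32
Q8_0_BLOCK_BYTES = 32 * 1 + 2

BYTES_PER_VALUE = {
    "f16": 2.0,
    "q8_0": Q8_0_BLOCK_BYTES / Q8_0_BLOCK_VALUES,  # 1.0625 bytes per value
}


def kv_cache_bytes(n_layer: int, n_ctx: int, n_embd: int, cache_type: str) -> float:
    """Bytes needed for the K and V caches of all layers at a given context size."""
    values = 2 * n_layer * n_ctx * n_embd  # factor 2: one K and one V tensor
    return values * BYTES_PER_VALUE[cache_type]


def max_ctx(budget_bytes: float, n_layer: int, n_embd: int, cache_type: str) -> int:
    """Largest context that fits in a given KV cache budget (ignores other buffers)."""
    per_token = 2 * n_layer * n_embd * BYTES_PER_VALUE[cache_type]
    return int(budget_bytes // per_token)


if __name__ == "__main__":
    n_layer, n_embd, n_ctx = 32, 4096, 4096  # hypothetical 7B-class model
    gib = 1024 ** 3
    f16 = kv_cache_bytes(n_layer, n_ctx, n_embd, "f16")
    q8 = kv_cache_bytes(n_layer, n_ctx, n_embd, "q8_0")
    print(f"f16 : {f16 / gib:.4f} GiB")
    print(f"q8_0: {q8 / gib:.4f} GiB ({q8 / f16:.1%} of f16)")
    print("ctx in 2 GiB:", max_ctx(2 * gib, n_layer, n_embd, "f16"),
          "vs", max_ctx(2 * gib, n_layer, n_embd, "q8_0"))
```

Under these assumptions Q8_0 needs about 53% of the FP16 size, so a fixed 2 GiB budget holds 7710 tokens instead of 4096 (about 1.88×), which is why the gain is "almost" rather than exactly double before any BBS adjustment.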

All of this still needs thorough testing, but it loads and occupies VRAM as expected... with zero context. I'll report on my first tests tonight.

Enjoy, until LlamaCPP master and KoboldCPP get updated with this new feature!

Edit: The initial VRAM leak is back, in its "fast" version. The "old fix" stops it, but then the output is rubbish. I'll wait for the real devs to do their job. :D

  • Changelog of my "releases":
    V2 👍 (1.43.b1216) Official LlamaCPP fix for MMQ (the BBS buffer no longer grows after its pre-allocation according to context size)
    V1 👍 (1.43.b1204e, now offline) Frankenstein fix for MMQ by a code swap (the BBS buffer still grows, but slowly rather than fast)