Release kobold.cpp-kvadratic_experimental_v1.43.b1255_KVQ8 · Nexesenex/croco.cpp

Third release of mine 👍

Experimental version of LostRuins' KoboldCPP, based on LostRuins@2dc9668.
Unlimited context can be selected; it still needs testing to see whether it works beyond 16384.
96 banned tokens and end-of-sequence tokens instead of 10 (to be tested).
KV-Q_8_0 cache by Johannes Gaessler enabled: the KV cache takes about 50% less VRAM than before, offering almost double the context (minus the growth of the BLAS batch buffer due to the growth of context).
-> So you can divide the BBS (BLAS batch size) by two to get an exact double context with KV-Q_8_0 compared to KV-FP16, but the prompt processing will be slower, lol.
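
For a rough idea of the numbers, here is a small back-of-the-envelope sketch. It is not taken from the KoboldCPP or LlamaCPP code: the function names and the model dimensions (32 layers, 4096-wide K/V rows, 4096 context) are made up for illustration, and it assumes the cache uses the usual ggml Q8_0 block layout (32 int8 values plus one FP16 scale, so 34 bytes per 32 values). It also ignores the BLAS batch buffer entirely.

```python
# Rough KV cache size estimate. All model dimensions below are illustrative
# assumptions, not values read from any real model file.

FP16_BYTES_PER_VALUE = 2.0
# Assumed ggml Q8_0 layout: 32 int8 values + one FP16 scale = 34 bytes per block.
Q8_0_BYTES_PER_VALUE = 34 / 32


def kv_cache_bytes(n_layers, n_ctx, n_embd_kv, bytes_per_value):
    """Bytes used by the K and V tensors over all layers for n_ctx tokens."""
    return 2 * n_layers * n_ctx * n_embd_kv * bytes_per_value


def max_context(budget_bytes, n_layers, n_embd_kv, bytes_per_value):
    """Largest context whose KV cache fits in budget_bytes (buffers ignored)."""
    bytes_per_token = 2 * n_layers * n_embd_kv * bytes_per_value
    return int(budget_bytes // bytes_per_token)


if __name__ == "__main__":
    n_layers, n_embd_kv, n_ctx = 32, 4096, 4096  # hypothetical 7B-class model

    fp16 = kv_cache_bytes(n_layers, n_ctx, n_embd_kv, FP16_BYTES_PER_VALUE)
    q8 = kv_cache_bytes(n_layers, n_ctx, n_embd_kv, Q8_0_BYTES_PER_VALUE)

    print(f"FP16 KV cache: {fp16 / 2**20:.0f} MiB")
    print(f"Q8_0 KV cache: {q8 / 2**20:.0f} MiB ({q8 / fp16:.1%} of FP16)")
    print("Context fitting in the FP16 budget with Q8_0:",
          max_context(fp16, n_layers, n_embd_kv, Q8_0_BYTES_PER_VALUE))
```

Under these assumptions the per-block scale means the saving lands slightly under 50%, which is one more reason the extra context comes out at "almost" double rather than exactly double.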

All of this is to be tested thoroughly, but it loads and occupies the VRAM as expected, at zero context. I'll report tonight on my first tests.

Enjoy, until LlamaCPP master & KoboldCPP get updated with that new feature !

Edit : The initial VRAM leak is back, in its "fast version". The "old fix" works, but then the output is rubbish. I'll wait for the real devs to do their job. :D

Changelog of my "releases" -
V2 👍 (1.43.b1216) Official LlamaCPP fix for MMQ (the BBS buffer doesn't grow anymore after its pre-allocation according to context size)
V1 👍 (1.43.b1204e, offline now) Frankenstein fix for MMQ by a code swap (the BBS buffer still grows, but slowly instead of fast)
