Releases: Nexesenex/croco.cpp
KoboldCPP_Frankenstein_Experimental_1.52_Mixtral_v2
Based on the just-released KoboldCPP v1.52, compatible with Mixtral and Qwen, with:
- unlocked context size and BLAS batch size on the command line (example launch below)
- cuBLAS version compiled with CUDA 11.0 (for a smaller koboldcpp_cuda.exe)
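For instance, a launch making use of the unlocked values might look like the line below. The flag names are the standard upstream KoboldCPP ones; the model file and the values are purely illustrative, so check --help on your build:
```
python koboldcpp.py --model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --usecublas --gpulayers 20 --contextsize 65536 --blasbatchsize 1024
```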
Original release from LostRuins, and the full changelog:
https://github.com/LostRuins/koboldcpp/releases/tag/v1.52
KoboldCPP_Frankenstein_Experimental_1.52_Mixtral
Test version compiled for all BLAS modes except AMD hipBLAS (but including cuBLAS 11); works with Mixtral 0.1 GGUF.
kobold.cpp-kvadratic_experimental_v1.43.b1255_KVQ8
Third release of mine 👍
- Experimental version of LostRuins' KoboldCPP with LostRuins@2dc9668
- Unlimited context can be selected; still to be tested to see whether it works beyond 16384.
- 96 banned tokens and end-of-sequence tokens instead of 10 (to be tested).
- KV-Q8_0 cache by Johannes Gaessler enabled: the KV cache takes roughly 50% less VRAM than before, offering almost double the context (minus the growth of the BLAS batch buffer as the context grows).
-> So you can halve the BLAS batch size (BBS) to get exactly double the context with KV-Q8_0 compared to KV-FP16, but the prompt processing will be slower lol. A rough estimate of the saving is sketched below.
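As a back-of-the-envelope check of that ~50% figure, here is a minimal sketch of the KV-cache VRAM arithmetic in Python. The model shape is an illustrative assumption (a Llama-2-7B-like layout: 32 layers, 4096 KV width, no GQA); the Q8_0 cost assumes GGML's Q8_0 layout of 32 int8 values plus one fp16 scale (34 bytes per block of 32). Substitute your own model's numbers.
```python
# Rough KV-cache size estimate: FP16 vs Q8_0 quantized cache.
# Illustrative model shape only (32 layers, 4096 KV width, no GQA).

def kv_cache_bytes(n_layers, n_ctx, n_embd_kv, bytes_per_value):
    # K and V tensors: one pair per layer, one row per context position
    return 2 * n_layers * n_ctx * n_embd_kv * bytes_per_value

FP16 = 2.0        # 2 bytes per value
Q8_0 = 34 / 32    # GGML Q8_0: 32 int8 values + 1 fp16 scale = 34 bytes per 32 values

n_layers, n_embd_kv = 32, 4096

for n_ctx in (4096, 8192, 16384):
    fp16 = kv_cache_bytes(n_layers, n_ctx, n_embd_kv, FP16) / 2**30
    q8   = kv_cache_bytes(n_layers, n_ctx, n_embd_kv, Q8_0) / 2**30
    print(f"ctx {n_ctx:>5}: FP16 {fp16:5.2f} GiB | Q8_0 {q8:5.2f} GiB ({q8 / fp16:.0%} of FP16)")
```
At a given VRAM budget, that ~53% per-value cost is what allows roughly twice the context, once the BLAS batch buffer is kept in check.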
All of this is to be tested thoroughly, but it loads and occupies the VRAM as expected... with zero context. I'll report tonight on my first tests.
Enjoy, until LlamaCPP master & KoboldCPP get updated with that new feature!
Edit: The initial VRAM leak is back, in its "fast version". The "old fix" works, but then the output is rubbish. I'll wait for the real devs to do their job. :D
- Changelog of my "releases" -
V2 👍 (1.43.b1216) Official LlamaCPP fix for MMQ (the BBS buffer doesn't grow anymore after its pre-allocation according to context size)
V1 👍 (1.43.b1204e, now offline) Frankenstein fix for MMQ via a code swap (the BBS buffer still grows, but slowly instead of fast)
kobold.cpp-elephantastic_experimental_v1.43.b1216
KoboldCPP v1.43 with CUDA/cuBLAS MMQ fixed (buffers are allocated properly from the start), and unrestricted context size.
CodeLlama 34B in Q4_K_S can run with 16384 context on an RTX 3090/4090 used as a second graphics card.