Releases: Nexesenex/croco.cpp
KoboldCPP_Frankenstein_Experimental_1.52_Mixtral_v2
Based on the just-released KoboldCPP v1.52, compatible with Mixtral and Qwen, with:
- unlocked context size and BLAS batch size on the command line (example launch below)
- cuBLAS version compiled with CUDA 11.0 (for a smaller koboldcpp_cuda.exe)
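For instance, a launch making use of the unlocked values might look like the line below. The flag names are the standard upstream KoboldCPP ones; the model file and the values are purely illustrative, so check --help on your build:
```
python koboldcpp.py --model mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf --usecublas --gpulayers 20 --contextsize 65536 --blasbatchsize 1024
```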
Original release from LostRuins, and the full changelog:
https://github.com/LostRuins/koboldcpp/releases/tag/v1.52
KoboldCPP_Frankenstein_Experimental_1.52_Mixtral
Test version compiled for all BLAS modes except AMD hipBLAS (but including cuBLAS 11); works with Mixtral 0.1 GGUF.
kobold.cpp-kvadratic_experimental_v1.43.b1255_KVQ8
Third release of mine 👍
- Experimental version of LostRuins' KoboldCPP with LostRuins@2dc9668
- Unlimited context can be selected; still to be tested to see whether it works beyond 16384.
- 96 banned tokens and end-of-sequence tokens instead of 10 (to be tested).
- KV-Q8_0 cache by Johannes Gaessler enabled: the KV cache takes roughly 50% less VRAM than before, offering almost double the context (minus the growth of the BLAS batch buffer as the context grows).
-> So you can halve the BLAS batch size (BBS) to get exactly double the context with KV-Q8_0 compared to KV-FP16, but the prompt processing will be slower lol. A rough estimate of the saving is sketched below.
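As a back-of-the-envelope check of that ~50% figure, here is a minimal sketch of the KV-cache VRAM arithmetic in Python. The model shape is an illustrative assumption (a Llama-2-7B-like layout: 32 layers, 4096 KV width, no GQA); the Q8_0 cost assumes GGML's Q8_0 layout of 32 int8 values plus one fp16 scale (34 bytes per block of 32). Substitute your own model's numbers.
```python
# Rough KV-cache size estimate: FP16 vs Q8_0 quantized cache.
# Illustrative model shape only (32 layers, 4096 KV width, no GQA).

def kv_cache_bytes(n_layers, n_ctx, n_embd_kv, bytes_per_value):
    # K and V tensors: one pair per layer, one row per context position
    return 2 * n_layers * n_ctx * n_embd_kv * bytes_per_value

FP16 = 2.0        # 2 bytes per value
Q8_0 = 34 / 32    # GGML Q8_0: 32 int8 values + 1 fp16 scale = 34 bytes per 32 values

n_layers, n_embd_kv = 32, 4096

for n_ctx in (4096, 8192, 16384):
    fp16 = kv_cache_bytes(n_layers, n_ctx, n_embd_kv, FP16) / 2**30
    q8   = kv_cache_bytes(n_layers, n_ctx, n_embd_kv, Q8_0) / 2**30
    print(f"ctx {n_ctx:>5}: FP16 {fp16:5.2f} GiB | Q8_0 {q8:5.2f} GiB ({q8 / fp16:.0%} of FP16)")
```
At a given VRAM budget, that ~53% per-value cost is what allows roughly twice the context, once the BLAS batch buffer is kept in check.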
All of this is to be tested thoroughly, but it loads and occupies the VRAM as expected... with zero context. I'll report tonight on my first tests.
Enjoy, until LlamaCPP master & KoboldCPP get updated with that new feature!
Edit: The initial VRAM leak is back, in its "fast version". The "old fix" works, but then the output is rubbish. I'll wait for the real devs to do their job. :D
- Changelog of my "releases" -
V2 👍 (1.43.b1216) Official LlamaCPP fix for MMQ (the BBS buffer doesn't grow anymore after its pre-allocation according to context size)
V1 👍 (1.43.b1204e, now offline) Frankenstein fix for MMQ via a code swap (the BBS buffer still grows, but slowly instead of fast)
kobold.cpp-elephantastic_experimental_v1.43.b1216
KoboldCPP v1.43 with CUDA/cuBLAS MMQ fixed (buffers are allocated properly from the start), and unrestricted context size.
CodeLlama 34B in Q4_K_S can run with 16384 context on an RTX 3090/4090 used as a second graphics card.