CUDA error in all versions newer than v1.79 #1412

Open
ivanesons opened this issue Mar 8, 2025 · 7 comments
Labels
bug Something isn't working

Comments

@ivanesons

ivanesons commented Mar 8, 2025

Describe the Issue
Starting from 1.80, most models cause a CUDA error once the context is long enough. It doesn't happen instantly: the first 2-3 requests work fine, but after that the following error appears in the console:

[Context Shifting: Erased 96 tokens at position 2993]
Processing Prompt [BLAS] (97 / 97 tokens)
Caution: pre-allocated tensor (cache_k_l0 (view) (view)) in a buffer (CPU) that cannot run the operation (ROPE)

Note that if you are using Quantized KV, not all backends support it!
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at D:\a\koboldcpp\koboldcpp\ggml\src\ggml-cuda\ggml-cuda.cu:2424
  cudaStreamSynchronize(cuda_ctx->stream())
D:\a\koboldcpp\koboldcpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error

Enabling or disabling MMQ makes no difference.
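
Roughly, what I'm doing is equivalent to this sketch (my assumptions: the default port 5001 and the standard /api/v1/generate endpoint; "long_story.txt" is just a placeholder for any sufficiently long text; the sampler values are the ones from the log below):

import requests

API = "http://localhost:5001/api/v1/generate"   # default port from the log below

# Placeholder: any text of roughly 12k tokens works; the crash only shows up
# once the context is nearly full and Context Shifting starts erasing tokens.
long_prompt = open("long_story.txt", encoding="utf-8").read()

payload = {
    "prompt": long_prompt,
    "max_context_length": 12288,   # same values as the requests in the log
    "max_length": 512,
    "temperature": 0.61,
    "top_p": 0.92,
    "top_k": 100,
    "rep_pen": 1.1,
}

for i in range(5):
    r = requests.post(API, json=payload, timeout=600)
    r.raise_for_status()
    text = r.json()["results"][0]["text"]
    print(f"request {i + 1}: ok, {len(text)} chars generated")
    # Appending the output makes the next request overflow the context and
    # forces a shift; on 1.80+ that request dies with the CUDA error above.
    payload["prompt"] += text

The first one or two iterations succeed; the request that triggers Context Shifting is the one that crashes.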

Additional Information:
My hardware:
CPU: i3-10105
RAM: 32 GB DDR4
GPU: RTX 3050 (GA106) 8 GB

Models that cause this issue (but don't on 1.79):

  • DeepSeek-BlackRoot-R1-Distill-Llama-3.1-8B-D_AU-Q4_k_m
  • Llama-3-8B-Stroganoff-4.0-Q5_K_M
  • LLAMA-3_8B_Unaligned_BETA.Q5_K_M

Models that don't cause it on 1.80+ or cause it very rarely:

  • L3-Super-Nova-RP-8B-V1-exp8-11-D_AU-Q4_k_m
  • Sailor2-8B-Chat-Uncen-q4_k_m

Full log (with the context removed)

***
Welcome to KoboldCpp - Version 1.85.1
For command line arguments, please refer to --help
***
Auto Selected CUDA Backend...

Initializing dynamic library: koboldcpp_cublas.dll
==========
Namespace(admin=False, admindir='', adminpassword='', analyze='', benchmark=None, blasbatchsize=512, blasthreads=8, chatcompletionsadapter=None, config=None, contextsize=12288, debugmode=0, draftamount=8, draftgpulayers=999, draftgpusplit=None, draftmodel=None, exportconfig='', exporttemplate='', failsafe=False, flashattention=True, forceversion=0, foreground=False, gpulayers=60, highpriority=False, hordeconfig=None, hordegenlen=0, hordekey='', hordemaxctx=0, hordemodelname='', hordeworkername='', host='', ignoremissing=False, launch=False, lora=None, mmproj=None, model=[], model_param='F:/Torrent/_llm/DeepSeek-BlackRoot-R1-Distill-Llama-3.1-8B-D_AU-Q4_k_m.gguf', moeexperts=-1, multiplayer=False, multiuser=1, noavx2=False, noblas=False, nocertify=False, nofastforward=False, nommap=False, nomodel=False, noshift=False, onready='', password=None, port=5001, port_param=5001, preloadstory=None, prompt='', promptlimit=100, quantkv=0, quiet=False, remotetunnel=False, ropeconfig=[0.0, 10000.0], savedatafile=None, sdclamped=0, sdclipg='', sdclipl='', sdconfig=None, sdlora='', sdloramult=1.0, sdmodel='', sdnotile=False, sdquant=False, sdt5xxl='', sdthreads=3, sdvae='', sdvaeauto=False, showgui=False, skiplauncher=False, smartcontext=False, ssl=None, tensor_split=None, threads=8, ttsgpu=False, ttsmaxlen=4096, ttsmodel='', ttsthreads=0, ttswavtokenizer='', unpack='', useclblast=None, usecpu=False, usecublas=['lowvram', '0', 'nommq'], usemlock=False, usemmap=False, usevulkan=None, version=False, visionmaxres=1024, websearch=False, whispermodel='')
==========
Loading Text Model: F:\_llm\DeepSeek-BlackRoot-R1-Distill-Llama-3.1-8B-D_AU-Q4_k_m.gguf

The reported GGUF Arch is: llama
Arch Category: 0

---
Identified as GGUF model: (ver 6)
Attempting to Load...
---
Using automatic RoPE scaling for GGUF. If the model has custom RoPE settings, they'll be used directly instead!
System Info: AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | AMX_INT8 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
---
Initializing CUDA/HIP, please wait, the following step may take a few minutes for first launch...
---
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3050, compute capability 8.6, VMM: yes
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3050) - 7196 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from F:\_llm\DeepSeek-BlackRoot-R1-Distill-Llama-3.1-8B-D_AU-Q4_k_m.gguf
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 4.58 GiB (4.89 BPW)
init_tokenizer: initializing tokenizer for type 2
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = linear
print_info: freq_base_train  = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 8B
print_info: model params     = 8.03 B
print_info: general.name     = 8b Deepseek
print_info: vocab type       = BPE
print_info: n_vocab          = 128256
print_info: n_merges         = 280147
print_info: BOS token        = 128000 '<｜begin▁of▁sentence｜>'
print_info: EOS token        = 128001 '<｜end▁of▁sentence｜>'
print_info: EOT token        = 128009 '<|eot_id|>'
print_info: EOM token        = 128008 '<|eom_id|>'
print_info: PAD token        = 128001 '<｜end▁of▁sentence｜>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 128001 '<｜end▁of▁sentence｜>'
print_info: EOG token        = 128008 '<|eom_id|>'
print_info: EOG token        = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = false)
load_tensors: relocated tensors: 1 of 323
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:          CPU model buffer size =   281.81 MiB
load_tensors:        CUDA0 model buffer size =  4403.49 MiB
load_all_data: no device found for buffer type CPU for async uploads
load_all_data: using async uploads for device CUDA0, buffer type CUDA0, backend CUDA0
........................................................................................
Automatic RoPE Scaling: Using (scale:1.000, base:500000.0).
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 12544
llama_init_from_model: n_ctx_per_seq = 12544
llama_init_from_model: n_batch       = 512
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 500000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (12544) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 12544, offload = 0, type_k = 'f16', type_v = 'f16', n_layer = 32, can_shift = 1
llama_kv_cache_init:        CPU KV buffer size =  1568.00 MiB
llama_init_from_model: KV self size  = 1568.00 MiB, K (f16):  784.00 MiB, V (f16):  784.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_init_from_model:      CUDA0 compute buffer size =   266.50 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    32.51 MiB
llama_init_from_model: graph nodes  = 903
llama_init_from_model: graph splits = 66
Load Text Model OK: True
Embedded KoboldAI Lite loaded.
Embedded API docs loaded.
======
Active Modules: TextGeneration
Inactive Modules: ImageGeneration VoiceRecognition MultimodalVision NetworkMultiplayer ApiKeyPassword WebSearchProxy TextToSpeech AdminControl
Enabled APIs: KoboldCppApi OpenAiApi OllamaApi
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
======
Please connect to custom endpoint at http://localhost:5001

Input: {"n": 1, "max_context_length": 12288, "max_length": 512, "rep_pen": 1.1, "temperature": 0.61, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": (+53833 chars)

Processing Prompt [BLAS] (11776 / 11776 tokens)
Generating (3 / 512 tokens)
(Stop sequence triggered: \n> )
[17:54:52] CtxLimit:11779/12288, Amt:3/512, Init:0.02s, Process:19.99s (1.7ms/T = 589.07T/s), Generate:0.95s (316.7ms/T = 3.16T/s), Total:20.94s (0.14T/s)
Output:


Input: {"n": 1, "max_context_length": 12288, "max_length": 512, "rep_pen": 1.1, "temperature": 0.61, "top_p": 0.92, "top_k": 100, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 320, "rep_pen_slope": 0.7, "sampler_order": [6, 0, 1, 3, 4, 2, 5], "memory": (+53828 chars)

[Context Shifting: Erased 96 tokens at position 2993]
Processing Prompt [BLAS] (97 / 97 tokens)
Caution: pre-allocated tensor (cache_k_l0 (view) (view)) in a buffer (CPU) that cannot run the operation (ROPE)

Note that if you are using Quantized KV, not all backends support it!
CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at D:\a\koboldcpp\koboldcpp\ggml\src\ggml-cuda\ggml-cuda.cu:2424
  cudaStreamSynchronize(cuda_ctx->stream())
D:\a\koboldcpp\koboldcpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error

What could be the reason for this?

@ivanesons
Author

I also checked the model NousResearch_DeepHermes-3-Llama-3-8B-Preview-Q4_K_M; the problem is the same:

CUDA error: an illegal memory access was encountered
  current device: 0, in function ggml_backend_cuda_synchronize at D:\a\koboldcpp\koboldcpp\ggml\src\ggml-cuda\ggml-cuda.cu:2424
  cudaStreamSynchronize(cuda_ctx->stream())
D:\a\koboldcpp\koboldcpp\ggml\src\ggml-cuda\ggml-cuda.cu:75: CUDA error

@LostRuins LostRuins added the bug Something isn't working label Mar 13, 2025
@LostRuins
Owner

Tried the model you linked and it worked fine for me. I'm not too sure what's wrong.
Could you check if your GPU drivers are up to date?
Also, this configuration seems to be nearly exceeding 8 GB of VRAM, and VRAM requirements have increased slightly since 1.79. Could you try reducing the number of offloaded layers to test?
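
For example, if launching from the command line, something along these lines (the flags mirror the launch namespace in your log; 24 layers is just an arbitrary value to test with):

koboldcpp.exe --model F:\_llm\DeepSeek-BlackRoot-R1-Distill-Llama-3.1-8B-D_AU-Q4_k_m.gguf --gpulayers 24 --contextsize 12288 --usecublas lowvram 0 nommq --flashattention

Or just lower the GPU Layers value in the launcher GUI.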

@ivanesons
Author

ivanesons commented Mar 13, 2025

your GPU drivers are up to date?

Yes, I update it as soon as a new version is available. My current driver is NVIDIA Game Ready 572.70, released March 5, 2025 :)

reducing the number of offloaded layers to test?

I just tried it with NousResearch_DeepHermes-3-Llama-3-8B-Preview-Q4_K_M: no, there's no difference, it crashes after the first or second stop and pressing "generate more". 33 and 20 layers crash it in the same way, but the line before the crash message is slightly different. With 33 layers, it writes:

Caution: pre-allocated tensor (cache_k_l0 (view) (view)) in a buffer (CPU) that cannot run the operation (ROPE)

With 20 layers:

Caution: pre-allocated tensor (cache_k_l12 (view) (view)) in a buffer (CPU) that cannot run the operation (ROPE)

Also, I tested it with L3-Super-Nova-RP-8B-V1-exp8-11-D_AU-Q4_k_m (40 layers to GPU) and it works great: it endured 30 stop-starts (and could probably endure more, I just didn't try further). So even if this is related to a lack of memory, it still seems to depend heavily on the model.

Here's the file I tested it on, in case it helps.

saved_story - failing test.json

@LostRuins
Owner

Hmm, that's very strange indeed. I suppose everything works fine in Vulkan, though?
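
From the command line that would be --usevulkan in place of --usecublas, for example (again mirroring the rest of your launch settings; 33 layers as in your CUDA run):

koboldcpp.exe --model F:\_llm\DeepSeek-BlackRoot-R1-Distill-Llama-3.1-8B-D_AU-Q4_k_m.gguf --gpulayers 33 --contextsize 12288 --usevulkan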

@ivanesons
Author

I checked it with Vulkan: it doesn't crash, but processing the context takes VERY long. On CUDA such a long context takes 15-40 seconds to process, but with Vulkan it took 1 HOUR 7 minutes! And the text generation itself is very slow, 1 word per 3-5 seconds, while on CUDA it's 1-2 words per second.
I waited it out for the test, but obviously it's hard to use it this way.

If it's impossible to fix in future versions, I would be happy to at least get information on why some models crash it while others don't, so I can tell whether a particular model will work before downloading it.

@LostRuins
Owner

On Vulkan, you selected the same GPU (the RTX 3050) from the GUI?
Something is really weird with your system.

@ivanesons
Author

Of course, right after selecting Vulkan in the "presets" list, it selects my only GPU (3050) in the "GPU id" list.

I also just tested CUDA with "flash attention" disabled: context processing was much slower than with that option enabled (it took about 6 minutes). But it didn't affect the crashing: I stopped it after it generated the first phrase, pressed "generate more", and right after that the same error appeared.
