CUDA error in all versions newer than v1.79 #1412
Comments
I also checked the model NousResearch_DeepHermes-3-Llama-3-8B-Preview-Q4_K_M; the problem is the same:
Tried the model you linked and it worked fine for me. I'm not too sure what's wrong.
Yes, I update it right away as soon as a new one is available. My current driver is NVIDIA Game Ready 572.70, released March 5, 2025 :)
Just tried it with NousResearch_DeepHermes-3-Llama-3-8B-Preview-Q4_K_M, and no, there's no difference: it crashes after the first or second stop followed by pressing "Generate More". Both 33 and 20 layers crash it the same way, but the line before the crash message is slightly different. With 33 layers, it writes:
With 20 layers:
Also, I tested it with L3-Super-Nova-RP-8B-V1-exp8-11-D_AU-Q4_k_m (40 layers on the GPU) and it works great: it endured 30 stop-starts (and would probably endure more; I just stopped trying). So it seems that even if this is related to running out of memory, it still depends heavily on the model. Here's the file I tested it on, in case that helps.
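In case it helps to reason about the memory angle, here's the kind of back-of-the-envelope arithmetic one can use to guess whether an offload setting should fit. This is a minimal sketch with illustrative numbers (a ~4.9 GB 8B Q4_K_M file, a flat 1 GB overhead guess), not values measured from the models above:

```python
# Rough VRAM estimate for offloading N layers of a GGUF model.
# All numbers below are illustrative assumptions, not measurements
# from the files discussed in this issue.

def estimate_offload_vram_gb(model_file_gb: float,
                             total_layers: int,
                             offloaded_layers: int,
                             overhead_gb: float = 1.0) -> float:
    """Approximate VRAM use: a per-layer share of the model weights,
    plus a flat allowance for the KV cache, scratch buffers, and the
    CUDA context itself (the overhead_gb term is a guess)."""
    per_layer_gb = model_file_gb / total_layers
    return per_layer_gb * offloaded_layers + overhead_gb

# Example: an 8B Q4_K_M file is roughly 4.9 GB with 32 transformer
# layers; "33 layers" in the loader usually means all 32 plus the
# output layer, i.e. essentially the whole model.
print(estimate_offload_vram_gb(4.9, 32, 32))  # ~5.9 GB on an 8 GB card
print(estimate_offload_vram_gb(4.9, 32, 20))  # ~4.1 GB, more headroom
```

By this rough estimate, both 33 and 20 layers should leave some headroom on an 8 GB card, which is why a plain out-of-memory explanation doesn't obviously fit.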
Hmm, that's very strange indeed. I suppose everything works fine in Vulkan though?
I checked it with Vulkan: it doesn't crash, but processing the context is VERY slow. On CUDA, a context this long takes 15-40 seconds to process, but with Vulkan it took 1 hour 7 minutes! And the text generation itself is very slow too, 1 word per 3-5 seconds, while on CUDA it's 1-2 words per second. If this is impossible to fix in future versions, I would at least be happy to know why some models crash it while others don't, so I could tell whether a particular model will work before downloading it.
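For perspective, the implied prompt-processing rates differ by roughly two orders of magnitude. A quick calculation, using a hypothetical 8192-token prompt since the exact context size wasn't stated:

```python
# Implied prompt-processing speed for each backend, from the timings
# reported above. The prompt length is an assumption; only the ratio
# between the two backends matters here.
prompt_tokens = 8192           # hypothetical prompt size
cuda_seconds = 30              # midpoint of the reported 15-40 s
vulkan_seconds = 67 * 60       # 1 hour 7 minutes

print(f"CUDA:   {prompt_tokens / cuda_seconds:7.1f} tok/s")
print(f"Vulkan: {prompt_tokens / vulkan_seconds:7.1f} tok/s")
print(f"Slowdown: ~{vulkan_seconds / cuda_seconds:.0f}x")
```

That is roughly a 130x slowdown, far beyond what a backend difference alone would normally explain, so the Vulkan path may be falling back to CPU or hitting some other pathological case.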
On Vulkan, you selected the same GPU (the RTX 3050) from the GUI?
Of course; right after selecting Vulkan in the "Presets" list, it selects my only GPU (the 3050) in the "GPU ID" list. I also just tested CUDA with "Flash Attention" disabled: context processing was much slower than with that option enabled, taking about 6 minutes, but it didn't affect the crashing. I stopped it after the first phrase was generated, pressed "Generate More", and right after that got the same error.
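Since long context seems to be the trigger, it may be worth noting how quickly the KV cache alone grows with context length. This is a minimal sketch using the publicly known architecture numbers for the Llama-3-8B family (32 layers, 8 KV heads via GQA, head dimension 128) and an fp16 cache; treat it as an order-of-magnitude estimate, since the actual backend allocates scratch and compute buffers on top of this:

```python
# fp16 KV-cache size for a Llama-3-8B-style model: one K and one V
# entry per layer, per KV head, per head dimension, per context token.

def kv_cache_bytes(n_layers: int = 32,
                   n_kv_heads: int = 8,      # GQA: 8 KV heads
                   head_dim: int = 128,
                   ctx_tokens: int = 8192,   # assumed context size
                   bytes_per_elem: int = 2   # fp16
                   ) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem

print(kv_cache_bytes() / 2**30)  # ~1.0 GiB at 8k context
```

At 8k context that is about 1 GiB just for the cache, on top of roughly 5 GB of weights on an 8 GB card, leaving little headroom for the temporary buffers CUDA allocates during prompt processing; which buffers get allocated can vary by model architecture, which might explain why only some models crash.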
Describe the Issue
Starting from 1.80, most models cause a CUDA error once the context is long enough. It doesn't happen instantly: the first 2-3 requests work fine, but after that an error occurs, with the following text in the console:
Enabling/disabling MMQ doesn't affect anything.
Additional Information:
My hardware:
CPU: i3-10105
RAM: 32 GB DDR4
GPU: RTX 3050 (GA106) 8 GB
Models that cause this issue (but don't cause it on 1.79):
Models that don't cause it on 1.80+ or cause it very rarely:
Full log (with the context removed)
What could be the reason for this?