Try to run DeepSeek-Coder-V2-Lite with 16G GPU memory and get Out of memory error #4156

Open
Lambda14 opened this issue Apr 15, 2025 · 5 comments

@Lambda14

Hello, I'm trying to run Tabby with the DeepSeek-Coder-V2-Lite model on Windows using the command: .\tabby.exe serve --model DeepSeek-Coder-V2-Lite --chat-model Qwen2-1.5B-Instruct --device cuda

and I get a memory allocation error: allocating 15712.47 MiB on device 0: cudaMalloc failed: out of memory
This is happening on a server with a Tesla P100 GPU.

[Screenshot of the error output]

However, on another computer with an RTX 3070, the same model runs via Docker, although very slowly.

Why is this happening?

@Lambda14 Lambda14 changed the title Out of memory Out of memory error Apr 15, 2025
@zwpaper zwpaper changed the title Out of memory error Try to run DeepSeek-Coder-V2-Lite with 16G GPU memory and get Out of memory error Apr 18, 2025
@zwpaper
Member

zwpaper commented Apr 18, 2025

Hello @Lambda14, I have verified that DeepSeek-Coder-V2-Lite has 16B parameters, so 16GB of GPU memory is not enough for it, which is the likely cause of the out-of-memory error.

This seems to be working as expected; you may want to use a model with fewer parameters.
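
For reference, here is a back-of-envelope calculation (my own sketch; the bytes-per-weight figures are rough assumptions for common llama.cpp quantization formats, not taken from the Tabby registry) showing why a 16B-parameter model does not fit comfortably in 16 GB:

```python
# Back-of-envelope estimate of VRAM needed just to hold the model weights.
# Bytes-per-weight values are approximate assumptions; KV cache, compute
# buffers, and CUDA context overhead come on top of this.
GIB = 1024 ** 3

def weight_vram_gib(n_params: float, bytes_per_weight: float) -> float:
    return n_params * bytes_per_weight / GIB

n_params = 16e9  # DeepSeek-Coder-V2-Lite: ~16B parameters

for fmt, bits_per_weight in [("FP16", 16), ("Q8_0 (~8.5 bpw)", 8.5), ("Q4_K_M (~4.8 bpw)", 4.8)]:
    gib = weight_vram_gib(n_params, bits_per_weight / 8)
    print(f"{fmt:>18}: {gib:5.1f} GiB")

# Approximate output:
#               FP16:  29.8 GiB
#    Q8_0 (~8.5 bpw):  15.8 GiB
#  Q4_K_M (~4.8 bpw):   8.9 GiB
```

The roughly 15.8 GiB figure for an 8-bit quantization is in the same ballpark as the 15712.47 MiB allocation in your error, and that is before the KV cache and CUDA context are added, so a 16 GB P100 leaves essentially no headroom.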

@Lambda14
Author

@zwpaper Hello, thanks for the reply, but why does this error not occur when the model is run via Docker?

@Lambda14
Author

Lambda14 commented Apr 19, 2025

OK, now I have tried running the Qwen2.5-Coder-3B model without a chat model.
It starts successfully, but when I send a request, I get this error:

2025-04-19T11:35:24.909336Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:124: llama-server <completion> exited with status code -1073740791, args: `Command { std: "C:\\Users\\leo\\Desktop\\tabby_x86_64-windows-msvc-cuda124\\llama-server.exe" "-m" "C:\\Users\\leo\\.tabby\\models\\TabbyML\\Qwen2.5-Coder-3B\\ggml\\model-00001-of-00001.gguf" "--cont-batching" "--port" "30889" "-np" "1" "--ctx-size" "4096" "-ngl" "9999", kill_on_drop: true }`
Recent llama-cpp errors:

load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:        CUDA0 model buffer size =  3127.61 MiB
load_tensors:   CPU_Mapped model buffer size =   315.30 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   144.00 MiB
llama_init_from_model: KV self size  =  144.00 MiB, K (f16):   72.00 MiB, V (f16):   72.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_init_from_model:      CUDA0 compute buffer size =   300.75 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_init_from_model: graph nodes  = 1266
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:30889 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 4
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4, n_tokens = 4, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 4, n_tokens = 4
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:73: CUDA error
ggml_cuda_compute_forward: MUL_MAT failed
CUDA error: unspecified launch failure
2025-04-19T11:35:24.947975Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:164: Attempting to restart the llama-server...

Here is my nvidia-smi output with Tabby running:

[Screenshot of nvidia-smi output]

@zwpaper
Member

zwpaper commented May 7, 2025

The MUL_MAT failed error is likely due to an issue in upstream llama.cpp: ggml-org/llama.cpp#13252
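
One way to confirm that the crash comes from the CUDA path rather than the model itself (a diagnostic sketch only, reusing the binary and model paths from your log; adjust them for your machine) is to launch llama-server directly with GPU offload disabled via -ngl 0 and repeat the request:

```python
# Diagnostic sketch: re-run the llama-server command from the Tabby log, but
# with "-ngl 0" so that no layers are offloaded to the GPU. If the MUL_MAT
# crash does not reproduce on CPU, the problem is in the CUDA path.
import subprocess

cmd = [
    r"C:\Users\leo\Desktop\tabby_x86_64-windows-msvc-cuda124\llama-server.exe",
    "-m", r"C:\Users\leo\.tabby\models\TabbyML\Qwen2.5-Coder-3B\ggml\model-00001-of-00001.gguf",
    "--cont-batching",
    "--port", "30889",
    "-np", "1",
    "--ctx-size", "4096",
    "-ngl", "0",  # CPU-only inference
]
subprocess.run(cmd, check=True)
```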

@zwpaper
Member

zwpaper commented May 15, 2025

Hi @Lambda14, what is the model of your CPU? We have encountered a similar failure caused by a CPU lacking support for certain AVX instructions.
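
If it helps, here is a quick way to print the CPU model and the instruction-set flags it reports (a sketch assuming the third-party py-cpuinfo package, which is unrelated to Tabby):

```python
# Sketch: report the CPU model and whether AVX/AVX2 show up in its flags.
# Requires the third-party "py-cpuinfo" package: pip install py-cpuinfo
import cpuinfo

info = cpuinfo.get_cpu_info()
flags = set(info.get("flags", []))
print("CPU: ", info.get("brand_raw"))
print("AVX: ", "avx" in flags)
print("AVX2:", "avx2" in flags)
```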
