Try to run DeepSeek-Coder-V2-Lite with 16G GPU memory and get Out of memory error #4156

Open
Lambda14 opened this issue Apr 15, 2025 · 5 comments

@Lambda14

Hello, I'm trying to run Tabby with the DeepSeek-Coder-V2-Lite model on Windows using the command: .\tabby.exe serve --model DeepSeek-Coder-V2-Lite --chat-model Qwen2-1.5B-Instruct --device cuda

and I get a memory allocation error: allocating 15712.47 MiB on device 0: cudaMalloc failed: out of memory
This is happening on a server with a Tesla P100 GPU.

[Screenshot of the error output]

However, on another computer with an RTX 3070, the same model runs via Docker, although very slowly.

Why is this happening?

@Lambda14 Lambda14 changed the title Out of memory Out of memory error Apr 15, 2025
@zwpaper zwpaper changed the title Out of memory error Try to run DeepSeek-Coder-V2-Lite with 16G GPU memory and get Out of memory error Apr 18, 2025
@zwpaper
Member

zwpaper commented Apr 18, 2025

Hello @Lambda14, I have verified that DeepSeek-Coder-V2-Lite has 16B parameters, so 16GB of GPU memory is not enough for it, which is the likely cause of the out-of-memory error.

This seems to be working as expected; you may want to use a model with fewer parameters.
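
For reference, here is a back-of-envelope calculation (my own sketch; the bytes-per-weight figures are rough assumptions for common llama.cpp quantization formats, not taken from the Tabby registry) showing why a 16B-parameter model does not fit comfortably in 16 GB:

```python
# Back-of-envelope estimate of VRAM needed just to hold the model weights.
# Bytes-per-weight values are approximate assumptions; KV cache, compute
# buffers, and CUDA context overhead come on top of this.
GIB = 1024 ** 3

def weight_vram_gib(n_params: float, bytes_per_weight: float) -> float:
    return n_params * bytes_per_weight / GIB

n_params = 16e9  # DeepSeek-Coder-V2-Lite: ~16B parameters

for fmt, bits_per_weight in [("FP16", 16), ("Q8_0 (~8.5 bpw)", 8.5), ("Q4_K_M (~4.8 bpw)", 4.8)]:
    gib = weight_vram_gib(n_params, bits_per_weight / 8)
    print(f"{fmt:>18}: {gib:5.1f} GiB")

# Approximate output:
#               FP16:  29.8 GiB
#    Q8_0 (~8.5 bpw):  15.8 GiB
#  Q4_K_M (~4.8 bpw):   8.9 GiB
```

The roughly 15.8 GiB figure for an 8-bit quantization is in the same ballpark as the 15712.47 MiB allocation in your error, and that is before the KV cache and CUDA context are added, so a 16 GB P100 leaves essentially no headroom.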

@Lambda14
Author

@zwpaper Hello, thanks for the reply, but why does this error not occur when the model is run via Docker?

@Lambda14
Author

Lambda14 commented Apr 19, 2025

OK, now I have tried running the Qwen2.5-Coder-3B model without a chat model.
It starts successfully, but when I send a request, I get this error:

2025-04-19T11:35:24.909336Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:124: llama-server <completion> exited with status code -1073740791, args: `Command { std: "C:\\Users\\leo\\Desktop\\tabby_x86_64-windows-msvc-cuda124\\llama-server.exe" "-m" "C:\\Users\\leo\\.tabby\\models\\TabbyML\\Qwen2.5-Coder-3B\\ggml\\model-00001-of-00001.gguf" "--cont-batching" "--port" "30889" "-np" "1" "--ctx-size" "4096" "-ngl" "9999", kill_on_drop: true }`
Recent llama-cpp errors:

load_tensors: offloading output layer to GPU
load_tensors: offloaded 37/37 layers to GPU
load_tensors:        CUDA0 model buffer size =  3127.61 MiB
load_tensors:   CPU_Mapped model buffer size =   315.30 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 1000000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 36, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =   144.00 MiB
llama_init_from_model: KV self size  =  144.00 MiB, K (f16):   72.00 MiB, V (f16):   72.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.58 MiB
llama_init_from_model:      CUDA0 compute buffer size =   300.75 MiB
llama_init_from_model:  CUDA_Host compute buffer size =    12.01 MiB
llama_init_from_model: graph nodes  = 1266
llama_init_from_model: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: {%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:30889 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 4
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 4, n_tokens = 4, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 4, n_tokens = 4
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:73: CUDA error
ggml_cuda_compute_forward: MUL_MAT failed
CUDA error: unspecified launch failure
2025-04-19T11:35:24.947975Z  WARN llama_cpp_server::supervisor: crates\llama-cpp-server\src\supervisor.rs:164: Attempting to restart the llama-server...

Here is my nvidia-smi output with Tabby running:

[Screenshot of nvidia-smi output]

@zwpaper
Member

zwpaper commented May 7, 2025

The MUL_MAT failed error is likely due to an issue in upstream llama.cpp: ggml-org/llama.cpp#13252
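
One way to confirm that the crash comes from the CUDA path rather than the model itself (a diagnostic sketch only, reusing the binary and model paths from your log; adjust them for your machine) is to launch llama-server directly with GPU offload disabled via -ngl 0 and repeat the request:

```python
# Diagnostic sketch: re-run the llama-server command from the Tabby log, but
# with "-ngl 0" so that no layers are offloaded to the GPU. If the MUL_MAT
# crash does not reproduce on CPU, the problem is in the CUDA path.
import subprocess

cmd = [
    r"C:\Users\leo\Desktop\tabby_x86_64-windows-msvc-cuda124\llama-server.exe",
    "-m", r"C:\Users\leo\.tabby\models\TabbyML\Qwen2.5-Coder-3B\ggml\model-00001-of-00001.gguf",
    "--cont-batching",
    "--port", "30889",
    "-np", "1",
    "--ctx-size", "4096",
    "-ngl", "0",  # CPU-only inference
]
subprocess.run(cmd, check=True)
```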

@zwpaper
Member

zwpaper commented May 15, 2025

Hi @Lambda14, what is the model of your CPU? We have encountered a similar failure caused by a CPU lacking support for certain AVX instructions.
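
If it helps, here is a quick way to print the CPU model and the instruction-set flags it reports (a sketch assuming the third-party py-cpuinfo package, which is unrelated to Tabby):

```python
# Sketch: report the CPU model and whether AVX/AVX2 show up in its flags.
# Requires the third-party "py-cpuinfo" package: pip install py-cpuinfo
import cpuinfo

info = cpuinfo.get_cpu_info()
flags = set(info.get("flags", []))
print("CPU: ", info.get("brand_raw"))
print("AVX: ", "avx" in flags)
print("AVX2:", "avx2" in flags)
```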
