CUDA Memory Allocation Failure and mlock Memory Lock Issue in llama-cpp-python #1944

caiyuanhangDicp opened this issue Feb 24, 2025 · 0 comments


I am experiencing issues while trying to launch the deepseek-v3 671B model (Q2_K_L quantization) on 4 x A100 (80 GB) GPUs. The model fails to load, and I receive the following errors:

  1. CUDA Memory Allocation Failure:

    ggml_backend_cuda_buffer_type_alloc_buffer: allocating 204800.00 MiB on device 0: cudaMalloc failed: out of memory
    llama_kv_cache_init: failed to allocate buffer for kv cache
    llama_init_from_model: llama_kv_cache_init() failed for self-attention cache
    

    Despite each A100 having 80 GB of memory (320 GB across the 4 GPUs), the model fails to allocate the required buffers; the KV cache alone requests 204800 MiB on device 0 (see the sizing sketch after this list).

  2. Failed to Create Llama Context:

    ValueError: Failed to create llama_context
    

    Because the KV-cache allocation fails, context initialization aborts and the llama_context is never created.
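
For context on the size of that failed allocation, here is a rough back-of-envelope sketch (my own estimate, not output from this build) of how an f16 KV cache grows with context length. The layer count and per-layer K/V width below are illustrative assumptions, not the exact DeepSeek-V3 dimensions:

```python
# Rough f16 KV-cache estimate. All numbers are illustrative assumptions,
# not values read from this llama.cpp build or the GGUF metadata.
def kv_cache_mib(n_ctx: int, n_layer: int, kv_width: int, bytes_per_elem: int = 2) -> float:
    """kv_width = K width + V width per token per layer, in elements."""
    return n_ctx * n_layer * kv_width * bytes_per_elem / 2**20

# With a ~160k-token context the cache lands in the hundreds of GiB,
# which is consistent with a 204800 MiB request on a single device.
print(kv_cache_mib(n_ctx=163_840, n_layer=61, kv_width=40_960))  # 780800.0 MiB (~763 GiB)
```

If the launcher defaults to the model's full context window rather than a reduced one, an allocation of this order is expected regardless of how much free VRAM nvidia-smi reports, because the KV cache scales linearly with n_ctx.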

Hardware and Environment:

  • Model: deepseek-v3 671B Q2_K_L quantized version
  • GPUs: 4 x A100 80GB
  • CUDA Version: 12.2
  • System Memory: 503 GiB
  • Python Version: 3.11

What I Have Tried:

  • Ensured that all GPUs have enough memory available using nvidia-smi.
  • Reduced the batch size and used the quantized model to minimize memory usage.
  • Checked for any running processes that could occupy GPU memory and killed unnecessary processes.
  • Verified that the system has sufficient available memory and swap space.

Request:

Please advise on any additional configuration or memory optimizations that could resolve this, or whether there are known compatibility problems with very large models like deepseek-v3 across multiple A100 GPUs.
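
For reference, this is the kind of minimal direct llama-cpp-python launch I would compare against (a sketch only: the model path is hypothetical and the values are illustrative, not the exact parameters Xinference passes). The memory-related knobs are a much smaller n_ctx, layer-wise splitting across the four GPUs, and use_mlock=False so the loader does not try to pin the whole mapped model in host RAM:

```python
# Minimal llama-cpp-python launch sketch; path and values are illustrative.
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="/models/deepseek-v3-q2_k_l.gguf",   # hypothetical local GGUF path
    n_ctx=8192,                                      # far below the full context window
    n_gpu_layers=-1,                                 # offload every layer that fits
    split_mode=llama_cpp.LLAMA_SPLIT_MODE_LAYER,     # spread layers across GPUs
    tensor_split=[0.25, 0.25, 0.25, 0.25],           # even split across the 4 A100s
    use_mlock=False,                                 # do not mlock() the mapped weights
)
print(llm("Hello", max_tokens=16)["choices"][0]["text"])
```

If a direct launch like this succeeds with a small n_ctx but fails as n_ctx grows, the problem is KV-cache sizing rather than the GPU split or llama-cpp-python itself.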

```
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 204800.00 MiB on device 0: cudaMalloc failed: out of memory
llama_kv_cache_init: failed to allocate buffer for kv cache
llama_init_from_model: llama_kv_cache_init() failed for self-attention cache
2025-02-24 15:41:42,552 xinference.core.worker 711329 ERROR    Failed to load model deepseek-v3-0
Traceback (most recent call last):
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/model.py", line 464, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/model/llm/llama_cpp/core.py", line 144, in load
    self._llm = Llama(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/llama.py", line 393, in __init__
    internals.LlamaContext(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/_internals.py", line 255, in __init__
    raise ValueError("Failed to create llama_context")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
2025-02-24 15:41:43,116 xinference.core.worker 711329 ERROR    [request afd74ed2-f282-11ef-8afd-6cb3117bb150] Leave launch_builtin_model, error: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context, elapsed time: 44 s
Traceback (most recent call last):
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/utils.py", line 93, in wrapped
    ret = await func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/model.py", line 464, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/model/llm/llama_cpp/core.py", line 144, in load
    self._llm = Llama(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/llama.py", line 393, in __init__
    internals.LlamaContext(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/_internals.py", line 255, in __init__
    raise ValueError("Failed to create llama_context")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
2025-02-24 15:41:43,133 xinference.api.restful_api 711193 ERROR    [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
Traceback (most recent call last):
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/api/restful_api.py", line 1002, in launch_model
    model_uid = await (await self._get_supervisor_ref()).launch_builtin_model(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1190, in launch_builtin_model
    await _launch_model()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1125, in _launch_model
    subpool_address = await _launch_one_model(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/supervisor.py", line 1083, in _launch_one_model
    subpool_address = await worker_ref.launch_builtin_model(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/utils.py", line 93, in wrapped
    ret = await func(*args, **kwargs)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/worker.py", line 926, in launch_builtin_model
    await model_ref.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 231, in send
    return self._process_result_message(result)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/context.py", line 102, in _process_result_message
    raise message.as_instanceof_cause()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 667, in send
    result = await self._run_coro(message.message_id, coro)
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/backends/pool.py", line 370, in _run_coro
    return await coro
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xoscar/api.py", line 384, in __on_receive__
    return await super().__on_receive__(message)  # type: ignore
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 558, in __on_receive__
    raise ex
  File "xoscar/core.pyx", line 520, in xoscar.core._BaseActor.__on_receive__
    async with self._lock:
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 521, in xoscar.core._BaseActor.__on_receive__
    with debug_async_timeout('actor_lock_timeout',
    ^^^^^^^^^^^^^^^^^
  File "xoscar/core.pyx", line 526, in xoscar.core._BaseActor.__on_receive__
    result = await result
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/core/model.py", line 464, in load
    self._model.load()
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/xinference/model/llm/llama_cpp/core.py", line 144, in load
    self._llm = Llama(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/llama.py", line 393, in __init__
    internals.LlamaContext(
    ^^^^^^^^^^^^^^^^^
  File "/home/rootroot/anaconda3/envs/Xinference/lib/python3.11/site-packages/llama_cpp/_internals.py", line 255, in __init__
    raise ValueError("Failed to create llama_context")
    ^^^^^^^^^^^^^^^^^
ValueError: [address=0.0.0.0:33277, pid=711347] Failed to create llama_context
```