Checklist
1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
5. Please use English, otherwise it will be closed.
Describe the bug
On an NVIDIA H100 MIG instance running under the NVIDIA GPU Operator in Kubernetes, nvidia-smi does not report available memory; the NVML or DCGM APIs have to be used instead. In that case get_nvgpu_memory_capacity() crashes with the following error:
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 11, in <module>
    server_args = prepare_server_args(sys.argv[1:])
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 918, in prepare_server_args
    server_args = ServerArgs.from_cli_args(raw_args)
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 870, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
  File "<string>", line 92, in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 175, in __post_init__
    gpu_mem = get_nvgpu_memory_capacity()
  File "/sgl-workspace/sglang/python/sglang/srt/utils.py", line 955, in get_nvgpu_memory_capacity
    raise ValueError("No GPU memory values found.")
ValueError: No GPU memory values found.
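For reference, a minimal sketch of the NVML route suggested above, using the nvidia-ml-py (pynvml) bindings. This is not the existing sglang implementation, only an illustration; MIG slices may additionally require the MIG-specific NVML handles.

#!/usr/bin/env python3
# Minimal sketch of an NVML-based memory query, assuming the nvidia-ml-py
# (pynvml) package is installed. Not the sglang implementation.
import pynvml

def nvml_gpu_memory_capacity_mb() -> float:
    pynvml.nvmlInit()
    try:
        totals_mb = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            # nvmlDeviceGetMemoryInfo reports bytes; for MIG, per-instance
            # queries may need the MIG device handles instead of the parent GPU.
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            totals_mb.append(mem.total / (1024 * 1024))
        if not totals_mb:
            raise ValueError("No GPU memory values found.")
        return min(totals_mb)
    finally:
        pynvml.nvmlShutdown()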
Reproduction
python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-13b --served-model-name llava --tokenizer-path llava-hf/llava-1.5-13b-hf --chat-template vicuna_v1.1 --trust-remote-code --port 8000
Run this on an H100 MIG instance, or mock nvidia-smi so that it cannot report available memory.
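On a non-MIG machine, one hypothetical way to mock this (assuming get_nvgpu_memory_capacity() shells out to nvidia-smi for its memory query, as the traceback suggests) is to shadow nvidia-smi with a stub that reproduces the restricted output:

#!/usr/bin/env python3
# Hypothetical nvidia-smi stub: save as an executable file named "nvidia-smi"
# in a directory that precedes the real binary on PATH. Any memory query then
# returns the same output seen in the restricted GPU Operator environment.
print("[Insufficient Permissions]")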
Environment
Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]
CUDA available: True
GPU 0: NVIDIA H100 80GB HBM3 MIG 3g.40gb
GPU 0 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
CUDA Driver Version: 550.90.07
PyTorch: 2.5.1+cu124
flashinfer: 0.1.6+cu124torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.11.11
fastapi: 0.115.6
hf_transfer: 0.1.9
huggingface_hub: 0.27.1
interegular: 0.3.3
modelscope: 1.22.1
orjson: 3.10.14
packaging: 24.2
psutil: 6.1.1
pydantic: 2.10.5
multipart: 0.0.20
zmq: 26.2.0
uvicorn: 0.34.0
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.59.7
anthropic: 0.43.0
decord: 0.6.0
NVIDIA Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 48-95,144-191 1 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Hypervisor vendor: KVM
ulimit soft: 1048576
The root cause is this issue: NVIDIA/nvidia-container-toolkit#842
It requires the container toolkit to run with elevated privileges, which isn't feasible on multi-host services where multiple customer workloads might be on the same node.
Specifically, the output of that nvidia-smi command in such an environment is:
[Insufficient Permissions]
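Since CUDA itself initializes correctly in this environment ("CUDA available: True" above), one possible workaround sketch, assuming only the CUDA-enabled torch build that sglang already depends on, is to fall back to torch.cuda.mem_get_info() instead of shelling out to nvidia-smi. This is an illustration of the suggested direction, not a patch:

#!/usr/bin/env python3
# Hypothetical fallback that avoids nvidia-smi entirely; assumes a CUDA-enabled
# torch build. Illustration only, not the sglang implementation.
import torch

def gpu_memory_capacity_mb(device: int = 0) -> float:
    # mem_get_info returns (free_bytes, total_bytes) for the visible device,
    # which under MIG should be the MIG slice itself.
    _free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return total_bytes / (1024 * 1024)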