[Bug] def get_nvgpu_memory_capacity() causes crash on NVIDIA H100 MIG #2933

Open

dsingal0 opened this issue Jan 17, 2025 · 5 comments

@dsingal0

Checklist

1. I have searched related issues but cannot get the expected help.
2. The bug has not been fixed in the latest version.
3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose. Otherwise, it will be closed.
5. Please use English; otherwise, the issue will be closed.

Describe the bug

For an NVIDIA H100 MIG instance running under the NVIDIA GPU Operator in Kubernetes, nvidia-smi doesn't report available memory, so either the NVML or the DCGM API has to be used instead. In that case get_nvgpu_memory_capacity() crashes with the following error log:
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/sgl-workspace/sglang/python/sglang/launch_server.py", line 11, in <module>
    server_args = prepare_server_args(sys.argv[1:])
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 918, in prepare_server_args
    server_args = ServerArgs.from_cli_args(raw_args)
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 870, in from_cli_args
    return cls(**{attr: getattr(args, attr) for attr in attrs})
  File "<string>", line 92, in __init__
  File "/sgl-workspace/sglang/python/sglang/srt/server_args.py", line 175, in __post_init__
    gpu_mem = get_nvgpu_memory_capacity()
  File "/sgl-workspace/sglang/python/sglang/srt/utils.py", line 955, in get_nvgpu_memory_capacity
    raise ValueError("No GPU memory values found.")
ValueError: No GPU memory values found.
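
For reference, here is a minimal sketch of how the capacity could be read through NVML instead of parsing nvidia-smi output. It assumes the nvidia-ml-py (pynvml) package is available, and the helper name is just illustrative, not sglang's actual code:

import pynvml  # assumption: nvidia-ml-py (pynvml) is installed

def get_mig_memory_capacity_mib(gpu_index=0):
    """Return the total memory (MiB) of the first MIG slice, or of the whole GPU if MIG is off."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
        try:
            current_mode, _pending = pynvml.nvmlDeviceGetMigMode(handle)
        except pynvml.NVMLError:
            current_mode = pynvml.NVML_DEVICE_MIG_DISABLE  # MIG not supported on this GPU
        if current_mode == pynvml.NVML_DEVICE_MIG_ENABLE:
            # Memory info of the MIG device handle reflects the slice (e.g. 3g.40gb), not the full 80 GB.
            mig_handle = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(handle, 0)
            mem = pynvml.nvmlDeviceGetMemoryInfo(mig_handle)
        else:
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return mem.total / (1024 * 1024)
    finally:
        pynvml.nvmlShutdown()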

Reproduction

python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-13b --served-model-name llava --tokenizer-path llava-hf/llava-1.5-13b-hf --chat-template vicuna_v1.1 --trust-remote-code --port 8000
Run on an H100 MIG instance, or mock nvidia-smi being unable to report available memory.

Environment

Python: 3.10.16 (main, Dec 4 2024, 08:53:37) [GCC 9.4.0]

CUDA available: True

GPU 0: NVIDIA H100 80GB HBM3 MIG 3g.40gb

GPU 0 Compute Capability: 9.0

CUDA_HOME: /usr/local/cuda

NVCC: Cuda compilation tools, release 12.4, V12.4.131

CUDA Driver Version: 550.90.07

PyTorch: 2.5.1+cu124

flashinfer: 0.1.6+cu124torch2.4

triton: 3.1.0

transformers: 4.48.0

torchao: 0.7.0

numpy: 1.26.4

aiohttp: 3.11.11

fastapi: 0.115.6

hf_transfer: 0.1.9

huggingface_hub: 0.27.1

interegular: 0.3.3

modelscope: 1.22.1

orjson: 3.10.14

packaging: 24.2

psutil: 6.1.1

pydantic: 2.10.5

multipart: 0.0.20

zmq: 26.2.0

uvicorn: 0.34.0

uvloop: 0.21.0

vllm: 0.6.4.post1

openai: 1.59.7

anthropic: 0.43.0

decord: 0.6.0

NVIDIA Topology:

	GPU0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	X	48-95,144-191	1	N/A

Legend:

X = Self

SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)

NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node

PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)

PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)

PIX = Connection traversing at most a single PCIe bridge

NV# = Connection traversing a bonded set of # NVLinks

Hypervisor vendor: KVM

ulimit soft: 1048576

@zhaochenyang20
Collaborator

This error comes from your local environment. Could you ask the NVIDIA community for help?

@dsingal0
Author

It is this issue: NVIDIA/nvidia-container-toolkit#842, which requires the container toolkit to be run with elevated privileges. That isn't feasible for multi-tenant services where multiple customers' workloads may share the same node.
Specifically, the output of that nvidia-smi command in such an environment is:
[Insufficient Permissions]
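
For what it's worth, the CUDA runtime inside the container still sees the MIG slice as an ordinary device even when nvidia-smi cannot report memory, so a fallback along these lines works without elevated privileges (the helper name is just illustrative):

import torch  # torch.cuda.mem_get_info queries the CUDA runtime directly; no nvidia-smi needed

def get_visible_gpu_memory_mib(device=0):
    # For a 3g.40gb MIG slice this reports the slice's ~40 GB, not the parent GPU's 80 GB.
    _free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return total_bytes / (1024 * 1024)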

@zhaochenyang20
Collaborator

@zhyncs, who can help with this?

@tomheno

tomheno commented Feb 4, 2025

Same on H200 MIG

@tomheno

tomheno commented Feb 6, 2025

Opened a PR to provide a workaround; tested successfully on an H200.
