Closed as not planned
Description
Your current environment
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31
Python version: 3.11.9 (main, Apr 6 2024, 17:59:24) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3
Nvidia driver version: 535.129.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 57 bits virtual
CPU(s): 176
On-line CPU(s) list: 0-175
Thread(s) per core: 2
Core(s) per socket: 44
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 143
Model name: Intel(R) Xeon(R) Platinum 8468V
Stepping: 8
CPU MHz: 2400.000
BogoMIPS: 4800.00
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2.8 MiB
L1i cache: 2.8 MiB
L2 cache: 352 MiB
L3 cache: 32 MiB
NUMA node0 CPU(s): 0-87
NUMA node1 CPU(s): 88-175
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b fsrm md_clear serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.42.3
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS PHB PHB PHB PHB 0-87 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS PHB PHB PHB PHB 0-87 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS PHB PHB PHB PHB 0-87 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 SYS SYS SYS SYS SYS PHB PHB PHB PHB 0-87 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 PHB PHB PHB PHB SYS SYS SYS SYS SYS 88-175 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 PHB PHB PHB PHB SYS SYS SYS SYS SYS 88-175 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 PHB PHB PHB PHB SYS SYS SYS SYS SYS 88-175 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X PHB PHB PHB PHB SYS SYS SYS SYS SYS 88-175 1 N/A
NIC0 SYS SYS SYS SYS PHB PHB PHB PHB X PHB PHB PHB SYS SYS SYS SYS SYS
NIC1 SYS SYS SYS SYS PHB PHB PHB PHB PHB X PHB PHB SYS SYS SYS SYS SYS
NIC2 SYS SYS SYS SYS PHB PHB PHB PHB PHB PHB X PHB SYS SYS SYS SYS SYS
NIC3 SYS SYS SYS SYS PHB PHB PHB PHB PHB PHB PHB X SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS
NIC5 PHB PHB PHB PHB SYS SYS SYS SYS SYS SYS SYS SYS SYS X PHB PHB PHB
NIC6 PHB PHB PHB PHB SYS SYS SYS SYS SYS SYS SYS SYS SYS PHB X PHB PHB
NIC7 PHB PHB PHB PHB SYS SYS SYS SYS SYS SYS SYS SYS SYS PHB PHB X PHB
NIC8 PHB PHB PHB PHB SYS SYS SYS SYS SYS SYS SYS SYS SYS PHB PHB PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
🐛 Describe the bug
I'm running the examples/offline_inference.py script with Mixtral 8x7B and FP8 quantization (tp=2), i.e., the following script:
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    quantization="fp8",
)

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
I'm on main on the latest commit as of now (47f0954).
I get the following error. I only get it if I enable FP8 quantization (otherwise, the script runs fine).
Failed: Cuda error /mnt/workdisk/ferdiko/exploratory/vllm/csrc/custom_all_reduce.cuh:330 'an illegal memory access was encountered'
[rank1]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
Failed: Cuda error /mnt/workdisk/ferdiko/exploratory/vllm/csrc/custom_all_reduce.cuh:330 'an illegal memory access was encountered'
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
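Since the crash is reported from csrc/custom_all_reduce.cuh, one way to narrow it down is to rerun the same script with the custom all-reduce kernel disabled so that tensor-parallel all-reduce falls back to NCCL. This is just a hedged diagnostic sketch, assuming the disable_custom_all_reduce engine argument is forwarded by LLM() in this version; if the illegal memory access goes away, the problem is probably isolated to the custom all-reduce path rather than the FP8 kernels themselves:

from vllm import LLM, SamplingParams

# Diagnostic variant of the repro above: same model and tp=2, but with the
# custom all-reduce kernel turned off so NCCL handles the all-reduce.
# Assumption: disable_custom_all_reduce is accepted by this vLLM build.
prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(
    model="mistralai/Mixtral-8x7B-Instruct-v0.1",
    tensor_parallel_size=2,
    quantization="fp8",
    disable_custom_all_reduce=True,  # assumption: supported in this version
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")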
If I run Llama3-8B, the script runs fine even with FP8 quantization. However, I still see the following warning:
[rank0]:[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
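The leaked shared-memory warning at shutdown looks like a cleanup-ordering issue rather than part of the crash. A minimal sketch of explicit teardown at the end of the script, assuming that releasing the engine before interpreter exit lets the shared CUDA/IPC buffers be freed (this may or may not silence the warning):

import gc
import torch

# Drop the engine explicitly so worker processes and their shared CUDA/IPC
# buffers are released before module teardown begins. This is an assumption
# about cleanup ordering, not a documented vLLM shutdown API.
del llm
gc.collect()
torch.cuda.empty_cache()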