[Bug]: Unable to Run W4A16 GPTQ Quantized Models #19098
Tried executing; log output:
DEBUG 06-03 19:16:43 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.39.mlp.down_proj
INFO 06-03 19:16:43 [backends.py:37] Using InductorAdaptor
DEBUG 06-03 19:16:43 [config.py:4632] enabled custom ops: Counter()
DEBUG 06-03 19:16:43 [config.py:4634] disabled custom ops: Counter({'rms_norm': 131, 'silu_and_mul': 40, 'gelu_and_mul': 1, 'rotary_embedding': 1})
DEBUG 06-03 19:16:43 [config.py:4632] enabled custom ops: Counter()
DEBUG 06-03 19:16:43 [config.py:4634] disabled custom ops: Counter({'rms_norm': 131, 'silu_and_mul': 40, 'gelu_and_mul': 1, 'rotary_embedding': 1})
INFO 06-03 19:16:43 [weight_utils.py:291] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 2.58it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.51it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.29it/s]
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight lm_head.weight with shape torch.Size([131072, 5120])
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight multi_modal_projector.linear_1.weight with shape torch.Size([5120, 1024])
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight multi_modal_projector.linear_2.weight with shape torch.Size([5120, 5120])
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight multi_modal_projector.norm.weight with shape torch.Size([1024])
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight multi_modal_projector.patch_merger.merging_layer.weight with shape torch.Size([1024, 4096])
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.35it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.41it/s]
INFO 06-03 19:16:46 [default_loader.py:271] Loading weights took 2.94 seconds
INFO 06-03 19:16:47 [gpu_model_runner.py:1593] Model loading took 14.0463 GiB and 3.847190 seconds
INFO 06-03 19:16:47 [gpu_model_runner.py:1913] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 image items of the maximum feature size.
...
INFO: Started server process [3228240]
INFO: Waiting for application startup.
INFO:     Application startup complete.
Could you rerun with
This error occurs because the official vLLM Docker images (up to v0.8.5) do not include CUDA architectures for new GPUs (like the RTX 5090, compute capability sm_120). vLLM 0.9.0 adds support, but until an official image is released you must build the Docker image yourself with the correct CUDA arch flags. Specifically, set torch_cuda_arch_list to include the new architecture.
To resolve, build the Docker image with the appropriate build arguments. Example build command:
DOCKER_BUILDKIT=1 sudo docker build . --target vllm-openai \
    --tag myvllm --file docker/Dockerfile \
    --build-arg max_jobs=4 \
    --build-arg nvcc_threads=1 \
    --build-arg torch_cuda_arch_list="12.0 12.1" \
    --build-arg RUN_WHEEL_CHECK=false
Then run the container as usual. For more details and troubleshooting, see the discussion in vLLM issue #16901 and vLLM issue #17739. Alternatively, you can try compiling the kernels yourself directly.
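As a quick sanity check before or after the rebuild, the following sketch uses standard PyTorch APIs to compare the GPU's compute capability against the architectures the installed build was compiled for. This treats the bundled torch build's arch list as a rough proxy for the vLLM kernels, which is an assumption, not a definitive test:

# Sketch: check whether the installed build was compiled for this GPU.
# Assumes a CUDA-enabled PyTorch install; the torch arch list is used as a
# rough proxy for the vLLM CUDA extensions built with the same arch flags.
import torch

major, minor = torch.cuda.get_device_capability(0)  # e.g. (12, 0) on an RTX 5090
compiled = torch.cuda.get_arch_list()                # e.g. ['sm_80', 'sm_90', ...]

print(f"Device compute capability: sm_{major}{minor}")
print(f"Architectures in this build: {compiled}")

if f"sm_{major}{minor}" not in compiled:
    print(f"This build lacks sm_{major}{minor}; rebuild with "
          f"torch_cuda_arch_list including {major}.{minor} (see the build flags above).")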
Your current environment
The output of python collect_env.py
🐛 Describe the bug
I've been attempting to run vLLM with a GPTQ quantized model on an RTX 5090 Laptop GPU and have been running into the following stacktrace. I believe the quantization is causing the issue: I've tried multiple other GPTQ quantized models and hit the same failure, while unquantized models load properly. If anyone else has faced this issue and has any suggestions, any help would be greatly appreciated.
Command used:
vllm serve ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g --max-model-len 8192 --max-seq-len 8192
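For a smaller reproduction without the HTTP server, the same load path can be exercised through vLLM's offline LLM API. This is a sketch assuming the defaults used by the serve command above (same model and max_model_len, no extra tokenizer or quantization flags):

# Sketch: reproduce the model load directly, mirroring the serve command above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g",
    max_model_len=8192,
)
outputs = llm.generate(
    ["Say hello in one sentence."],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)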
Stacktrace: