[Bug]: Unable to Run W4A16 GPTQ Quantized Models #19098

Closed
mchambrec opened this issue Jun 3, 2025 · 2 comments

Labels
bug Something isn't working

Comments

@mchambrec

Your current environment

The output of python collect_env.py
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.2 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.7.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Feb  4 2025, 14:48:35) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-6.11.0-26-generic-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : GPU 0: NVIDIA GeForce RTX 5090 Laptop GPU
Nvidia driver version        : 570.148.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        42 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               24
On-line CPU(s) list:                  0-23
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Core(TM) Ultra 9 275HX
CPU family:                           6
Model:                                198
Thread(s) per core:                   1
Core(s) per socket:                   24
Socket(s):                            1
Stepping:                             2
CPU(s) scaling MHz:                   26%
CPU max MHz:                          6900.0000
CPU min MHz:                          800.0000
BogoMIPS:                             6144.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_ppin ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdt_a rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni lam wbnoinvd dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq rdpid bus_lock_detect movdiri movdir64b fsrm md_clear serialize arch_lbr ibt flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            768 KiB (20 instances)
L1i cache:                            1.3 MiB (20 instances)
L2 cache:                             40 MiB (12 instances)
L3 cache:                             36 MiB (1 instance)
NUMA node(s):                         1
NUMA node0 CPU(s):                    0-23
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.3.14
[pip3] nvidia-cuda-cupti-cu12==12.8.57
[pip3] nvidia-cuda-nvrtc-cu12==12.8.61
[pip3] nvidia-cuda-runtime-cu12==12.8.57
[pip3] nvidia-cudnn-cu12==9.7.1.26
[pip3] nvidia-cufft-cu12==11.3.3.41
[pip3] nvidia-cufile-cu12==1.13.0.11
[pip3] nvidia-curand-cu12==10.3.9.55
[pip3] nvidia-cusolver-cu12==11.7.2.55
[pip3] nvidia-cusparse-cu12==12.5.7.53
[pip3] nvidia-cusparselt-cu12==0.6.3
[pip3] nvidia-nccl-cu12==2.26.2
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.8.55
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0+cu128
[pip3] torchaudio==2.7.0+cu128
[pip3] torchvision==0.22.0+cu128
[pip3] transformers==4.52.4
[pip3] triton==3.3.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.9.0.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-23    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

I've been attempting to run vLLM with a GPTQ-quantized model on an RTX 5090 Laptop GPU and keep hitting the stack trace below. I believe the quantization is the cause: several other GPTQ-quantized models fail with the same error, while unquantized models load properly. If anyone else has run into this and has suggestions, any help would be greatly appreciated.
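
In case it's useful, the GPU's compute capability and the CUDA architectures the installed PyTorch wheel was built with can be checked with something like:

python -c "import torch; print(torch.cuda.get_device_capability(0)); print(torch.cuda.get_arch_list())"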

Command used: vllm serve ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g --max-model-len 8192 --max-seq-len 8192

Stacktrace:

DEBUG 06-03 13:24:40 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 06-03 13:24:40 [__init__.py:34] Checking if TPU platform is available.
DEBUG 06-03 13:24:40 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 06-03 13:24:40 [__init__.py:51] Checking if CUDA platform is available.
DEBUG 06-03 13:24:40 [__init__.py:71] Confirmed CUDA platform is available.
DEBUG 06-03 13:24:40 [__init__.py:99] Checking if ROCm platform is available.
DEBUG 06-03 13:24:40 [__init__.py:113] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 06-03 13:24:40 [__init__.py:120] Checking if HPU platform is available.
DEBUG 06-03 13:24:40 [__init__.py:127] HPU platform is not available because habana_frameworks is not found.
DEBUG 06-03 13:24:40 [__init__.py:137] Checking if XPU platform is available.
DEBUG 06-03 13:24:40 [__init__.py:147] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 06-03 13:24:40 [__init__.py:154] Checking if CPU platform is available.
DEBUG 06-03 13:24:40 [__init__.py:176] Checking if Neuron platform is available.
DEBUG 06-03 13:24:40 [__init__.py:51] Checking if CUDA platform is available.
DEBUG 06-03 13:24:40 [__init__.py:71] Confirmed CUDA platform is available.
INFO 06-03 13:24:40 [__init__.py:243] Automatically detected platform cuda.
DEBUG 06-03 13:24:41 [utils.py:143] Setting VLLM_WORKER_MULTIPROC_METHOD to 'spawn'
INFO 06-03 13:24:41 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-03 13:24:41 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-03 13:24:41 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-03 13:24:42 [api_server.py:1289] vLLM API server version 0.9.0.1
INFO 06-03 13:24:42 [cli_args.py:300] non-default args: {'max_model_len': 8192}
INFO 06-03 13:24:49 [config.py:793] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
DEBUG 06-03 13:24:49 [arg_utils.py:1541] Setting max_num_batched_tokens to 2048 for OPENAI_API_SERVER usage context.
DEBUG 06-03 13:24:49 [arg_utils.py:1548] Setting max_num_seqs to 256 for OPENAI_API_SERVER usage context.
INFO 06-03 13:24:49 [config.py:2118] Chunked prefill is enabled with max_num_batched_tokens=2048.
DEBUG 06-03 13:24:52 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 06-03 13:24:52 [__init__.py:34] Checking if TPU platform is available.
DEBUG 06-03 13:24:52 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 06-03 13:24:52 [__init__.py:51] Checking if CUDA platform is available.
DEBUG 06-03 13:24:52 [__init__.py:71] Confirmed CUDA platform is available.
DEBUG 06-03 13:24:52 [__init__.py:99] Checking if ROCm platform is available.
DEBUG 06-03 13:24:52 [__init__.py:113] ROCm platform is not available because: No module named 'amdsmi'
DEBUG 06-03 13:24:52 [__init__.py:120] Checking if HPU platform is available.
DEBUG 06-03 13:24:52 [__init__.py:127] HPU platform is not available because habana_frameworks is not found.
DEBUG 06-03 13:24:52 [__init__.py:137] Checking if XPU platform is available.
DEBUG 06-03 13:24:52 [__init__.py:147] XPU platform is not available because: No module named 'intel_extension_for_pytorch'
DEBUG 06-03 13:24:52 [__init__.py:154] Checking if CPU platform is available.
DEBUG 06-03 13:24:52 [__init__.py:176] Checking if Neuron platform is available.
DEBUG 06-03 13:24:52 [__init__.py:51] Checking if CUDA platform is available.
DEBUG 06-03 13:24:52 [__init__.py:71] Confirmed CUDA platform is available.
INFO 06-03 13:24:52 [__init__.py:243] Automatically detected platform cuda.
INFO 06-03 13:24:53 [core.py:438] Waiting for init message from front-end.
DEBUG 06-03 13:24:53 [core_client.py:540] HELLO from local core engine process 0.
DEBUG 06-03 13:24:53 [core.py:445] Received init message: {'output_socket_address': 'ipc:///tmp/665e34c5-4131-4772-bd1e-24942fa89542', 'parallel_config': {'data_parallel_master_ip': '127.0.0.1', 'data_parallel_master_port': 0, 'data_parallel_size': 1}}
INFO 06-03 13:24:53 [__init__.py:31] Available plugins for group vllm.general_plugins:
INFO 06-03 13:24:53 [__init__.py:33] - lora_filesystem_resolver -> vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver
INFO 06-03 13:24:53 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.
INFO 06-03 13:24:53 [core.py:65] Initializing a V1 LLM engine (v0.9.0.1) with config: model='ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g', speculative_config=None, tokenizer='ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level": 3, "custom_ops": ["none"], "splitting_ops": ["vllm.unified_attention", "vllm.unified_attention_with_output"], "compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "use_cudagraph": true, "cudagraph_num_of_warmups": 1, "cudagraph_capture_sizes": [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 512}
WARNING 06-03 13:24:53 [logger.py:203] VLLM_TRACE_FUNCTION is enabled. It will record every function executed by Python. This will slow down the code. It is suggested to be used for debugging hang or crashes only.
INFO 06-03 13:24:53 [logger.py:207] Trace frame log is saved to /tmp/matt/vllm/vllm-instance-cdd63/VLLM_TRACE_FUNCTION_for_process_43396_thread_136052599722112_at_2025-06-03_13:24:53.850846.log
DEBUG 06-03 13:24:53 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama.LlamaModel'>: ['input_ids', 'positions', 'intermediate_tensors', 'inputs_embeds']
DEBUG 06-03 13:24:53 [decorators.py:109] Inferred dynamic dimensions for forward method of <class 'vllm.model_executor.models.llama_eagle3.LlamaModel'>: ['input_ids', 'positions', 'hidden_states']
WARNING 06-03 13:24:54 [utils.py:2671] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7bbb8ebeae40>
DEBUG 06-03 13:24:54 [config.py:4531] enabled custom ops: Counter()
DEBUG 06-03 13:24:54 [config.py:4533] disabled custom ops: Counter()
DEBUG 06-03 13:24:55 [parallel_state.py:917] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.2.0.19:59925 backend=nccl
INFO 06-03 13:24:55 [parallel_state.py:1064] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
WARNING 06-03 13:24:58 [topk_topp_sampler.py:58] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
DEBUG 06-03 13:24:58 [config.py:4531] enabled custom ops: Counter()
DEBUG 06-03 13:24:58 [config.py:4533] disabled custom ops: Counter()
INFO 06-03 13:24:58 [gpu_model_runner.py:1531] Starting to load model ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g...
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.self_attn.qkv_proj
INFO 06-03 13:24:58 [compressed_tensors_wNa16.py:94] Using MacheteLinearKernel for CompressedTensorsWNA16
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.self_attn.o_proj
INFO 06-03 13:24:58 [cuda.py:217] Using Flash Attention backend on V1 engine.
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.0.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.1.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.1.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.1.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.1.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.2.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.2.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.2.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.2.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.3.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.3.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.3.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.3.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.4.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.4.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.4.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.4.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.5.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.5.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.5.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.5.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.6.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.6.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.6.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.6.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.7.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.7.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.7.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.7.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.8.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.8.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.8.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.8.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.9.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.9.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.9.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.9.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.10.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.10.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.10.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.10.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.11.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.11.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.11.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.11.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.12.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.12.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.12.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.12.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.13.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.13.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.13.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.13.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.14.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.14.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.14.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.14.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.15.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.15.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.15.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.15.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.16.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.16.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.16.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.16.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.17.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.17.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.17.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.17.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.18.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.18.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.18.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.18.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.19.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.19.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.19.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.19.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.20.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.20.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.20.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.20.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.21.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.21.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.21.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.21.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.22.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.22.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.22.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.22.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.23.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.23.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.23.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.23.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.24.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.24.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.24.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.24.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.25.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.25.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.25.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.25.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.26.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.26.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.26.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.26.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.27.self_attn.qkv_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.27.self_attn.o_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.27.mlp.gate_up_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.27.mlp.down_proj
DEBUG 06-03 13:24:58 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.28.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.28.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.28.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.28.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.29.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.29.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.29.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.29.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.30.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.30.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.30.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.30.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.31.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.31.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.31.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.31.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.32.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.32.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.32.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.32.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.33.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.33.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.33.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.33.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.34.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.34.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.34.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.34.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.35.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.35.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.35.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.35.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.36.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.36.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.36.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.36.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.37.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.37.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.37.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.37.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.38.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.38.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.38.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.38.mlp.down_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.39.self_attn.qkv_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.39.self_attn.o_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.39.mlp.gate_up_proj
DEBUG 06-03 13:24:59 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.39.mlp.down_proj
INFO 06-03 13:24:59 [backends.py:35] Using InductorAdaptor
DEBUG 06-03 13:24:59 [config.py:4531] enabled custom ops: Counter()
DEBUG 06-03 13:24:59 [config.py:4533] disabled custom ops: Counter({'rms_norm': 131, 'silu_and_mul': 40, 'gelu_and_mul': 1, 'rotary_embedding': 1})
DEBUG 06-03 13:24:59 [config.py:4531] enabled custom ops: Counter()
DEBUG 06-03 13:24:59 [config.py:4533] disabled custom ops: Counter({'rms_norm': 131, 'silu_and_mul': 40, 'gelu_and_mul': 1, 'rotary_embedding': 1})
INFO 06-03 13:25:00 [weight_utils.py:291] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:02,  1.42it/s]
DEBUG 06-03 13:25:00 [utils.py:169] Loaded weight lm_head.weight with shape torch.Size([131072, 5120])
DEBUG 06-03 13:25:01 [utils.py:169] Loaded weight multi_modal_projector.linear_1.weight with shape torch.Size([5120, 1024])
DEBUG 06-03 13:25:01 [utils.py:169] Loaded weight multi_modal_projector.linear_2.weight with shape torch.Size([5120, 5120])
DEBUG 06-03 13:25:01 [utils.py:169] Loaded weight multi_modal_projector.norm.weight with shape torch.Size([1024])
DEBUG 06-03 13:25:01 [utils.py:169] Loaded weight multi_modal_projector.patch_merger.merging_layer.weight with shape torch.Size([1024, 4096])
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.64it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.90it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.72it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.71it/s]

INFO 06-03 13:25:02 [default_loader.py:280] Loading weights took 2.43 seconds
ERROR 06-03 13:25:02 [core.py:500] EngineCore failed to start.
ERROR 06-03 13:25:02 [core.py:500] Traceback (most recent call last):
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 491, in run_engine_core
ERROR 06-03 13:25:02 [core.py:500]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 06-03 13:25:02 [core.py:500]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 390, in __init__
ERROR 06-03 13:25:02 [core.py:500]     super().__init__(vllm_config, executor_class, log_stats,
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 71, in __init__
ERROR 06-03 13:25:02 [core.py:500]     self.model_executor = executor_class(vllm_config)
ERROR 06-03 13:25:02 [core.py:500]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 06-03 13:25:02 [core.py:500]     self._init_executor()
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 06-03 13:25:02 [core.py:500]     self.collective_rpc("load_model")
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 06-03 13:25:02 [core.py:500]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 06-03 13:25:02 [core.py:500]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
ERROR 06-03 13:25:02 [core.py:500]     return func(*args, **kwargs)
ERROR 06-03 13:25:02 [core.py:500]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 164, in load_model
ERROR 06-03 13:25:02 [core.py:500]     self.model_runner.load_model()
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1534, in load_model
ERROR 06-03 13:25:02 [core.py:500]     self.model = get_model(vllm_config=self.vllm_config)
ERROR 06-03 13:25:02 [core.py:500]                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 58, in get_model
ERROR 06-03 13:25:02 [core.py:500]     return loader.load_model(vllm_config=vllm_config,
ERROR 06-03 13:25:02 [core.py:500]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 293, in load_model
ERROR 06-03 13:25:02 [core.py:500]     process_weights_after_loading(model, model_config, target_device)
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 108, in process_weights_after_loading
ERROR 06-03 13:25:02 [core.py:500]     quant_method.process_weights_after_loading(module)
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 564, in process_weights_after_loading
ERROR 06-03 13:25:02 [core.py:500]     layer.scheme.process_weights_after_loading(layer)
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py", line 196, in process_weights_after_loading
ERROR 06-03 13:25:02 [core.py:500]     self.kernel.process_weights_after_loading(layer)
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py", line 94, in process_weights_after_loading
ERROR 06-03 13:25:02 [core.py:500]     self._transform_param(layer, self.w_s_name, transform_w_s)
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kernels/mixed_precision/MPLinearKernel.py", line 70, in _transform_param
ERROR 06-03 13:25:02 [core.py:500]     new_param = fn(old_param)
ERROR 06-03 13:25:02 [core.py:500]                 ^^^^^^^^^^^^^
ERROR 06-03 13:25:02 [core.py:500]   File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py", line 89, in transform_w_s
ERROR 06-03 13:25:02 [core.py:500]     x.data = x.data.contiguous()
ERROR 06-03 13:25:02 [core.py:500]              ^^^^^^^^^^^^^^^^^^^
ERROR 06-03 13:25:02 [core.py:500] RuntimeError: CUDA error: no kernel image is available for execution on the device
ERROR 06-03 13:25:02 [core.py:500] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
ERROR 06-03 13:25:02 [core.py:500] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
ERROR 06-03 13:25:02 [core.py:500] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
ERROR 06-03 13:25:02 [core.py:500]
Process EngineCore_0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 504, in run_engine_core
    raise e
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 491, in run_engine_core
    engine_core = EngineCoreProc(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 390, in __init__
    super().__init__(vllm_config, executor_class, log_stats,
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 71, in __init__
    self.model_executor = executor_class(vllm_config)
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
    self._init_executor()
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
    self.collective_rpc("load_model")
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
    answer = run_method(self.driver_worker, method, args, kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/utils.py", line 2605, in run_method
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 164, in load_model
    self.model_runner.load_model()
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 1534, in load_model
    self.model = get_model(vllm_config=self.vllm_config)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/model_loader/__init__.py", line 58, in get_model
    return loader.load_model(vllm_config=vllm_config,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/model_loader/default_loader.py", line 293, in load_model
    process_weights_after_loading(model, model_config, target_device)
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/model_loader/utils.py", line 108, in process_weights_after_loading
    quant_method.process_weights_after_loading(module)
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors.py", line 564, in process_weights_after_loading
    layer.scheme.process_weights_after_loading(layer)
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/compressed_tensors/schemes/compressed_tensors_wNa16.py", line 196, in process_weights_after_loading
    self.kernel.process_weights_after_loading(layer)
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py", line 94, in process_weights_after_loading
    self._transform_param(layer, self.w_s_name, transform_w_s)
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kernels/mixed_precision/MPLinearKernel.py", line 70, in _transform_param
    new_param = fn(old_param)
                ^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/model_executor/layers/quantization/kernels/mixed_precision/machete.py", line 89, in transform_w_s
    x.data = x.data.contiguous()
             ^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

[rank0]:[W603 13:25:03.483425615 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
Traceback (most recent call last):
  File "/home/matt/Desktop/temp-env-5/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 56, in main
    args.dispatch_function(args)
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 42, in cmd
    uvloop.run(run_server(args))
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1324, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 153, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 185, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 157, in from_vllm_config
    return cls(
           ^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 123, in __init__
    self.engine_core = core_client_class(
                       ^^^^^^^^^^^^^^^^^^
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 734, in __init__
    super().__init__(
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 418, in __init__
    self._wait_for_engine_startup(output_address, parallel_config)
  File "/home/matt/Desktop/temp-env-5/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 484, in _wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

@mchambrec mchambrec added the bug Something isn't working label Jun 3, 2025
@yewentao256

I tried executing vllm serve ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g --max-model-len 8192 --max-seq-len 8192 on an H100 and didn't hit your problem.

DEBUG 06-03 19:16:43 [compressed_tensors.py:473] Using scheme: CompressedTensorsWNA16 for language_model.model.layers.39.mlp.down_proj
INFO 06-03 19:16:43 [backends.py:37] Using InductorAdaptor
DEBUG 06-03 19:16:43 [config.py:4632] enabled custom ops: Counter()
DEBUG 06-03 19:16:43 [config.py:4634] disabled custom ops: Counter({'rms_norm': 131, 'silu_and_mul': 40, 'gelu_and_mul': 1, 'rotary_embedding': 1})
DEBUG 06-03 19:16:43 [config.py:4632] enabled custom ops: Counter()
DEBUG 06-03 19:16:43 [config.py:4634] disabled custom ops: Counter({'rms_norm': 131, 'silu_and_mul': 40, 'gelu_and_mul': 1, 'rotary_embedding': 1})
INFO 06-03 19:16:43 [weight_utils.py:291] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:01,  2.58it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.51it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.29it/s]
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight lm_head.weight with shape torch.Size([131072, 5120])
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight multi_modal_projector.linear_1.weight with shape torch.Size([5120, 1024])
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight multi_modal_projector.linear_2.weight with shape torch.Size([5120, 5120])
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight multi_modal_projector.norm.weight with shape torch.Size([1024])
DEBUG 06-03 19:16:46 [utils.py:169] Loaded weight multi_modal_projector.patch_merger.merging_layer.weight with shape torch.Size([1024, 4096])
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.35it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.41it/s]

INFO 06-03 19:16:46 [default_loader.py:271] Loading weights took 2.94 seconds
INFO 06-03 19:16:47 [gpu_model_runner.py:1593] Model loading took 14.0463 GiB and 3.847190 seconds
INFO 06-03 19:16:47 [gpu_model_runner.py:1913] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 image items of the maximum feature size.
...
INFO:     Started server process [3228240]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

Could you rerun with CUDA_LAUNCH_BLOCKING=1?
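
Something like the following (your original command, just with the env var set) should make the failing kernel launch report synchronously:

CUDA_LAUNCH_BLOCKING=1 vllm serve ISTA-DASLab/Mistral-Small-3.1-24B-Instruct-2503-GPTQ-4b-128g \
  --max-model-len 8192 --max-seq-len 8192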

@yewentao256

This error occurs because the official vLLM Docker images (up to v0.8.5) do not include CUDA architectures for new GPUs (like the RTX 5090, compute capability sm_120). vLLM 0.9.0 adds support, but you must build the Docker image yourself with the correct CUDA arch flags until an official image is released.

Specifically, set torch_cuda_arch_list="12.0 12.1" during the Docker build to ensure compatibility with the 5090. The error will persist if you use prebuilt images or wheels that lack these architectures, even with the correct CUDA and PyTorch versions installed.

To resolve, build the Docker image with the appropriate build arguments. Example build command:

DOCKER_BUILDKIT=1 sudo docker build . --target vllm-openai \
  --tag myvllm --file docker/Dockerfile \
  --build-arg max_jobs=4 \
  --build-arg nvcc_threads=1 \
  --build-arg torch_cuda_arch_list="12.0 12.1" \
  --build-arg RUN_WHEEL_CHECK=false

Then run the container as usual. For more details and troubleshooting, see the discussion in vLLM issue #16901 and vLLM issue #17739.

Or you can compile the kernels yourself directly by building vLLM from source; a rough sketch is below.
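
A minimal sketch, assuming a local CUDA 12.8 toolkit and a matching PyTorch cu128 install are already present:

# Build vLLM from source so the CUDA kernels are compiled for sm_120.
git clone https://github.com/vllm-project/vllm.git
cd vllm
export TORCH_CUDA_ARCH_LIST="12.0 12.1"   # include the RTX 5090 (Blackwell) architectures
export MAX_JOBS=4                         # limit parallel nvcc jobs to keep memory usage in check
pip install -e . --no-build-isolation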

@mgoin mgoin closed this as completed Jun 3, 2025