Qwen 2.5 Quantization is slower than fp16 with vLLM #717

Open
orionw opened this issue Feb 22, 2025 · 2 comments

orionw commented Feb 22, 2025

Similar to #645, I am getting worse performance and throughput with the quantized version. I used the out-of-the-box quantization example and the basic vLLM script. This happens with both the 7B and 14B models.
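For reference, the quantization followed the standard out-of-the-box AutoAWQ example, roughly along the lines below (the base model name and output path here are placeholders):

    # Minimal sketch of the out-of-the-box AutoAWQ quantization flow; paths are placeholders.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "Qwen/Qwen2.5-7B-Instruct"  # assumed base model
    quant_path = "qwen-7b-awq"
    # Default 4-bit GEMM config from the AutoAWQ examples (matches "Version: gemm" in the output below)
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)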

With vLLM I see roughly 1.8x lower throughput for the AWQ model. When I run the benchmark script directly, however, AWQ shows better decode speed:

Output of the benchmarking script for the AWQ model:

 -- Loading model...
 -- Warming up...
 -- Generating 32 tokens, 32 in context...
 ** Speed (Prefill): 228.75 tokens/second
 ** Speed (Decode): 86.49 tokens/second
 ** Max Memory (device: 0): 5.40 GB (5.80%)
 -- Loading model...
 -- Warming up...
 -- Generating 64 tokens, 64 in context...
 ** Speed (Prefill): 3486.72 tokens/second
 ** Speed (Decode): 86.38 tokens/second
 ** Max Memory (device: 0): 5.40 GB (5.80%)
 -- Loading model...
 -- Warming up...
 -- Generating 128 tokens, 128 in context...
 ** Speed (Prefill): 4590.52 tokens/second
 ** Speed (Decode): 85.33 tokens/second
 ** Max Memory (device: 0): 5.41 GB (5.81%)
 -- Loading model...
 -- Warming up...
 -- Generating 256 tokens, 256 in context...
 ** Speed (Prefill): 5008.78 tokens/second
 ** Speed (Decode): 85.19 tokens/second
 ** Max Memory (device: 0): 5.43 GB (5.83%)
 -- Loading model...
 -- Warming up...
 -- Generating 512 tokens, 512 in context...
 ** Speed (Prefill): 5496.49 tokens/second
 ** Speed (Decode): 84.98 tokens/second
 ** Max Memory (device: 0): 5.54 GB (5.95%)
 -- Loading model...
 -- Warming up...
 -- Generating 1024 tokens, 1024 in context...
 ** Speed (Prefill): 15427.16 tokens/second
 ** Speed (Decode): 84.86 tokens/second
 ** Max Memory (device: 0): 5.71 GB (6.13%)
 -- Loading model...
 -- Warming up...
 -- Generating 2048 tokens, 2048 in context...
 ** Speed (Prefill): 18722.37 tokens/second
 ** Speed (Decode): 84.74 tokens/second
 ** Max Memory (device: 0): 6.14 GB (6.60%)
 -- Loading model...
 -- Warming up...
 -- Generating 4096 tokens, 4096 in context...
 ** Speed (Prefill): 20145.56 tokens/second
 ** Speed (Decode): 84.65 tokens/second
 ** Max Memory (device: 0): 7.01 GB (7.52%)
Device: cuda:0
GPU: NVIDIA H100 NVL
Model: /home/oweller2/my_scratch/AutoAWQ/qwen-7b-awq/
Version: gemm
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)   |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
|            1 |               32 |              32 |             228.75 |             86.49 | 5.40 GB (5.80%) |
|            1 |               64 |              64 |            3486.72 |             86.38 | 5.40 GB (5.80%) |
|            1 |              128 |             128 |            4590.52 |             85.33 | 5.41 GB (5.81%) |
|            1 |              256 |             256 |            5008.78 |             85.19 | 5.43 GB (5.83%) |
|            1 |              512 |             512 |            5496.49 |             84.98 | 5.54 GB (5.95%) |
|            1 |             1024 |            1024 |           15427.2  |             84.86 | 5.71 GB (6.13%) |
|            1 |             2048 |            2048 |           18722.4  |             84.74 | 6.14 GB (6.60%) |
|            1 |             4096 |            4096 |           20145.6  |             84.65 | 7.01 GB (7.52%) |

Versus the non-quantized FP16 model:

 -- Loading model...
 -- Warming up...
 -- Generating 32 tokens, 32 in context...
 ** Speed (Prefill): 236.07 tokens/second
 ** Speed (Decode): 72.46 tokens/second
 ** Max Memory (device: 0): 14.38 GB (15.45%)
 -- Loading model...
 -- Warming up...
 -- Generating 64 tokens, 64 in context...
 ** Speed (Prefill): 3610.96 tokens/second
 ** Speed (Decode): 72.52 tokens/second
 ** Max Memory (device: 0): 14.38 GB (15.45%)
 -- Loading model...
 -- Warming up...
 -- Generating 128 tokens, 128 in context...
 ** Speed (Prefill): 7661.59 tokens/second
 ** Speed (Decode): 72.35 tokens/second
 ** Max Memory (device: 0): 14.38 GB (15.45%)
 -- Loading model...
 -- Warming up...
 -- Generating 256 tokens, 256 in context...
 ** Speed (Prefill): 13484.31 tokens/second
 ** Speed (Decode): 72.53 tokens/second
 ** Max Memory (device: 0): 14.38 GB (15.45%)
 -- Loading model...
 -- Warming up...
 -- Generating 512 tokens, 512 in context...
 ** Speed (Prefill): 20993.46 tokens/second
 ** Speed (Decode): 72.07 tokens/second
 ** Max Memory (device: 0): 14.43 GB (15.50%)
 -- Loading model...
 -- Warming up...
 -- Generating 1024 tokens, 1024 in context...
 ** Speed (Prefill): 24013.15 tokens/second
 ** Speed (Decode): 72.44 tokens/second
 ** Max Memory (device: 0): 14.61 GB (15.69%)
 -- Loading model...
 -- Warming up...
 -- Generating 2048 tokens, 2048 in context...
 ** Speed (Prefill): 22595.46 tokens/second
 ** Speed (Decode): 72.41 tokens/second
 ** Max Memory (device: 0): 14.97 GB (16.07%)
 -- Loading model...
 -- Warming up...
 -- Generating 4096 tokens, 4096 in context...
 ** Speed (Prefill): 24222.07 tokens/second
 ** Speed (Decode): 72.35 tokens/second
 ** Max Memory (device: 0): 15.67 GB (16.82%)
Device: cuda:0
GPU: NVIDIA H100 NVL
Model: /home/oweller2/my_scratch/AutoAWQ/qwen-7b-custom
Version: FP16
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            1 |               32 |              32 |             236.07 |             72.46 | 14.38 GB (15.45%) |
|            1 |               64 |              64 |            3610.96 |             72.52 | 14.38 GB (15.45%) |
|            1 |              128 |             128 |            7661.59 |             72.35 | 14.38 GB (15.45%) |
|            1 |              256 |             256 |           13484.3  |             72.53 | 14.38 GB (15.45%) |
|            1 |              512 |             512 |           20993.5  |             72.07 | 14.43 GB (15.50%) |
|            1 |             1024 |            1024 |           24013.2  |             72.44 | 14.61 GB (15.69%) |
|            1 |             2048 |            2048 |           22595.5  |             72.41 | 14.97 GB (16.07%) |
|            1 |             4096 |            4096 |           24222.1  |             72.35 | 15.67 GB (16.82%) |

Installed packages:

vllm==0.7.2
autoawq==0.2.8
autoawq_kernels==0.0.9

with the following vLLM setup:

        # Requires: from vllm import LLM, SamplingParams
        self.sampling_params = SamplingParams(
            temperature=0,
            max_tokens=max_output_tokens,
            logprobs=20,
            skip_special_tokens=False
        )
        self.model = LLM(
            model=model_name_or_path,
            tensor_parallel_size=int(num_gpus),
            trust_remote_code=True,
            max_model_len=context_size,
            gpu_memory_utilization=0.9,
            quantization="AWQ",
            dtype="float16"
        )
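
Generation then goes through the standard vLLM offline API, roughly as follows (a minimal sketch; the `prompts` list is a placeholder):

        # Minimal usage sketch in the same class context; `prompts` is a hypothetical list of strings.
        outputs = self.model.generate(prompts, self.sampling_params)
        for request_output in outputs:
            print(request_output.outputs[0].text)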

Am I using vLLM incorrectly, or do I need additional packages for AWQ to perform well?


orionw commented Feb 22, 2025

Update: removing quantization="AWQ" (per this link) seems to speed it up, but it is still slower than FP16.
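
That is, constructing the engine without the explicit quantization argument and letting vLLM infer the method from the model's config; a minimal sketch of the changed call (whether a faster AWQ kernel gets selected depends on the GPU and vLLM version):

        # Same setup as above, minus quantization="AWQ"; vLLM reads the quantization
        # method from the checkpoint's config and may pick an optimized AWQ kernel.
        self.model = LLM(
            model=model_name_or_path,
            tensor_parallel_size=int(num_gpus),
            trust_remote_code=True,
            max_model_len=context_size,
            gpu_memory_utilization=0.9,
            dtype="float16",
        )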

Juntongkuki commented

> Update: removing quantization="AWQ" (per this link) seems to speed it up, but it is still slower than FP16.

I have the same problem.
