Qwen 2.5 Quantization is slower than fp16 with vLLM #717

Open
orionw opened this issue Feb 22, 2025 · 2 comments

orionw commented Feb 22, 2025

Similar to #645, I am getting worse performance and throughput with the quantized version. I used the out-of-the-box quantization example and the basic vLLM script. This happens with both the 7B and 14B models.
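For reference, the quantization followed the standard out-of-the-box AutoAWQ example, roughly along the lines below (the base model name and output path here are placeholders):

    # Minimal sketch of the out-of-the-box AutoAWQ quantization flow; paths are placeholders.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model_path = "Qwen/Qwen2.5-7B-Instruct"  # assumed base model
    quant_path = "qwen-7b-awq"
    # Default 4-bit GEMM config from the AutoAWQ examples (matches "Version: gemm" in the output below)
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    model.quantize(tokenizer, quant_config=quant_config)
    model.save_quantized(quant_path)
    tokenizer.save_pretrained(quant_path)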

With vLLM I see roughly 1.8x lower throughput for the AWQ model. When I run the benchmark script directly, however, AWQ shows better decode speed:

Output of the benchmarking script for the AWQ model:

 -- Loading model...
 -- Warming up...
 -- Generating 32 tokens, 32 in context...
 ** Speed (Prefill): 228.75 tokens/second
 ** Speed (Decode): 86.49 tokens/second
 ** Max Memory (device: 0): 5.40 GB (5.80%)
 -- Loading model...
 -- Warming up...
 -- Generating 64 tokens, 64 in context...
 ** Speed (Prefill): 3486.72 tokens/second
 ** Speed (Decode): 86.38 tokens/second
 ** Max Memory (device: 0): 5.40 GB (5.80%)
 -- Loading model...
 -- Warming up...
 -- Generating 128 tokens, 128 in context...
 ** Speed (Prefill): 4590.52 tokens/second
 ** Speed (Decode): 85.33 tokens/second
 ** Max Memory (device: 0): 5.41 GB (5.81%)
 -- Loading model...
 -- Warming up...
 -- Generating 256 tokens, 256 in context...
 ** Speed (Prefill): 5008.78 tokens/second
 ** Speed (Decode): 85.19 tokens/second
 ** Max Memory (device: 0): 5.43 GB (5.83%)
 -- Loading model...
 -- Warming up...
 -- Generating 512 tokens, 512 in context...
 ** Speed (Prefill): 5496.49 tokens/second
 ** Speed (Decode): 84.98 tokens/second
 ** Max Memory (device: 0): 5.54 GB (5.95%)
 -- Loading model...
 -- Warming up...
 -- Generating 1024 tokens, 1024 in context...
 ** Speed (Prefill): 15427.16 tokens/second
 ** Speed (Decode): 84.86 tokens/second
 ** Max Memory (device: 0): 5.71 GB (6.13%)
 -- Loading model...
 -- Warming up...
 -- Generating 2048 tokens, 2048 in context...
 ** Speed (Prefill): 18722.37 tokens/second
 ** Speed (Decode): 84.74 tokens/second
 ** Max Memory (device: 0): 6.14 GB (6.60%)
 -- Loading model...
 -- Warming up...
 -- Generating 4096 tokens, 4096 in context...
 ** Speed (Prefill): 20145.56 tokens/second
 ** Speed (Decode): 84.65 tokens/second
 ** Max Memory (device: 0): 7.01 GB (7.52%)
Device: cuda:0
GPU: NVIDIA H100 NVL
Model: /home/oweller2/my_scratch/AutoAWQ/qwen-7b-awq/
Version: gemm
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)   |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
|            1 |               32 |              32 |             228.75 |             86.49 | 5.40 GB (5.80%) |
|            1 |               64 |              64 |            3486.72 |             86.38 | 5.40 GB (5.80%) |
|            1 |              128 |             128 |            4590.52 |             85.33 | 5.41 GB (5.81%) |
|            1 |              256 |             256 |            5008.78 |             85.19 | 5.43 GB (5.83%) |
|            1 |              512 |             512 |            5496.49 |             84.98 | 5.54 GB (5.95%) |
|            1 |             1024 |            1024 |           15427.2  |             84.86 | 5.71 GB (6.13%) |
|            1 |             2048 |            2048 |           18722.4  |             84.74 | 6.14 GB (6.60%) |
|            1 |             4096 |            4096 |           20145.6  |             84.65 | 7.01 GB (7.52%) |

Versus the non-quantized FP16 model:

 -- Loading model...
 -- Warming up...
 -- Generating 32 tokens, 32 in context...
 ** Speed (Prefill): 236.07 tokens/second
 ** Speed (Decode): 72.46 tokens/second
 ** Max Memory (device: 0): 14.38 GB (15.45%)
 -- Loading model...
 -- Warming up...
 -- Generating 64 tokens, 64 in context...
 ** Speed (Prefill): 3610.96 tokens/second
 ** Speed (Decode): 72.52 tokens/second
 ** Max Memory (device: 0): 14.38 GB (15.45%)
 -- Loading model...
 -- Warming up...
 -- Generating 128 tokens, 128 in context...
 ** Speed (Prefill): 7661.59 tokens/second
 ** Speed (Decode): 72.35 tokens/second
 ** Max Memory (device: 0): 14.38 GB (15.45%)
 -- Loading model...
 -- Warming up...
 -- Generating 256 tokens, 256 in context...
 ** Speed (Prefill): 13484.31 tokens/second
 ** Speed (Decode): 72.53 tokens/second
 ** Max Memory (device: 0): 14.38 GB (15.45%)
 -- Loading model...
 -- Warming up...
 -- Generating 512 tokens, 512 in context...
 ** Speed (Prefill): 20993.46 tokens/second
 ** Speed (Decode): 72.07 tokens/second
 ** Max Memory (device: 0): 14.43 GB (15.50%)
 -- Loading model...
 -- Warming up...
 -- Generating 1024 tokens, 1024 in context...
 ** Speed (Prefill): 24013.15 tokens/second
 ** Speed (Decode): 72.44 tokens/second
 ** Max Memory (device: 0): 14.61 GB (15.69%)
 -- Loading model...
 -- Warming up...
 -- Generating 2048 tokens, 2048 in context...
 ** Speed (Prefill): 22595.46 tokens/second
 ** Speed (Decode): 72.41 tokens/second
 ** Max Memory (device: 0): 14.97 GB (16.07%)
 -- Loading model...
 -- Warming up...
 -- Generating 4096 tokens, 4096 in context...
 ** Speed (Prefill): 24222.07 tokens/second
 ** Speed (Decode): 72.35 tokens/second
 ** Max Memory (device: 0): 15.67 GB (16.82%)
Device: cuda:0
GPU: NVIDIA H100 NVL
Model: /home/oweller2/my_scratch/AutoAWQ/qwen-7b-custom
Version: FP16
|   Batch Size |   Prefill Length |   Decode Length |   Prefill tokens/s |   Decode tokens/s | Memory (VRAM)     |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:------------------|
|            1 |               32 |              32 |             236.07 |             72.46 | 14.38 GB (15.45%) |
|            1 |               64 |              64 |            3610.96 |             72.52 | 14.38 GB (15.45%) |
|            1 |              128 |             128 |            7661.59 |             72.35 | 14.38 GB (15.45%) |
|            1 |              256 |             256 |           13484.3  |             72.53 | 14.38 GB (15.45%) |
|            1 |              512 |             512 |           20993.5  |             72.07 | 14.43 GB (15.50%) |
|            1 |             1024 |            1024 |           24013.2  |             72.44 | 14.61 GB (15.69%) |
|            1 |             2048 |            2048 |           22595.5  |             72.41 | 14.97 GB (16.07%) |
|            1 |             4096 |            4096 |           24222.1  |             72.35 | 15.67 GB (16.82%) |

Installed packages:

vllm==0.7.2
autoawq==0.2.8
autoawq_kernels==0.0.9

with the following vLLM setup:

        # Requires: from vllm import LLM, SamplingParams
        self.sampling_params = SamplingParams(
            temperature=0,
            max_tokens=max_output_tokens,
            logprobs=20,
            skip_special_tokens=False
        )
        self.model = LLM(
            model=model_name_or_path,
            tensor_parallel_size=int(num_gpus),
            trust_remote_code=True,
            max_model_len=context_size,
            gpu_memory_utilization=0.9,
            quantization="AWQ",
            dtype="float16"
        )
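
Generation then goes through the standard vLLM offline API, roughly as follows (a minimal sketch; the `prompts` list is a placeholder):

        # Minimal usage sketch in the same class context; `prompts` is a hypothetical list of strings.
        outputs = self.model.generate(prompts, self.sampling_params)
        for request_output in outputs:
            print(request_output.outputs[0].text)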

Am I using vLLM incorrectly, or do I need additional packages for AWQ to perform well?


orionw commented Feb 22, 2025

Update: removing quantization="AWQ" (per this link) seems to speed it up, but it is still slower than FP16.
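
That is, constructing the engine without the explicit quantization argument and letting vLLM infer the method from the model's config; a minimal sketch of the changed call (whether a faster AWQ kernel gets selected depends on the GPU and vLLM version):

        # Same setup as above, minus quantization="AWQ"; vLLM reads the quantization
        # method from the checkpoint's config and may pick an optimized AWQ kernel.
        self.model = LLM(
            model=model_name_or_path,
            tensor_parallel_size=int(num_gpus),
            trust_remote_code=True,
            max_model_len=context_size,
            gpu_memory_utilization=0.9,
            dtype="float16",
        )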

Juntongkuki commented

> Update: removing quantization="AWQ" (per this link) seems to speed it up, but it is still slower than FP16.

I have the same problem.
