
[Bug]: deepseek-v2-lite-w8a8 quantization inference repeated output #628


Closed
Potabk opened this issue Apr 23, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@Potabk
Contributor

Potabk commented Apr 23, 2025

Your current environment

The output of `python collect_env.py`
INFO 04-23 06:24:43 [__init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-23 06:24:43 [__init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-23 06:24:43 [__init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-23 06:24:43 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-23 06:24:43 [__init__.py:44] plugin ascend loaded.
INFO 04-23 06:24:43 [__init__.py:230] Platform plugin ascend is activated
Collecting environment information...
PyTorch version: 2.5.1
Is debug build: False

OS: Ubuntu 22.04.5 LTS (aarch64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 4.0.0
Libc version: glibc-2.35

Python version: 3.10.15 (main, Nov 27 2024, 06:51:55) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-4.19.90-vhulk2211.3.0.h1960.eulerosv2r10.aarch64-aarch64-with-glibc2.35

CPU:
Architecture:                       aarch64
CPU op-mode(s):                     64-bit
Byte Order:                         Little Endian
CPU(s):                             192
On-line CPU(s) list:                0-191
Vendor ID:                          HiSilicon
Model name:                         Kunpeng-920
Model:                              0
Thread(s) per core:                 1
Core(s) per cluster:                48
Socket(s):                          -
Cluster(s):                         4
Stepping:                           0x1
BogoMIPS:                           200.00
Flags:                              fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma dcpop asimddp asimdfhm ssbs
L1d cache:                          12 MiB (192 instances)
L1i cache:                          12 MiB (192 instances)
L2 cache:                           96 MiB (192 instances)
L3 cache:                           192 MiB (8 instances)
NUMA node(s):                       8
NUMA node0 CPU(s):                  0-23
NUMA node1 CPU(s):                  24-47
NUMA node2 CPU(s):                  48-71
NUMA node3 CPU(s):                  72-95
NUMA node4 CPU(s):                  96-119
NUMA node5 CPU(s):                  120-143
NUMA node6 CPU(s):                  144-167
NUMA node7 CPU(s):                  168-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; __user pointer sanitization
Vulnerability Spectre v2:           Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.3.0
[pip3] torch==2.5.1
[pip3] torch-npu==2.5.1.dev20250320
[pip3] torchvision==0.20.1
[pip3] transformers==4.51.3
[conda] Could not collect
vLLM Version: 0.8.4
vLLM Ascend Version: 0.1.dev159+g66a0837 (git sha: 66a0837)

ENV Variables:
ATB_OPSRUNNER_KERNEL_CACHE_TILING_SIZE=10240
ATB_OPSRUNNER_KERNEL_CACHE_LOCAL_COUNT=1
ATB_STREAM_SYNC_EVERY_RUNNER_ENABLE=0
ATB_OPSRUNNER_SETUP_CACHE_ENABLE=1
ATB_WORKSPACE_MEM_ALLOC_GLOBAL=0
ATB_DEVICE_TILING_BUFFER_BLOCK_NUM=32
ATB_STREAM_SYNC_EVERY_KERNEL_ENABLE=0
ATB_OPSRUNNER_KERNEL_CACHE_GLOABL_COUNT=5
ATB_HOME_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0
ASCEND_TOOLKIT_HOME=/usr/local/Ascend/ascend-toolkit/latest
ATB_COMPARE_TILING_EVERY_KERNEL=0
ASCEND_OPP_PATH=/usr/local/Ascend/ascend-toolkit/latest/opp
LD_LIBRARY_PATH=/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/lib:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/examples:/usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_0/tests/atbopstest:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64:/usr/local/Ascend/ascend-toolkit/latest/tools/aml/lib64/plugin:/usr/local/Ascend/ascend-toolkit/latest/lib64:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/opskernel:/usr/local/Ascend/ascend-toolkit/latest/lib64/plugin/nnengine:/usr/local/Ascend/ascend-toolkit/latest/opp/built-in/op_impl/ai_core/tbe/op_tiling/lib/linux/aarch64:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:/usr/local/Ascend/driver/lib64/common:/usr/local/Ascend/driver/lib64/driver:
ASCEND_AICPU_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_OPSRUNNER_KERNEL_CACHE_TYPE=3
ATB_RUNNER_POOL_SIZE=64
ATB_STREAM_SYNC_EVERY_OPERATION_ENABLE=0
ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest
ATB_MATMUL_SHUFFLE_K_ENABLE=1
ATB_LAUNCH_KERNEL_WITH_TILING=1
ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=1
ATB_HOST_TILING_BUFFER_BLOCK_NUM=128
ATB_SHARE_MEMORY_NAME_SUFFIX=
TORCH_DEVICE_BACKEND_AUTOLOAD=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1


NPU:
+------------------------------------------------------------------------------------------------+
| npu-smi 24.1.rc3                 Version: 24.1.rc3                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 0     910B4               | OK            | 90.8        39                0    / 0             |
| 0                         | 0000:C1:00.0  | 0           0    / 0          29070/ 32768         |
+===========================+===============+====================================================+
| 1     910B4               | OK            | 86.7        40                0    / 0             |
| 0                         | 0000:01:00.0  | 0           0    / 0          2824 / 32768         |
+===========================+===============+====================================================+
| 3     910B4               | OK            | 92.4        39                0    / 0             |
| 0                         | 0000:02:00.0  | 0           0    / 0          2834 / 32768         |
+===========================+===============+====================================================+
| 4     910B4               | OK            | 82.6        40                0    / 0             |
| 0                         | 0000:81:00.0  | 0           0    / 0          2826 / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| 0       0                 | 5063          | python3.10               | 26293                   |
+===========================+===============+====================================================+
| No running processes found in NPU 1                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 3                                                            |
+===========================+===============+====================================================+
| No running processes found in NPU 4                                                            |
+===========================+===============+====================================================+

CANN:
package_name=Ascend-cann-toolkit
version=8.0.0
innerversion=V100R001C20SPC001B251
compatible_version=[V100R001C15],[V100R001C17],[V100R001C18],[V100R001C19],[V100R001C20]
arch=aarch64
os=linux
path=/usr/local/Ascend/ascend-toolkit/8.0.0/aarch64-linux

🐛 Describe the bug

I'm using the deepseek-v2-lite w8a8 quantization feature, and I encountered an unexpected bug.
The model was quantized following https://github.com/vllm-project/vllm-ascend/pull/580#issuecomment-2816747613.
Serving command:

vllm serve dspk-fully-quant-w8a8 --max-model-len 4096  -tp 1 --trust-remote-code

and the server starts up normally:

INFO 04-23 06:20:25 [api_server.py:1081] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-23 06:20:25 [launcher.py:26] Available routes are:
INFO 04-23 06:20:25 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 04-23 06:20:25 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 04-23 06:20:25 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-23 06:20:25 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 04-23 06:20:25 [launcher.py:34] Route: /health, Methods: GET
INFO 04-23 06:20:25 [launcher.py:34] Route: /load, Methods: GET
INFO 04-23 06:20:25 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-23 06:20:25 [launcher.py:34] Route: /version, Methods: GET
INFO 04-23 06:20:25 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /score, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /invocations, Methods: POST
INFO 04-23 06:20:25 [launcher.py:34] Route: /metrics, Methods: GET
INFO:     Started server process [4918]
INFO:     Waiting for application startup.
INFO:     Application startup complete.

the client input:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "dspk-fully-quant-w8a8",
        "prompt": "deepseek是什么?",
        "max_tokens": "128",
        "top_p": "0.95",
        "top_k": "40",
        "temperature": "0.0"
    }'

but the output is strange:

{"id":"cmpl-0797e36c26c5492ea68c93a0d97fc478","object":"text_completion","created":1745389315,"model":"dspk-fully-quant-w8a8","choices":[{"index":0,"text":"\n\n\n20132012111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":133,"completion_tokens":128,"prompt_tokens_details":null}}
Potabk added the bug label Apr 23, 2025
@Potabk
Contributor Author

Potabk commented Apr 23, 2025

Update:
when I serve the quantized model using tp 4 (multi-process), everything is back to normal:

vllm serve dspk-fully-quant-w8a8 --max-model-len 4096  -tp 4 --trust-remote-code

same client input:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "dspk-fully-quant-w8a8",
        "prompt": "deepseek是什么?",
        "max_tokens": "128",
        "top_p": "0.95",
        "top_k": "40",
        "temperature": "0.0"
    }'

the output:

{"id":"cmpl-1c7eacdcf6354b41823091b36032ad5b","object":"text_completion","created":1745390598,"model":"dspk-fully-quant-w8a8","choices":[{"index":0,"text":"\ndeepseek是一款基于人工智能技术的智能搜索工具,它能够帮助用户快速、准确地找到所需的信息。\ndeepseek有哪些功能?\ndeepseek具有多种功能,包括:\n1. 智能搜索:deepseek能够根据用户输入的关键词,快速搜索出相关信息,并提供多种搜索结果。\n2. 语音搜索:deepseek支持语音搜索,用户可以通过语音输入关键词,快速搜索出相关信息。\n3. 图片搜索:deepseek能够识别图片中的内容,并根据图片内容搜索出相关信息。\n4. 视频搜索:deepseek能够识别视频中的","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":5,"total_tokens":133,"completion_tokens":128,"prompt_tokens_details":null}}

I want to know what the difference is and what the impact is.

@wangxiyuan
Collaborator

#453 let's put the problem and feedback here

@learning-chip

learning-chip commented May 14, 2025

I hit a similar issue. The model weights were created by strictly following the commands in https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html#install-modelslim-and-convert-model, including the exact commit (git checkout a396750f930e3bd2b8aa13730401dcbb4bc684ca).

I uploaded a copy of the weights to https://huggingface.co/jay-zhuang/DeepSeek-V2-Lite-Chat-w8a8-act2-npu, which you can download to reproduce the bug below.

Then, the run-time Docker environment strictly follows https://vllm-ascend.readthedocs.io/en/latest/installation.html#configure-a-new-environment
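
A rough sketch of that container setup (the image tag and host paths here are assumptions; the installation guide above has the authoritative command):

docker run -it --rm \
    --device /dev/davinci0 \
    --device /dev/davinci_manager \
    --device /dev/devmm_svm \
    --device /dev/hisi_hdc \
    -v /usr/local/dcmi:/usr/local/dcmi \
    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /model_weights:/model_weights \
    quay.io/ascend/vllm-ascend:v0.8.4rc2 bash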

The model loads successfully, but I still get incorrect output with this minimal script.

from vllm import LLM, SamplingParams

model_path = "/model_weights/DeepSeek-V2-Lite-Chat-w8a8-act2-npu"
model = LLM(
    model=model_path,
    max_num_seqs=16,
    tensor_parallel_size=1,
    trust_remote_code=True,
    max_model_len=2048
)

prompts = [
    "Hello, my name is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = model.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

Output is:

Prompt: 'Hello, my name is', Generated text: '!!!!!!!!!!!!!!!!'
Prompt: 'The future of AI is', Generated text: '!!!!!!!!!!!!!!!!'

In comparison, the un-quantized original version https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat produces correct output using the same script with just a different model path:

Prompt: 'Hello, my name is', Generated text: ' Alex Simmons and I am the proud owner of this blog. I am a licensed'
Prompt: 'The future of AI is', Generated text: ' here: AI-powered solutions have rapidly advanced in recent years, bringing with them'

The quantized version with tensor_parallel_size=2 also gives OK results:

Prompt: 'Hello, my name is', Generated text: ' Rachel. I am a mother of three wonderful children and an avid lover of all'
Prompt: 'The future of AI is', Generated text: ' often seen as a major driver of change in the workplace. The use of AI'

@tangzhiyi11

#883 I got wrong output when using deepseek-v2-lite-w8a8.

@learning-chip

learning-chip commented May 16, 2025

Quantized version with tensor_parallel_size=2 also gives ok results:

The problem is that TP=2 is very slow due to PyTorch eager mode and the large launch overhead of allreduce. If TP=1 worked correctly, it would be much more efficient for the ds-v2-lite-w8a8 model.
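
As a rough way to quantify that gap, the same generate call can be timed under each TP setting (a minimal sketch; the model path, prompts, and LLM arguments are the same assumptions as in the script above, and each TP value is run as a separate process):

# time_generate.py -- run once per TP setting:
#   python time_generate.py 1
#   python time_generate.py 2
import sys
import time

from vllm import LLM, SamplingParams

tp = int(sys.argv[1]) if len(sys.argv) > 1 else 1
model_path = "/model_weights/DeepSeek-V2-Lite-Chat-w8a8-act2-npu"

model = LLM(
    model=model_path,
    max_num_seqs=16,
    tensor_parallel_size=tp,
    trust_remote_code=True,
    max_model_len=2048,
)
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)
prompts = ["Hello, my name is", "The future of AI is"]

start = time.perf_counter()
model.generate(prompts, sampling_params)
print(f"tp={tp}: generate() took {time.perf_counter() - start:.2f} s")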

Adding this profiling to the above test script:

import torch_npu  # the NPU profiler lives in the torch_npu package

log_dir = "./profile_output"  # directory where the trace files are written

with torch_npu.profiler.profile(
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(log_dir),
        activities=[torch_npu.profiler.ProfilerActivity.CPU, torch_npu.profiler.ProfilerActivity.NPU],
        record_shapes=True,
        profile_memory=True,
        with_flops=True,
        with_stack=True
) as prof:
    outputs = model.generate(prompts, sampling_params)

allreduce takes >60% of the time:

(profiler trace screenshot: allreduce dominates the NPU timeline)
