[Bug]: deepseek-v2-lite-w8a8 accuracy is incorrect #883


Open
tangzhiyi11 opened this issue May 16, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@tangzhiyi11

Your current environment

Your output of above commands here

🐛 Describe the bug

Model downloaded from: https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-w8a8

Start the server:

vllm serve /home/weight/DeepSeek-V2-Lite-w8a8  --tensor-parallel-size 4 --trust-remote-code --served-model-name "dpsk-w8a8" --max-model-len 4096

Request:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "dpsk-w8a8",
        "prompt": "what is deepseek?",
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0.0
    }'
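
For reference, the same request can be sent through the official openai Python client (a sketch, assuming the openai package is installed; top_k is not part of the OpenAI schema, so it is passed via vLLM's extra_body extension):

from openai import OpenAI

# Point the official client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="dpsk-w8a8",
    prompt="what is deepseek?",
    max_tokens=128,
    temperature=0.0,
    top_p=0.95,
    extra_body={"top_k": 40},  # vLLM-specific sampling parameter
)
print(resp.choices[0].text)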

Response:

{"id":"cmpl-2ef69dc2ac964e8aa3dafa6dcaee78a5","object":"text_completion","created":1747378218,"model":"dpsk-w8a8","choices":[{"index":0,"text":"\n)),)_)))....\".\".\".\".\"................................................................................................................................................................................................................................","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":134,"completion_tokens":128,"prompt_tokens_details":null}}

I followed https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html exactly.

Environment:
vllm-ascend: main branch
vllm: v0.8.5.post1
CANN: 8.1.RC1
NPU: 910B

@tangzhiyi11 tangzhiyi11 added the bug Something isn't working label May 16, 2025
@tangzhiyi11
Author

Offline test:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="/xxxxl/DeepSeek-V2-Lite-w8a8", trust_remote_code=True)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n, Generated text: {generated_text!r}")

Result:

INFO 05-16 07:08:34 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-16 07:08:34 [model_runner.py:953] Starting to load model /mnt/cwai/pjlab_data_new_hpfs/dev/share/deepseek_model/DeepSeek-V2-Lite-w8a8...
INFO 05-16 07:08:34 [quantizer.py:88] Using the vLLM Ascend Quantizer version now!
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:06,  2.10s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:04<00:01,  1.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.37s/it]

INFO 05-16 07:08:40 [loader.py:458] Loading weights took 5.48 seconds
INFO 05-16 07:08:40 [model_runner.py:958] Loading model weights took 15.2719 GB
[rank0]:[W516 07:08:43.716983494 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
INFO 05-16 07:08:49 [executor_base.py:112] # npu blocks: 4232, # CPU blocks: 1078
INFO 05-16 07:08:49 [executor_base.py:117] Maximum concurrency for 163840 tokens per request: 3.31x
INFO 05-16 07:08:49 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 8.78 seconds
Processed prompts: 100%|________________________________________________________________| 4/4 [00:01<00:00,  2.99it/s, est. speed input: 19.40 toks/s, output: 47.76 toks/s]
Prompt: 'Hello, my name is'
, Generated text: ' Hello Hello. Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello'
Prompt: 'The president of the United States is'
, Generated text: ' The The...The...TheTheOntOntOntOntOntOntOntOntOnt'
Prompt: 'The capital of France is'
, Generated text: ' the...You", the capital at France is the.\n], The capital at'
Prompt: 'The future of AI is'
, Generated text: ' an AI.\n\n, ALAL AL AL AL AL AL AL AL AL'
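
For reference, a minimal greedy-decoding variant of the script above (a sketch reusing the placeholder model path): temperature=0.0 makes vLLM decode deterministically, so garbled output here points at the weights or kernels rather than sampling randomness.

from vllm import LLM, SamplingParams

# Greedy decoding: temperature=0.0 selects the argmax token at each step,
# so repeated runs produce identical output.
greedy = SamplingParams(temperature=0.0, max_tokens=32)

llm = LLM(model="/xxxxl/DeepSeek-V2-Lite-w8a8", trust_remote_code=True)
out = llm.generate(["The capital of France is"], greedy)
print(out[0].outputs[0].text)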

@learning-chip

I can get correct output with TP=2 following #628 (comment). TP=1 still needs to be fixed, though.

@tangzhiyi11
Author

> I can get correct output with TP=2 following #628 (comment). TP=1 still needs to be fixed, though.

@learning-chip I tested TP=2 but still got wrong output.

Code:

from vllm import LLM, SamplingParams

prompts = [
    "How are you?",
    "Please introduce China",
    "Is Shanghai the capital of China?",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(
    model="xxxx/DeepSeek-V2-Lite-w8a8",
    tensor_parallel_size=2,
    trust_remote_code=True,
    )

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n, Generated text: {generated_text!r}")

Result:

INFO 05-16 07:53:51 [executor_base.py:112] # npu blocks: 9065, # CPU blocks: 1078
INFO 05-16 07:53:51 [executor_base.py:117] Maximum concurrency for 163840 tokens per request: 7.08x
INFO 05-16 07:53:53 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 12.23 seconds
Processed prompts: 100%|_________________________________________________________________| 3/3 [00:01<00:00,  1.57it/s, est. speed input: 8.91 toks/s, output: 25.15 toks/s]
Prompt: 'How are you?'
, Generated text: '?!?!?!!!,../_,....., etc))?,_'
Prompt: 'Please introduce China'
, Generated text: '),,..,_,,, I, ",),, etc,'
Prompt: 'Is Shanghai the capital of China?'
, Generated text: '\nWhat do Chinese capital?\nWhat is Chinese capital?\nWhat is Chinese'

@Potabk
Contributor

Potabk commented May 16, 2025

Adaptation to the latest compatible msmodelslim tag is still in progress; please watch for the documentation update.

@tangzhiyi11
Author

> Adaptation to the latest compatible msmodelslim tag is still in progress; please watch for the documentation update.

@Potabk will the model at https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-w8a8 also be updated? I downloaded the model directly rather than quantizing it with msmodelslim myself.

@Potabk
Contributor

Potabk commented May 16, 2025

@tangzhiyi11 this model was produced strictly following https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html; once the new msmodelslim tag lands, we will update the weights.

@tangzhiyi11
Author

> @tangzhiyi11 this model was produced strictly following https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html; once the new msmodelslim tag lands, we will update the weights.

@Potabk ok, thx. By the way, where can I download the weights for deepseek-r1-w8a8?

@22dimensions
Contributor

> @Potabk ok, thx. By the way, where can I download the weights for deepseek-r1-w8a8?

https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8. We are uploading the weights.

@tangzhiyi11
Author

> https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8. We are uploading the weights.

@22dimensions I have downloaded the DeepSeek-R1-W8A8 weights and have a couple of questions about them:

  • Is the configuration_deepseek.py file missing from the weights package?
  • In the quant_model_description.json file, I couldn't find entries for model.norm.weight and lm_head.weight. Is this expected, or should these weights be included? (See the check below.)
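
A quick way to verify the second point (a sketch; the local path is hypothetical, and it assumes quant_model_description.json is a JSON object keyed by tensor name, as the description above suggests):

import json

# Hypothetical local path to the downloaded weights directory.
desc_path = "/path/to/DeepSeek-R1-W8A8/quant_model_description.json"

with open(desc_path) as f:
    desc = json.load(f)

# Report whether the two tensors in question are described at all.
for name in ("model.norm.weight", "lm_head.weight"):
    print(name, "present" if name in desc else "missing")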
