[Bug]: deepseek-v2-lite-w8a8 accuracy is incorrect #883


Open
tangzhiyi11 opened this issue May 16, 2025 · 9 comments
Labels
bug Something isn't working

Comments

@tangzhiyi11

Your current environment

Your output of above commands here

🐛 Describe the bug

Model downloaded from: https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-w8a8

Start the server:

vllm serve /home/weight/DeepSeek-V2-Lite-w8a8  --tensor-parallel-size 4 --trust-remote-code --served-model-name "dpsk-w8a8" --max-model-len 4096

Request:

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "dpsk-w8a8",
        "prompt": "what is deepseek?",
        "max_tokens": 128,
        "top_p": 0.95,
        "top_k": 40,
        "temperature": 0.0
    }'
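
For reference, the same request can be sent through the official openai Python client (a sketch, assuming the openai package is installed; top_k is not part of the OpenAI schema, so it is passed via vLLM's extra_body extension):

from openai import OpenAI

# Point the official client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.completions.create(
    model="dpsk-w8a8",
    prompt="what is deepseek?",
    max_tokens=128,
    temperature=0.0,
    top_p=0.95,
    extra_body={"top_k": 40},  # vLLM-specific sampling parameter
)
print(resp.choices[0].text)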

Response:

{"id":"cmpl-2ef69dc2ac964e8aa3dafa6dcaee78a5","object":"text_completion","created":1747378218,"model":"dpsk-w8a8","choices":[{"index":0,"text":"\n)),)_)))....\".\".\".\".\"................................................................................................................................................................................................................................","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":134,"completion_tokens":128,"prompt_tokens_details":null}}

I followed https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html exactly.

Environment:
vllm-ascend: main branch
vllm: v0.8.5.post1
CANN: 8.1.RC1
NPU: 910B

@tangzhiyi11 tangzhiyi11 added the bug Something isn't working label May 16, 2025
@tangzhiyi11
Author

Offline test:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(model="/xxxxl/DeepSeek-V2-Lite-w8a8", trust_remote_code=True)

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n, Generated text: {generated_text!r}")

Result:

INFO 05-16 07:08:34 [parallel_state.py:1004] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-16 07:08:34 [model_runner.py:953] Starting to load model /mnt/cwai/pjlab_data_new_hpfs/dev/share/deepseek_model/DeepSeek-V2-Lite-w8a8...
INFO 05-16 07:08:34 [quantizer.py:88] Using the vLLM Ascend Quantizer version now!
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:02<00:06,  2.10s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:04<00:01,  1.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.31s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00,  1.37s/it]

INFO 05-16 07:08:40 [loader.py:458] Loading weights took 5.48 seconds
INFO 05-16 07:08:40 [model_runner.py:958] Loading model weights took 15.2719 GB
[rank0]:[W516 07:08:43.716983494 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
INFO 05-16 07:08:49 [executor_base.py:112] # npu blocks: 4232, # CPU blocks: 1078
INFO 05-16 07:08:49 [executor_base.py:117] Maximum concurrency for 163840 tokens per request: 3.31x
INFO 05-16 07:08:49 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 8.78 seconds
Processed prompts: 100%|________________________________________________________________| 4/4 [00:01<00:00,  2.99it/s, est. speed input: 19.40 toks/s, output: 47.76 toks/s]
Prompt: 'Hello, my name is'
, Generated text: ' Hello Hello. Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello Hello'
Prompt: 'The president of the United States is'
, Generated text: ' The The...The...TheTheOntOntOntOntOntOntOntOntOnt'
Prompt: 'The capital of France is'
, Generated text: ' the...You", the capital at France is the.\n], The capital at'
Prompt: 'The future of AI is'
, Generated text: ' an AI.\n\n, ALAL AL AL AL AL AL AL AL AL'
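
For reference, a minimal greedy-decoding variant of the script above (a sketch reusing the placeholder model path): temperature=0.0 makes vLLM decode deterministically, so garbled output here points at the weights or kernels rather than sampling randomness.

from vllm import LLM, SamplingParams

# Greedy decoding: temperature=0.0 selects the argmax token at each step,
# so repeated runs produce identical output.
greedy = SamplingParams(temperature=0.0, max_tokens=32)

llm = LLM(model="/xxxxl/DeepSeek-V2-Lite-w8a8", trust_remote_code=True)
out = llm.generate(["The capital of France is"], greedy)
print(out[0].outputs[0].text)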

@learning-chip

I can get correct output with TP=2 following #628 (comment). TP=1 still needs to be fixed, though.

@tangzhiyi11
Author

> I can get correct output with TP=2 following #628 (comment). TP=1 still needs to be fixed, though.

@learning-chip I tested TP=2 but still got wrong output.

Code:

from vllm import LLM, SamplingParams

prompts = [
    "How are you?",
    "Please introduce China",
    "Is Shanghai the capital of China?",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# Create an LLM.
llm = LLM(
    model="xxxx/DeepSeek-V2-Lite-w8a8",
    tensor_parallel_size=2,
    trust_remote_code=True,
    )

# Generate texts from the prompts.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}\n, Generated text: {generated_text!r}")

Result:

INFO 05-16 07:53:51 [executor_base.py:112] # npu blocks: 9065, # CPU blocks: 1078
INFO 05-16 07:53:51 [executor_base.py:117] Maximum concurrency for 163840 tokens per request: 7.08x
INFO 05-16 07:53:53 [llm_engine.py:437] init engine (profile, create kv cache, warmup model) took 12.23 seconds
Processed prompts: 100%|_________________________________________________________________| 3/3 [00:01<00:00,  1.57it/s, est. speed input: 8.91 toks/s, output: 25.15 toks/s]
Prompt: 'How are you?'
, Generated text: '?!?!?!!!,../_,....., etc))?,_'
Prompt: 'Please introduce China'
, Generated text: '),,..,_,,, I, ",),, etc,'
Prompt: 'Is Shanghai the capital of China?'
, Generated text: '\nWhat do Chinese capital?\nWhat is Chinese capital?\nWhat is Chinese'

@Potabk
Contributor

Potabk commented May 16, 2025

Adaptation to the latest compatible msmodelslim tag is still in progress; please watch for the documentation update.

@tangzhiyi11
Author

> Adaptation to the latest compatible msmodelslim tag is still in progress; please watch for the documentation update.

@Potabk will the model at https://www.modelscope.cn/models/vllm-ascend/DeepSeek-V2-Lite-w8a8 also be updated? I downloaded the model directly rather than quantizing it with msmodelslim myself.

@Potabk
Contributor

Potabk commented May 16, 2025

@tangzhiyi11 this model was produced strictly following https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html; once the new msmodelslim tag lands, we will update the weights.

@tangzhiyi11
Author

> @tangzhiyi11 this model was produced strictly following https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html; once the new msmodelslim tag lands, we will update the weights.

@Potabk ok, thx. By the way, where can I download the weights for deepseek-r1-w8a8?

@22dimensions
Contributor

> @Potabk ok, thx. By the way, where can I download the weights for deepseek-r1-w8a8?

https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8. We are uploading the weights.

@tangzhiyi11
Author

> https://www.modelscope.cn/models/vllm-ascend/DeepSeek-R1-W8A8. We are uploading the weights.

@22dimensions I have downloaded the DeepSeek-R1-W8A8 weights and have a couple of questions about them:

  • Is the configuration_deepseek.py file missing from the weights package?
  • In the quant_model_description.json file, I couldn't find entries for model.norm.weight and lm_head.weight. Is this expected, or should these weights be included? (See the check below.)
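
A quick way to verify the second point (a sketch; the local path is hypothetical, and it assumes quant_model_description.json is a JSON object keyed by tensor name, as the description above suggests):

import json

# Hypothetical local path to the downloaded weights directory.
desc_path = "/path/to/DeepSeek-R1-W8A8/quant_model_description.json"

with open(desc_path) as f:
    desc = json.load(f)

# Report whether the two tensors in question are described at all.
for name in ("model.norm.weight", "lm_head.weight"):
    print(name, "present" if name in desc else "missing")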
