[Bug]: deepseek-v2-lite-w8a8 quantization inference repeated output #628
Comments
Update:
Same client input:
The output:
I want to know: what is the difference, and what is the impact?
#453: let's put the problem and feedback here.
I hit a similar issue. The model weights were created by strictly following the commands in https://vllm-ascend.readthedocs.io/en/latest/tutorials/multi_npu_quantization.html#install-modelslim-and-convert-model, including the exact commit. I uploaded a copy of the weights to https://huggingface.co/jay-zhuang/DeepSeek-V2-Lite-Chat-w8a8-act2-npu, which you can download to reproduce the bug below (a download sketch is included at the end of this comment). The run-time Docker environment strictly follows https://vllm-ascend.readthedocs.io/en/latest/installation.html#configure-a-new-environment. The model loads successfully, but I also get incorrect output with this minimal script:

from vllm import LLM, SamplingParams
model_path = "/model_weights/DeepSeek-V2-Lite-Chat-w8a8-act2-npu"
model = LLM(
model=model_path,
max_num_seqs=16,
tensor_parallel_size=1,
trust_remote_code=True,
max_model_len=2048
)
prompts = [
"Hello, my name is",
"The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = model.generate(prompts, sampling_params)
for output in outputs:
prompt = output.prompt
generated_text = output.outputs[0].text
print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}") Output is:
In comparison, the un-quantized original version https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat produces correct output using the same script with just a different model path:
Quantized version with
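For convenience, here is a minimal sketch of pulling the w8a8 checkpoint uploaded above via huggingface_hub; the local_dir is an arbitrary choice made to match the script's model_path, not a required path:

```python
# Hedged sketch: download the uploaded w8a8 checkpoint referenced above.
# local_dir is an arbitrary destination chosen to match the script's model_path.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="jay-zhuang/DeepSeek-V2-Lite-Chat-w8a8-act2-npu",
    local_dir="/model_weights/DeepSeek-V2-Lite-Chat-w8a8-act2-npu",
)
```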
#883: I got wrong output when using deepseek-v2-lite w8a8.
The problem is that TP=2 is very slow due to PyTorch eager mode and the large launch overhead of allreduce. If TP=1 worked correctly, it would be much more efficient for the ds-v2-lite w8a8 model. Adding this profiling to the above test script:

import torch_npu  # the profiler API below comes from the Ascend torch_npu package

log_dir = "./profile_output"  # any writable directory for the trace files

with torch_npu.profiler.profile(
    on_trace_ready=torch_npu.profiler.tensorboard_trace_handler(log_dir),
    activities=[torch_npu.profiler.ProfilerActivity.CPU,
                torch_npu.profiler.ProfilerActivity.NPU],
    record_shapes=True,
    profile_memory=True,
    with_flops=True,
    with_stack=True,
) as prof:
    outputs = model.generate(prompts, sampling_params)

shows that allreduce takes >60% of the time:
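For comparison, the TP=2 configuration discussed above (reported to run, but dominated by allreduce launch overhead) differs from the TP=1 reproduction script only in the `tensor_parallel_size` argument. A minimal sketch, assuming the same checkpoint path and prompts as before:

```python
# Hedged sketch: same reproduction as above, but sharded across two NPUs.
# Only tensor_parallel_size changes; the model path is assumed to be the
# same local w8a8 checkpoint used in the TP=1 script.
from vllm import LLM, SamplingParams

model_tp2 = LLM(
    model="/model_weights/DeepSeek-V2-Lite-Chat-w8a8-act2-npu",
    max_num_seqs=16,
    tensor_parallel_size=2,  # TP=2: slow per the profiling above; TP=1 is the broken case
    trust_remote_code=True,
    max_model_len=2048,
)

outputs = model_tp2.generate(
    ["Hello, my name is", "The future of AI is"],
    SamplingParams(temperature=0.8, top_p=0.95),
)
for out in outputs:
    print(f"Prompt: {out.prompt!r}, Generated text: {out.outputs[0].text!r}")
```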
Your current environment
The output of `python collect_env.py`
🐛 Describe the bug
I'm using the deepseek-v2-lite w8a8 quantization feature, and I encountered an unexpected bug.
The model was quantized following https://github.com/vllm-project/vllm-ascend/pull/580#issuecomment-2816747613.
Serving command:
The server starts up normally:
The client input:
But the output is strange:
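For anyone reproducing the serving path, a minimal client request against vLLM's OpenAI-compatible server generally looks like the sketch below; the host, port, served model name, prompt, and sampling values here are illustrative assumptions, not the exact values from this report:

```python
# Hedged sketch of a client request to a vLLM OpenAI-compatible server.
# Host, port, model name, prompt, and sampling values are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "deepseek-v2-lite-w8a8",  # assumed served model name
        "prompt": "Hello, my name is",
        "max_tokens": 64,
        "temperature": 0.8,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["text"])
```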