[Bug]: vllm 0.7.3 v1 engine do not support Baichuan model #866

Open
kevin-hongkai opened this issue May 15, 2025 · 2 comments
Labels
bug Something isn't working

Comments


kevin-hongkai commented May 15, 2025

Your current environment

vLLM 0.7.3 does not support the Baichuan model. Environment: vllm 0.7.3, CANN 8.1.RC1, torch 2.5.1, torch-npu 2.5.1. I followed the docs at https://vllm-ascend.readthedocs.io/en/v0.7.3/installation.html to run the Baichuan model.

🐛 Describe the bug

1. Environment: vllm 0.7.3, CANN 8.1.RC1, torch 2.5.1, torch-npu 2.5.1. I followed the docs at https://vllm-ascend.readthedocs.io/en/v0.7.3/installation.html to run the Baichuan model.

2. Command:
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 vllm/benchmarks/benchmark_throughput.py --model baichuan-inc/Baichuan2-7B-Chat --input-len 512 --output-len 1 --tensor-parallel-size 1 --num-prompts 300 --disable-custom-all-reduce --trust-remote-code

3. Resulting error:
CANN_VERSION : 8.1.RC1
opp/built-in/op_impl/ai_core/tbe/impl/dynamic/../ascendc/gather_v3/gather_v3_base.h:137
Assertion `(0 <= val && val < this->gxSize_)' Index 5330 out of range[0 4096)!

AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1119]
Aicore kernel execute failed, device_id=2, stream_id=2, report_stream_id=2, task_id=96, flip_num=0, fault kernel_name=GatherV3_7869a97190b9b4d296d9414a005b954b_high_performance_10330, fault kernel info ext=none,

4. With vllm 0.8.5.post1 and vllm-ascend 0.8.5rc1, the V1 engine can run the Baichuan model; I hope the fix from 0.8.5rc1 can be merged back into 0.7.3 (see the sketch below).
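
A sketch of how the newer stack mentioned above could be verified, assuming vllm 0.8.5.post1 and vllm-ascend 0.8.5rc1 install cleanly into the same Ascend environment (package names and versions are taken from this report; the exact installation steps are in the vllm-ascend installation docs):

# Hypothetical verification on the newer stack; assumes both packages are available from PyPI
pip install vllm==0.8.5.post1 vllm-ascend==0.8.5rc1
# Enable the V1 engine, then rerun the same benchmark as above
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 vllm/benchmarks/benchmark_throughput.py --model baichuan-inc/Baichuan2-7B-Chat --input-len 512 --output-len 1 --tensor-parallel-size 1 --num-prompts 300 --disable-custom-all-reduce --trust-remote-code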

kevin-hongkai added the bug label on May 15, 2025

Yikun (Collaborator) commented May 15, 2025

I tried to reproduce:

vLLM Ascend

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.7.3
docker run --rm \
--name yikun-test \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

python3 /vllm-workspace/vllm/benchmarks/benchmark_throughput.py --model ~/.cache/modelscope/hub/baichuan-inc/Baichuan2-7B-Chat --input-len 512 --output-len 1 --tensor-parallel-size 1 --num-prompts 300 --disable-custom-all-reduce --trust-remote-code
INFO 05-15 15:42:43 model_runner.py:907] Loading model weights took 13.9831 GB
INFO 05-15 15:42:46 executor_base.py:111] # npu blocks: 613, # CPU blocks: 64
INFO 05-15 15:42:46 executor_base.py:116] Maximum concurrency for 4096 tokens per request: 19.16x
INFO 05-15 15:42:50 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 6.36 seconds
Processed prompts:  30%| 89/300 [00:03<00:07, 30.03it/s, est. speed input: 13230.79 toks/s, output: 25.84 toks/s]
INFO 05-15 15:42:55 metrics.py:455] Avg prompt throughput: 10267.8 tokens/s, Avg generation throughput: 20.1 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 188 reqs, GPU KV cache usage: 5.2%, CPU KV cache usage: 0.0%.
Processed prompts:  83%| | 249/300 [00:08<00:01, 30.44it/s, est. speed input: 14653.42 toks/s, output: 28.62 toks/s]
INFO 05-15 15:43:00 metrics.py:455] Avg prompt throughput: 15582.0 tokens/s, Avg generation throughput: 30.4 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 28 reqs, GPU KV cache usage: 5.2%, CPU KV cache usage: 0.0%.
Processed prompts: 100%|| 300/300 [00:09<00:00, 30.18it/s, est. speed input: 15454.29 toks/s, output: 30.18 toks/s]
Throughput: 25.78 requests/s, 13224.37 total tokens/s, 25.78 output tokens/s

vLLM Ascend + MindIE Turbo

After installing mindie-turbo, it also works:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.7.3
docker run --rm \
--name yikun-test \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

pip install mindie-turbo

python3 /vllm-workspace/vllm/benchmarks/benchmark_throughput.py --model ~/.cache/modelscope/hub/baichuan-inc/Baichuan2-7B-Chat --input-len 512 --output-len 1 --tensor-parallel-size 1 --num-prompts 300 --disable-custom-all-reduce --trust-remote-code

INFO 05-15 15:45:08 model_runner.py:907] Loading model weights took 13.9830 GB
INFO 05-15 15:45:10 executor_base.py:111] # npu blocks: 611, # CPU blocks: 64
INFO 05-15 15:45:10 executor_base.py:116] Maximum concurrency for 4096 tokens per request: 19.09x
INFO 05-15 15:45:11 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 2.70 seconds
Processed prompts:  27% 81/300 [00:03<00:08, 26.72it/s, est. speed input: 11726.21 toks/s, output: 22.90 toks/s]
INFO 05-15 15:45:16 metrics.py:455] Avg prompt throughput: 9393.1 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 196 reqs, GPU KV cache usage: 5.2%, CPU KV cache usage: 0.0%.
Processed prompts:  75% 225/300 [00:08<00:02, 27.25it/s, est. speed input: 13050.10 toks/s, output: 25.49 toks/s]
INFO 05-15 15:45:21 metrics.py:455] Avg prompt throughput: 13942.3 tokens/s, Avg generation throughput: 27.2 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 52 reqs, GPU KV cache usage: 5.2%, CPU KV cache usage: 0.0%.
Processed prompts: 100% 300/300 [00:11<00:00, 27.13it/s, est. speed input: 13889.58 toks/s, output: 27.13 toks/s]
Throughput: 23.58 requests/s, 12096.62 total tokens/s, 23.58 output tokens/s

Everything seems OK here.

kevin-hongkai changed the title from "[Bug]: vllm 0.7.3 do not support Baichuan model" to "[Bug]: vllm 0.7.3 v1 engine do not support Baichuan model" on May 16, 2025

kevin-hongkai (Author)

Sorry, I forgot to mention: this happens with the vLLM V1 engine. You need to enable the V1 engine first:
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
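
For reference, here is a sketch of the failing reproduction with the V1 engine enabled, reusing the container and command from the reproduction above (all paths and flags are taken from the earlier comments):

# Inside the vllm-ascend v0.7.3 container started above, enable the V1 engine first
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Then rerun the same throughput benchmark that succeeded under the default engine
python3 /vllm-workspace/vllm/benchmarks/benchmark_throughput.py \
  --model ~/.cache/modelscope/hub/baichuan-inc/Baichuan2-7B-Chat \
  --input-len 512 --output-len 1 --tensor-parallel-size 1 \
  --num-prompts 300 --disable-custom-all-reduce --trust-remote-code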
