[Bug]: vllm 0.7.3 v1 engine do not support Baichuan model #866

Open
kevin-hongkai opened this issue May 15, 2025 · 2 comments
Labels
bug Something isn't working

Comments


kevin-hongkai commented May 15, 2025

Your current environment

vLLM 0.7.3 does not support the Baichuan model. Environment: vllm 0.7.3, CANN 8.1.RC1, torch 2.5.1, torch-npu 2.5.1. I followed the docs at https://vllm-ascend.readthedocs.io/en/v0.7.3/installation.html to run the Baichuan model.

🐛 Describe the bug

1. Environment: vllm 0.7.3, CANN 8.1.RC1, torch 2.5.1, torch-npu 2.5.1. I followed the docs at https://vllm-ascend.readthedocs.io/en/v0.7.3/installation.html to run the Baichuan model.

2. Command:
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 vllm/benchmarks/benchmark_throughput.py --model baichuan-inc/Baichuan2-7B-Chat --input-len 512 --output-len 1 --tensor-parallel-size 1 --num-prompts 300 --disable-custom-all-reduce --trust-remote-code

3. Resulting error:
CANN_VERSION : 8.1.RC1
opp/built-in/op_impl/ai_core/tbe/impl/dynamic/../ascendc/gather_v3/gather_v3_base.h:137
Assertion `(0 <= val && val < this->gxSize_)' Index 5330 out of range[0 4096)!

AIV Kernel happen error, retCode=0x31.[FUNC:GetError][FILE:stream.cc][LINE:1119]
Aicore kernel execute failed, device_id=2, stream_id=2, report_stream_id=2, task_id=96, flip_num=0, fault kernel_name=GatherV3_7869a97190b9b4d296d9414a005b954b_high_performance_10330, fault kernel info ext=none,

4. With vllm 0.8.5.post1 and vllm-ascend 0.8.5rc1, the V1 engine can run the Baichuan model; I hope the fix from 0.8.5rc1 can be merged back into 0.7.3 (see the sketch below).
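
A sketch of how the newer stack mentioned above could be verified, assuming vllm 0.8.5.post1 and vllm-ascend 0.8.5rc1 install cleanly into the same Ascend environment (package names and versions are taken from this report; the exact installation steps are in the vllm-ascend installation docs):

# Hypothetical verification on the newer stack; assumes both packages are available from PyPI
pip install vllm==0.8.5.post1 vllm-ascend==0.8.5rc1
# Enable the V1 engine, then rerun the same benchmark as above
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
python3 vllm/benchmarks/benchmark_throughput.py --model baichuan-inc/Baichuan2-7B-Chat --input-len 512 --output-len 1 --tensor-parallel-size 1 --num-prompts 300 --disable-custom-all-reduce --trust-remote-code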

kevin-hongkai added the bug label on May 15, 2025

Yikun (Collaborator) commented May 15, 2025

I tried to reproduce:

vLLM Ascend

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.7.3
docker run --rm \
--name yikun-test \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

python3 /vllm-workspace/vllm/benchmarks/benchmark_throughput.py --model ~/.cache/modelscope/hub/baichuan-inc/Baichuan2-7B-Chat --input-len 512 --output-len 1 --tensor-parallel-size 1 --num-prompts 300 --disable-custom-all-reduce --trust-remote-code
INFO 05-15 15:42:43 model_runner.py:907] Loading model weights took 13.9831 GB
INFO 05-15 15:42:46 executor_base.py:111] # npu blocks: 613, # CPU blocks: 64
INFO 05-15 15:42:46 executor_base.py:116] Maximum concurrency for 4096 tokens per request: 19.16x
INFO 05-15 15:42:50 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 6.36 seconds
Processed prompts:  30%| 89/300 [00:03<00:07, 30.03it/s, est. speed input: 13230.79 toks/s, output: 25.84 toks/s]
INFO 05-15 15:42:55 metrics.py:455] Avg prompt throughput: 10267.8 tokens/s, Avg generation throughput: 20.1 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 188 reqs, GPU KV cache usage: 5.2%, CPU KV cache usage: 0.0%.
Processed prompts:  83%| | 249/300 [00:08<00:01, 30.44it/s, est. speed input: 14653.42 toks/s, output: 28.62 toks/s]
INFO 05-15 15:43:00 metrics.py:455] Avg prompt throughput: 15582.0 tokens/s, Avg generation throughput: 30.4 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 28 reqs, GPU KV cache usage: 5.2%, CPU KV cache usage: 0.0%.
Processed prompts: 100%|| 300/300 [00:09<00:00, 30.18it/s, est. speed input: 15454.29 toks/s, output: 30.18 toks/s]
Throughput: 25.78 requests/s, 13224.37 total tokens/s, 25.78 output tokens/s

vLLM Ascend + MindIE Turbo

After installing mindie-turbo, it also works:

# Update DEVICE according to your device (/dev/davinci[0-7])
export DEVICE=/dev/davinci0
export IMAGE=m.daocloud.io/quay.io/ascend/vllm-ascend:v0.7.3
docker run --rm \
--name yikun-test \
--device $DEVICE \
--device /dev/davinci_manager \
--device /dev/devmm_svm \
--device /dev/hisi_hdc \
-v /usr/local/dcmi:/usr/local/dcmi \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/Ascend/driver/lib64/:/usr/local/Ascend/driver/lib64/ \
-v /usr/local/Ascend/driver/version.info:/usr/local/Ascend/driver/version.info \
-v /etc/ascend_install.info:/etc/ascend_install.info \
-v /root/.cache:/root/.cache \
-it $IMAGE bash

pip install mindie-turbo

python3 /vllm-workspace/vllm/benchmarks/benchmark_throughput.py --model ~/.cache/modelscope/hub/baichuan-inc/Baichuan2-7B-Chat --input-len 512 --output-len 1 --tensor-parallel-size 1 --num-prompts 300 --disable-custom-all-reduce --trust-remote-code

INFO 05-15 15:45:08 model_runner.py:907] Loading model weights took 13.9830 GB
INFO 05-15 15:45:10 executor_base.py:111] # npu blocks: 611, # CPU blocks: 64
INFO 05-15 15:45:10 executor_base.py:116] Maximum concurrency for 4096 tokens per request: 19.09x
INFO 05-15 15:45:11 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 2.70 seconds
Processed prompts:  27% 81/300 [00:03<00:08, 26.72it/s, est. speed input: 11726.21 toks/s, output: 22.90 toks/s]
INFO 05-15 15:45:16 metrics.py:455] Avg prompt throughput: 9393.1 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 196 reqs, GPU KV cache usage: 5.2%, CPU KV cache usage: 0.0%.
Processed prompts:  75% 225/300 [00:08<00:02, 27.25it/s, est. speed input: 13050.10 toks/s, output: 25.49 toks/s]
INFO 05-15 15:45:21 metrics.py:455] Avg prompt throughput: 13942.3 tokens/s, Avg generation throughput: 27.2 tokens/s, Running: 8 reqs, Swapped: 0 reqs, Pending: 52 reqs, GPU KV cache usage: 5.2%, CPU KV cache usage: 0.0%.
Processed prompts: 100% 300/300 [00:11<00:00, 27.13it/s, est. speed input: 13889.58 toks/s, output: 27.13 toks/s]
Throughput: 23.58 requests/s, 12096.62 total tokens/s, 23.58 output tokens/s

Everything seems OK here.

kevin-hongkai changed the title from "[Bug]: vllm 0.7.3 do not support Baichuan model" to "[Bug]: vllm 0.7.3 v1 engine do not support Baichuan model" on May 16, 2025

kevin-hongkai (Author)

Sorry, I forgot to mention: this happens with the vLLM V1 engine. You need to enable the V1 engine first:
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
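
For reference, here is a sketch of the failing reproduction with the V1 engine enabled, reusing the container and command from the reproduction above (all paths and flags are taken from the earlier comments):

# Inside the vllm-ascend v0.7.3 container started above, enable the V1 engine first
export VLLM_USE_V1=1
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Then rerun the same throughput benchmark that succeeded under the default engine
python3 /vllm-workspace/vllm/benchmarks/benchmark_throughput.py \
  --model ~/.cache/modelscope/hub/baichuan-inc/Baichuan2-7B-Chat \
  --input-len 512 --output-len 1 --tensor-parallel-size 1 \
  --num-prompts 300 --disable-custom-all-reduce --trust-remote-code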
