[Release]: vLLM Ascend v0.7.3 release checklist #644

Comments
Also update the feature support doc, like #650; for v0.7.3 we should add:

@ZhengJun9 Please help cherry-pick the LoRA support PR. Thanks.
- Guided Decoding:
- V1 Engine:
- Distribution:
Pooling model tests pass with `test_scoring` and `test_embedding` on V0; a minimal sketch of the corresponding offline API calls follows.
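As a reference for what those tests cover, here is a minimal sketch of scoring and embedding calls through the offline `LLM` API. The model names below are placeholders, not the models used in the actual tests:

```python
from vllm import LLM

# Scoring (reranker-style) task; the model name here is only an example.
score_llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
scores = score_llm.score(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Berlin is the capital of Germany."],
)
print([out.outputs.score for out in scores])

# Embedding task; again, the model name is only an example.
embed_llm = LLM(model="BAAI/bge-m3", task="embed")
embeddings = embed_llm.embed(["Hello, my name is", "The capital of France is"])
print([len(out.outputs.embedding) for out in embeddings])
```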
### What this PR does / why we need it?
According to this [RFC](#396) and [this](#448), we submit the relevant code to support (1) Multi-LoRA and (2) Multi-LoRA dynamic serving. LoRA reference: [LoRA reference](https://docs.vllm.ai/en/latest/features/lora.html)

### Does this PR introduce _any_ user-facing change?
The following OpenAI HTTP APIs will be supported:
- /v1/load_lora_adapter
- /v1/unload_lora_adapter

### How was this patch tested?
git clone https://github.com/vllm-project/vllm.git
cd vllm/examples/offline_inference/ && python3 multilora_inference.py

> [[Release]: vLLM Ascend v0.7.3 release checklist](#644 (comment))

---------

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu12 <507435917@qq.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
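For reference, a rough sketch of how those two endpoints can be exercised once the server is started with LoRA enabled and runtime LoRA updates allowed (e.g. `--enable-lora` and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True`). The base URL, adapter name, and adapter path below are placeholders, not values from this PR:

```python
import requests

BASE = "http://localhost:8000"  # placeholder: a locally running vllm serve instance

# Load an adapter at runtime; "sql-lora" and the path are placeholders.
resp = requests.post(
    f"{BASE}/v1/load_lora_adapter",
    json={"lora_name": "sql-lora", "lora_path": "/path/to/sql-lora-adapter"},
)
print(resp.status_code, resp.text)

# The loaded adapter can then be targeted by name via the "model" field of
# /v1/completions or /v1/chat/completions requests.

# Unload it again when no longer needed.
resp = requests.post(f"{BASE}/v1/unload_lora_adapter", json={"lora_name": "sql-lora"})
print(resp.status_code, resp.text)
```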
Multi-step test passes with `tests/singlecard/multi_step/test_correctness_llm.py`; a minimal reproduction sketch follows the log.

/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /home/xxx/miniconda3/envs/atb/bin/python
cachedir: .pytest_cache
rootdir: /home/xxx/code/vllm-ascend
configfile: pytest.ini
plugins: shard-0.1.2, markdown-docs-0.9.0, rerunfailures-15.0, md-0.2.0, asyncio-0.25.3, anyio-4.8.0, mock-3.14.0, forked-1.6.0, typeguard-4.3.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 1 item
Running 1 items in this shard: tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] INFO 04-29 08:04:33 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-29 08:04:33 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-29 08:04:33 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-29 08:04:33 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-29 08:04:33 __init__.py:44] plugin ascend loaded.
INFO 04-29 08:04:33 __init__.py:198] Platform plugin ascend is activated
INFO 04-29 08:04:34 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-29 08:04:34 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-29 08:04:34 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-29 08:04:34 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-29 08:04:34 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-29 08:04:34 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-29 08:04:34 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-29 08:04:34 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-29 08:04:34 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-29 08:04:34 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-29 08:04:34 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-29 08:04:47 config.py:549] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
INFO 04-29 08:04:47 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=8, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:04:48 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.multi_step_worker.MultiStepWorker object at 0xfffd2df56c50>
WARNING 04-29 08:04:48 registry.py:335] `mm_limits` has already been set for model=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, and will be overwritten by the new values.
INFO 04-29 08:04:50 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.53it/s]
INFO 04-29 08:04:51 model_runner.py:827] Loading model weights took 0.9277 GB
INFO 04-29 08:04:56 executor_base.py:111] # npu blocks: 215234, # CPU blocks: 21845
INFO 04-29 08:04:56 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 3363.03x
INFO 04-29 08:04:56 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 5.29 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 38.20it/s, est. speed input: 760.49 toks/s, output: 191.06 toks/s]
INFO 04-29 08:05:13 config.py:549] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
INFO 04-29 08:05:13 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:05:14 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffa0403e530>
INFO 04-29 08:05:14 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.50it/s]
INFO 04-29 08:05:14 model_runner.py:827] Loading model weights took 0.9232 GB
INFO 04-29 08:05:15 executor_base.py:111] # npu blocks: 217132, # CPU blocks: 21845
INFO 04-29 08:05:15 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 3392.69x
INFO 04-29 08:05:15 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 0.87 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 41.00it/s, est. speed input: 816.15 toks/s, output: 205.05 toks/s]
PASSED
============================================================================= warnings summary =============================================================================
tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/code/vllm-cpu/vllm/vllm/executor/uniproc_executor.py:29: ResourceWarning: unclosed <socket.socket fd=14, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('172.20.0.2', 54388), raddr=('8.8.8.8', 80)>
get_ip(), get_open_port())
Enable tracemalloc to get traceback where the object was allocated.
See https://docs.pytest.org/en/stable/how-to/capture-warnings.html#resource-warnings for more info.
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== 1 passed, 3 warnings in 56.76s ======================================================================

(Same with mindie-turbo.)
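For completeness, a minimal sketch of the multi-step configuration exercised by the log above (`num_scheduler_steps=8`, eager mode, V0 engine). The model path is a placeholder:

```python
from vllm import LLM, SamplingParams

# Multi-step scheduling: the engine runs 8 decode steps per scheduler invocation,
# matching num_scheduler_steps=8 in the engine config above. Model is a placeholder.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    num_scheduler_steps=8,
    enforce_eager=True,
    max_model_len=1024,
)

outputs = llm.generate(
    ["vLLM is a high-throughput and memory-efficient inference engine for"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```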
Update: the output of chunked prefill does not align with transformers on either CANN 8.0.0.beta1 or CANN 8.1.rc1.beta1.

Failed test:

import os
import pytest
from tests.model_utils import check_logprobs_close, check_outputs_equal
MODELS = [
# "facebook/opt-125m",
"/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct"
# "meta-llama/Llama-3.2-1B-Instruct",
]
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("chunked_prefill_token_size", [1,
4, 16
])
@pytest.mark.parametrize("enforce_eager", [True])
# NOTE: Increasing this in this suite will fail CI because we currently cannot
# reset distributed env properly. Use a value > 1 just when you test.
@pytest.mark.parametrize("tensor_parallel_size", [1])
def test_models(
    hf_runner,
    vllm_runner,
    example_prompts,
    model: str,
    dtype: str,
    max_tokens: int,
    chunked_prefill_token_size: int,
    enforce_eager: bool,
    tensor_parallel_size: int,
) -> None:
    """
    Checks exact match decode between huggingface model and vllm runner with
    chunked prefill.
    """
    max_num_seqs = chunked_prefill_token_size
    max_num_batched_tokens = chunked_prefill_token_size
    with hf_runner(model, dtype=dtype) as hf_model:
        hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)
    with vllm_runner(
            model,
            dtype=dtype,
            max_num_batched_tokens=max_num_batched_tokens,
            enable_chunked_prefill=True,
            tensor_parallel_size=tensor_parallel_size,
            enforce_eager=enforce_eager,
            max_num_seqs=max_num_seqs,
    ) as vllm_model:
        vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)
    print(hf_outputs)
    print(100*"*")
    print(vllm_outputs)
    check_outputs_equal(
        outputs_0_lst=hf_outputs,
        outputs_1_lst=vllm_outputs,
        name_0="hf",
        name_1="vllm",
    )

The output of chunked prefill could not align to transformers.

/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /home/xxx/miniconda3/envs/atb/bin/python
cachedir: .pytest_cache
rootdir: /home/xxx/code/vllm-ascend
configfile: pytest.ini
plugins: shard-0.1.2, markdown-docs-0.9.0, rerunfailures-15.0, md-0.2.0, asyncio-0.25.3, anyio-4.8.0, mock-3.14.0, forked-1.6.0, typeguard-4.3.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 3 items
Running 3 items in this shard: tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct], tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct], tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] INFO 04-29 08:10:02 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-29 08:10:02 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-29 08:10:02 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-29 08:10:02 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-29 08:10:02 __init__.py:44] plugin ascend loaded.
INFO 04-29 08:10:02 __init__.py:198] Platform plugin ascend is activated
WARNING 04-29 08:10:02 config.py:2448] Casting torch.bfloat16 to torch.float16.
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
INFO 04-29 08:10:20 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-29 08:10:20 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-29 08:10:20 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-29 08:10:20 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-29 08:10:20 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-29 08:10:20 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-29 08:10:20 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-29 08:10:20 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-29 08:10:20 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-29 08:10:20 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-29 08:10:20 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-29 08:10:20 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 04-29 08:10:33 config.py:549] This model supports multiple tasks: {'embed', 'score', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 04-29 08:10:33 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=1.
INFO 04-29 08:10:33 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:10:35 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffcb00f6830>
INFO 04-29 08:10:35 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.85it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.84it/s]
INFO 04-29 08:10:36 model_runner.py:827] Loading model weights took 0.9277 GB
INFO 04-29 08:10:39 executor_base.py:111] # npu blocks: 293888, # CPU blocks: 21845
INFO 04-29 08:10:39 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 4592.00x
INFO 04-29 08:10:40 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 4.11 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00, 1.39s/it, est. speed input: 13.82 toks/s, output: 22.97 toks/s]
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 6147, 3847, 311, 1598, 44378, 389, 3460, 12934, 4119, 304, 15279, 11, 1393, 1083, 8241, 11050, 2473, 9691, 13, 1084, 11554, 2176, 4237, 323, 2205, 11320, 11, 448, 279, 5726, 311], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 58194, 21392, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 6009, 448, 279, 4124, 975, 389, 5662, 6832, 25185, 304, 279, 5099, 12, 17, 15, 339], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of Artificial Intelligence (AI) has been a long and complex process that began with the early work on machine learning algorithms in the mid-20th'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 21392, 320, 15469, 8, 323, 11097, 21392, 320, 23913, 8, 525, 1378, 12460, 18940, 429, 7512, 2155, 13566, 315, 279, 8109, 594, 8692, 16928, 13, 5692, 374, 264, 12313, 323], "Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial Intelligence (AI) and Human Intelligence (HI) are two distinct concepts that describe different aspects of the brain's processing capabilities. Here is a comparison and"), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 17167, 315, 13617, 315, 82316, 7798, 476, 33213, 13, 576, 6770, 6813, 315, 264, 29728, 3922, 2924, 1447, 16, 13, 5571], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that consists of layers of interconnected nodes or neurons. The basic components of a neural network include:\n\n1. Input'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 1260, 572, 264, 47394, 323, 7988, 5662, 448, 264, 9906, 2518, 27263, 323, 264, 2613, 11, 4778], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. He was a sleek and powerful machine with a bright red exterior and a small, round'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 1376, 5510, 304, 892, 432, 702, 11495, 1493, 5671], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some key ways in which it has affected these areas'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 429, 702, 6427, 54686, 279, 1879, 369, 23631, 13, 1084, 572, 23983, 1948, 220, 16, 20, 15, 18, 323, 220, 16], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci that has captivated the world for centuries. It was painted between 1503 and 1'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 13, 576, 4124, 11958, 37834, 279, 34211, 304, 8453, 624, 33, 13, 220, 99391, 86117, 28195, 60726, 19655, 102176, 29412, 125232, 128687, 8997, 34, 13, 220, 99391, 86117, 28195, 60726, 19655], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA. The early bird catches the worm in Chinese.\nB. 早番が先に虫を食べられる。\nC. 早番が先に")]
****************************************************************************************************
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 374, 6188, 311, 387, 1483, 304, 264, 8045, 315, 8357, 11, 2670, 5810, 4128, 8692, 11, 6366, 11129, 11, 323, 8806, 17843, 13, 576, 4712, 374, 5798, 389, 1909, 315, 279], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 20443, 11229, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 702, 4429, 1992, 916, 3807, 10793, 13, 5692, 525, 279, 3598, 68276, 304, 279, 4401, 315], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of artificial intelligence (AI) has been a long and complex process that has taken place over several decades. Here are the major milestones in the development of'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 11229, 320, 15469, 8, 323, 3738, 11229, 525, 1378, 12460, 18940, 429, 525, 3545, 1483, 51263, 2845, 11, 714, 807, 525, 537, 279, 1852, 3166, 13, 15235, 19257, 311, 279], 'Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial intelligence (AI) and human intelligence are two distinct concepts that are often used interchangeably, but they are not the same thing. AI refers to the'), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 374, 1483, 311, 2736, 9079, 1741, 438, 2168, 17843, 11, 5810, 4128, 8692, 11, 323, 8806, 17843, 13, 1084, 17167, 315], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that is used to perform tasks such as image recognition, natural language processing, and speech recognition. It consists of'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 431, 17, 9420, 17, 572, 264, 47394, 323, 47394, 12305, 429, 1030, 1012, 6188, 311, 387, 279], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. R2-D2 was a sleek and sleek robot that had been designed to be the'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 315, 279, 1376, 5510, 304, 892, 279, 27422, 702], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some of the key ways in which the pandemic has'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 11, 3465, 304, 279, 4124, 220, 16, 21, 339, 9294, 13, 1084, 374, 825, 315, 279, 1429, 11245, 35592, 304, 279], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci, created in the early 16th century. It is one of the most famous paintings in the'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 33, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 34, 25, 220, 99391, 86117, 15322], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA: 早番は早番に勝ちます。\nB: 早番は早番に勝ちます。\nC: 早番は")]
FAILED
tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] WARNING 04-29 08:11:10 config.py:2448] Casting torch.bfloat16 to torch.float16.
WARNING 04-29 08:11:24 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 04-29 08:11:24 config.py:549] This model supports multiple tasks: {'embed', 'score', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 04-29 08:11:24 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=4.
INFO 04-29 08:11:24 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:11:25 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfff5d95ce3e0>
INFO 04-29 08:11:25 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.41it/s]
INFO 04-29 08:11:26 model_runner.py:827] Loading model weights took 0.9241 GB
INFO 04-29 08:11:27 executor_base.py:111] # npu blocks: 294756, # CPU blocks: 21845
INFO 04-29 08:11:27 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 4605.56x
INFO 04-29 08:11:27 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 1.05 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00, 2.13it/s, est. speed input: 41.00 toks/s, output: 68.15 toks/s]
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 6147, 3847, 311, 1598, 44378, 389, 3460, 12934, 4119, 304, 15279, 11, 1393, 1083, 8241, 11050, 2473, 9691, 13, 1084, 11554, 2176, 4237, 323, 2205, 11320, 11, 448, 279, 5726, 311], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 58194, 21392, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 6009, 448, 279, 4124, 975, 389, 5662, 6832, 25185, 304, 279, 5099, 12, 17, 15, 339], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of Artificial Intelligence (AI) has been a long and complex process that began with the early work on machine learning algorithms in the mid-20th'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 21392, 320, 15469, 8, 323, 11097, 21392, 320, 23913, 8, 525, 1378, 12460, 18940, 429, 7512, 2155, 13566, 315, 279, 8109, 594, 8692, 16928, 13, 5692, 374, 264, 12313, 323], "Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial Intelligence (AI) and Human Intelligence (HI) are two distinct concepts that describe different aspects of the brain's processing capabilities. Here is a comparison and"), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 17167, 315, 13617, 315, 82316, 7798, 476, 33213, 13, 576, 6770, 6813, 315, 264, 29728, 3922, 2924, 1447, 16, 13, 5571], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that consists of layers of interconnected nodes or neurons. The basic components of a neural network include:\n\n1. Input'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 1260, 572, 264, 47394, 323, 7988, 5662, 448, 264, 9906, 2518, 27263, 323, 264, 2613, 11, 4778], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. He was a sleek and powerful machine with a bright red exterior and a small, round'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 1376, 5510, 304, 892, 432, 702, 11495, 1493, 5671], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some key ways in which it has affected these areas'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 429, 702, 6427, 54686, 279, 1879, 369, 23631, 13, 1084, 572, 23983, 1948, 220, 16, 20, 15, 18, 323, 220, 16], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci that has captivated the world for centuries. It was painted between 1503 and 1'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 13, 576, 4124, 11958, 37834, 279, 34211, 304, 8453, 624, 33, 13, 220, 99391, 86117, 28195, 60726, 19655, 102176, 29412, 125232, 128687, 8997, 34, 13, 220, 99391, 86117, 28195, 60726, 19655], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA. The early bird catches the worm in Chinese.\nB. 早番が先に虫を食べられる。\nC. 早番が先に")]
****************************************************************************************************
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 374, 6188, 311, 387, 1483, 304, 264, 8045, 315, 8357, 11, 2670, 5810, 4128, 8692, 11, 6366, 11129, 11, 323, 8806, 17843, 13, 576, 4712, 374, 5798, 389, 1909, 315, 279], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 20443, 11229, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 702, 4429, 1992, 916, 3807, 10793, 13, 5692, 525, 279, 3598, 68276, 304, 279, 4401, 315], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of artificial intelligence (AI) has been a long and complex process that has taken place over several decades. Here are the major milestones in the development of'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 11229, 320, 15469, 8, 323, 3738, 11229, 525, 1378, 12460, 18940, 429, 525, 3545, 1483, 51263, 2845, 11, 714, 807, 525, 537, 279, 1852, 3166, 13, 15235, 19257, 311, 279], 'Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial intelligence (AI) and human intelligence are two distinct concepts that are often used interchangeably, but they are not the same thing. AI refers to the'), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 374, 1483, 311, 2736, 9079, 1741, 438, 2168, 17843, 11, 5810, 4128, 8692, 11, 323, 8806, 17843, 13, 1084, 17167, 315], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that is used to perform tasks such as image recognition, natural language processing, and speech recognition. It consists of'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 431, 17, 9420, 17, 572, 264, 47394, 323, 47394, 12305, 429, 1030, 1012, 6188, 311, 387, 279], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. R2-D2 was a sleek and sleek robot that had been designed to be the'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 315, 279, 1376, 5510, 304, 892, 279, 27422, 702], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some of the key ways in which the pandemic has'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 11, 3465, 304, 279, 4124, 220, 16, 21, 339, 9294, 13, 1084, 374, 6509, 825, 315, 279, 1429, 11245, 35592, 304], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci, created in the early 16th century. It is considered one of the most famous paintings in'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 33, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 34, 25, 220, 99391, 86117, 15322], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA: 早番は早番に勝ちます。\nB: 早番は早番に勝ちます。\nC: 早番は")]
FAILED
tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] WARNING 04-29 08:11:51 config.py:2448] Casting torch.bfloat16 to torch.float16.
WARNING 04-29 08:12:06 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 04-29 08:12:06 config.py:549] This model supports multiple tasks: {'embed', 'score', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 04-29 08:12:06 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=16.
INFO 04-29 08:12:06 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:12:07 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfff5c21567d0>
INFO 04-29 08:12:07 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.55it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.55it/s]
INFO 04-29 08:12:08 model_runner.py:827] Loading model weights took 0.9241 GB
INFO 04-29 08:12:08 executor_base.py:111] # npu blocks: 294212, # CPU blocks: 21845
INFO 04-29 08:12:08 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 4597.06x
INFO 04-29 08:12:08 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 0.73 seconds
Processed prompts: 100%|██████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 5.88it/s, est. speed input: 113.12 toks/s, output: 188.03 toks/s]
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 6147, 3847, 311, 1598, 44378, 389, 3460, 12934, 4119, 304, 15279, 11, 1393, 1083, 8241, 11050, 2473, 9691, 13, 1084, 11554, 2176, 4237, 323, 2205, 11320, 11, 448, 279, 5726, 311], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 58194, 21392, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 6009, 448, 279, 4124, 975, 389, 5662, 6832, 25185, 304, 279, 5099, 12, 17, 15, 339], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of Artificial Intelligence (AI) has been a long and complex process that began with the early work on machine learning algorithms in the mid-20th'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 21392, 320, 15469, 8, 323, 11097, 21392, 320, 23913, 8, 525, 1378, 12460, 18940, 429, 7512, 2155, 13566, 315, 279, 8109, 594, 8692, 16928, 13, 5692, 374, 264, 12313, 323], "Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial Intelligence (AI) and Human Intelligence (HI) are two distinct concepts that describe different aspects of the brain's processing capabilities. Here is a comparison and"), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 17167, 315, 13617, 315, 82316, 7798, 476, 33213, 13, 576, 6770, 6813, 315, 264, 29728, 3922, 2924, 1447, 16, 13, 5571], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that consists of layers of interconnected nodes or neurons. The basic components of a neural network include:\n\n1. Input'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 1260, 572, 264, 47394, 323, 7988, 5662, 448, 264, 9906, 2518, 27263, 323, 264, 2613, 11, 4778], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. He was a sleek and powerful machine with a bright red exterior and a small, round'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 1376, 5510, 304, 892, 432, 702, 11495, 1493, 5671], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some key ways in which it has affected these areas'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 429, 702, 6427, 54686, 279, 1879, 369, 23631, 13, 1084, 572, 23983, 1948, 220, 16, 20, 15, 18, 323, 220, 16], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci that has captivated the world for centuries. It was painted between 1503 and 1'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 13, 576, 4124, 11958, 37834, 279, 34211, 304, 8453, 624, 33, 13, 220, 99391, 86117, 28195, 60726, 19655, 102176, 29412, 125232, 128687, 8997, 34, 13, 220, 99391, 86117, 28195, 60726, 19655], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA. The early bird catches the worm in Chinese.\nB. 早番が先に虫を食べられる。\nC. 早番が先に")]
****************************************************************************************************
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 374, 6188, 311, 387, 1483, 304, 264, 8045, 315, 8357, 11, 2670, 5810, 4128, 8692, 11, 6366, 11129, 11, 323, 8806, 17843, 13, 576, 4712, 374, 5798, 389, 1909, 315, 279], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 20443, 11229, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 702, 4429, 1992, 916, 3807, 10793, 13, 5692, 525, 279, 3598, 68276, 304, 279, 4401, 315], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of artificial intelligence (AI) has been a long and complex process that has taken place over several decades. Here are the major milestones in the development of'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 11229, 320, 15469, 8, 323, 3738, 11229, 525, 1378, 12460, 18940, 429, 525, 3545, 1483, 51263, 2845, 11, 714, 807, 525, 537, 279, 1852, 3166, 13, 15235, 19257, 311, 279], 'Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial intelligence (AI) and human intelligence are two distinct concepts that are often used interchangeably, but they are not the same thing. AI refers to the'), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 374, 1483, 311, 2736, 9079, 1741, 438, 2168, 17843, 11, 5810, 4128, 8692, 11, 323, 8806, 17843, 13, 1084, 17167, 315], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that is used to perform tasks such as image recognition, natural language processing, and speech recognition. It consists of'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 431, 17, 9420, 17, 572, 264, 47394, 323, 47394, 12305, 429, 1030, 1012, 6188, 311, 387, 279], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. R2-D2 was a sleek and sleek robot that had been designed to be the'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 315, 279, 1376, 5510, 304, 892, 279, 27422, 702], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some of the key ways in which the pandemic has'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 11, 3465, 304, 279, 4124, 220, 16, 21, 339, 9294, 13, 1084, 374, 6509, 825, 315, 279, 1429, 11245, 35592, 304], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci, created in the early 16th century. It is considered one of the most famous paintings in'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 33, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 34, 25, 220, 99391, 86117, 15322], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA: 早番は早番に勝ちます。\nB: 早番は早番に勝ちます。\nC: 早番は")]
FAILED
================================================================================= FAILURES =================================================================================
______________________________________ test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] ______________________________________
hf_runner = <class 'tests.conftest.HfRunner'>, vllm_runner = <class 'tests.conftest.VllmRunner'>
example_prompts = ['vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n', 'Briefly describe the majo...me.\n', 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\n', ...]
model = '/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', dtype = 'half', max_tokens = 32, chunked_prefill_token_size = 1, enforce_eager = True
tensor_parallel_size = 1
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("chunked_prefill_token_size", [1,
4, 16
])
@pytest.mark.parametrize("enforce_eager", [True])
# NOTE: Increasing this in this suite will fail CI because we currently cannot
# reset distributed env properly. Use a value > 1 just when you test.
@pytest.mark.parametrize("tensor_parallel_size", [1])
def test_models(
hf_runner,
vllm_runner,
example_prompts,
model: str,
dtype: str,
max_tokens: int,
chunked_prefill_token_size: int,
enforce_eager: bool,
tensor_parallel_size: int,
) -> None:
"""
Checks exact match decode between huggingface model and vllm runner with
chunked prefill.
"""
max_num_seqs = chunked_prefill_token_size
max_num_batched_tokens = chunked_prefill_token_size
with hf_runner(model, dtype=dtype) as hf_model:
hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)
with vllm_runner(
model,
dtype=dtype,
max_num_batched_tokens=max_num_batched_tokens,
enable_chunked_prefill=True,
tensor_parallel_size=tensor_parallel_size,
enforce_eager=enforce_eager,
max_num_seqs=max_num_seqs,
) as vllm_model:
vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)
print(hf_outputs)
print(100*"*")
print(vllm_outputs)
> check_outputs_equal(
outputs_0_lst=hf_outputs,
outputs_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
)
tests/test_chunk_prefill.py:70:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def check_outputs_equal(
*,
outputs_0_lst: Sequence[TokensText],
outputs_1_lst: Sequence[TokensText],
name_0: str,
name_1: str,
):
"""
Compare the two sequences generated by different models,
which should be equal.
"""
assert len(outputs_0_lst) == len(outputs_1_lst)
for prompt_idx, (outputs_0,
outputs_1) in enumerate(zip(outputs_0_lst,
outputs_1_lst)):
output_ids_0, output_str_0 = outputs_0
output_ids_1, output_str_1 = outputs_1
# The text and token outputs should exactly match
fail_msg = (f"Test{prompt_idx}:"
f"\n{name_0}:\t{output_str_0!r}"
f"\n{name_1}:\t{output_str_1!r}")
> assert output_str_0 == output_str_1, fail_msg
E AssertionError: Test0:
E hf: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'
E vllm: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'
tests/model_utils.py:55: AssertionError
______________________________________ test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] ______________________________________
hf_runner = <class 'tests.conftest.HfRunner'>, vllm_runner = <class 'tests.conftest.VllmRunner'>
example_prompts = ['vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n', 'Briefly describe the majo...me.\n', 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\n', ...]
model = '/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', dtype = 'half', max_tokens = 32, chunked_prefill_token_size = 4, enforce_eager = True
tensor_parallel_size = 1
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("chunked_prefill_token_size", [1,
4, 16
])
@pytest.mark.parametrize("enforce_eager", [True])
# NOTE: Increasing this in this suite will fail CI because we currently cannot
# reset distributed env properly. Use a value > 1 just when you test.
@pytest.mark.parametrize("tensor_parallel_size", [1])
def test_models(
hf_runner,
vllm_runner,
example_prompts,
model: str,
dtype: str,
max_tokens: int,
chunked_prefill_token_size: int,
enforce_eager: bool,
tensor_parallel_size: int,
) -> None:
"""
Checks exact match decode between huggingface model and vllm runner with
chunked prefill.
"""
max_num_seqs = chunked_prefill_token_size
max_num_batched_tokens = chunked_prefill_token_size
with hf_runner(model, dtype=dtype) as hf_model:
hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)
with vllm_runner(
model,
dtype=dtype,
max_num_batched_tokens=max_num_batched_tokens,
enable_chunked_prefill=True,
tensor_parallel_size=tensor_parallel_size,
enforce_eager=enforce_eager,
max_num_seqs=max_num_seqs,
) as vllm_model:
vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)
print(hf_outputs)
print(100*"*")
print(vllm_outputs)
> check_outputs_equal(
outputs_0_lst=hf_outputs,
outputs_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
)
tests/test_chunk_prefill.py:70:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def check_outputs_equal(
*,
outputs_0_lst: Sequence[TokensText],
outputs_1_lst: Sequence[TokensText],
name_0: str,
name_1: str,
):
"""
Compare the two sequences generated by different models,
which should be equal.
"""
assert len(outputs_0_lst) == len(outputs_1_lst)
for prompt_idx, (outputs_0,
outputs_1) in enumerate(zip(outputs_0_lst,
outputs_1_lst)):
output_ids_0, output_str_0 = outputs_0
output_ids_1, output_str_1 = outputs_1
# The text and token outputs should exactly match
fail_msg = (f"Test{prompt_idx}:"
f"\n{name_0}:\t{output_str_0!r}"
f"\n{name_1}:\t{output_str_1!r}")
> assert output_str_0 == output_str_1, fail_msg
E AssertionError: Test0:
E hf: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'
E vllm: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'
tests/model_utils.py:55: AssertionError
_____________________________________ test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] ______________________________________
hf_runner = <class 'tests.conftest.HfRunner'>, vllm_runner = <class 'tests.conftest.VllmRunner'>
example_prompts = ['vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n', 'Briefly describe the majo...me.\n', 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\n', ...]
model = '/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', dtype = 'half', max_tokens = 32, chunked_prefill_token_size = 16, enforce_eager = True
tensor_parallel_size = 1
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("chunked_prefill_token_size", [1,
4, 16
])
@pytest.mark.parametrize("enforce_eager", [True])
# NOTE: Increasing this in this suite will fail CI because we currently cannot
# reset distributed env properly. Use a value > 1 just when you test.
@pytest.mark.parametrize("tensor_parallel_size", [1])
def test_models(
hf_runner,
vllm_runner,
example_prompts,
model: str,
dtype: str,
max_tokens: int,
chunked_prefill_token_size: int,
enforce_eager: bool,
tensor_parallel_size: int,
) -> None:
"""
Checks exact match decode between huggingface model and vllm runner with
chunked prefill.
"""
max_num_seqs = chunked_prefill_token_size
max_num_batched_tokens = chunked_prefill_token_size
with hf_runner(model, dtype=dtype) as hf_model:
hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)
with vllm_runner(
model,
dtype=dtype,
max_num_batched_tokens=max_num_batched_tokens,
enable_chunked_prefill=True,
tensor_parallel_size=tensor_parallel_size,
enforce_eager=enforce_eager,
max_num_seqs=max_num_seqs,
) as vllm_model:
vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)
print(hf_outputs)
print(100*"*")
print(vllm_outputs)
> check_outputs_equal(
outputs_0_lst=hf_outputs,
outputs_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
)
tests/test_chunk_prefill.py:70:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def check_outputs_equal(
*,
outputs_0_lst: Sequence[TokensText],
outputs_1_lst: Sequence[TokensText],
name_0: str,
name_1: str,
):
"""
Compare the two sequences generated by different models,
which should be equal.
"""
assert len(outputs_0_lst) == len(outputs_1_lst)
for prompt_idx, (outputs_0,
outputs_1) in enumerate(zip(outputs_0_lst,
outputs_1_lst)):
output_ids_0, output_str_0 = outputs_0
output_ids_1, output_str_1 = outputs_1
# The text and token outputs should exactly match
fail_msg = (f"Test{prompt_idx}:"
f"\n{name_0}:\t{output_str_0!r}"
f"\n{name_1}:\t{output_str_1!r}")
> assert output_str_0 == output_str_1, fail_msg
E AssertionError: Test0:
E hf: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'
E vllm: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'
tests/model_utils.py:55: AssertionError
============================================================================= warnings summary =============================================================================
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:631: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:636: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:653: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `20` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
warnings.warn(
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================= short test summary info ==========================================================================
FAILED tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] - AssertionError: Test0:
FAILED tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] - AssertionError: Test0:
FAILED tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] - AssertionError: Test0:
================================================================ 3 failed, 11 warnings in 145.73s (0:02:25) ================================================================ (same with mindie-turbo) |
Speculative decode and MTP tests pass in local testing. There are some limits on spec decode and MTP on vllm-ascend; we will update them in the doc. |
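For reference, a minimal offline speculative-decoding sketch with the V0 engine is shown below, using ngram drafting; the model path, draft length, and lookup window are placeholder assumptions, not the configuration used in the verification above.

from vllm import LLM, SamplingParams

# Speculative-decoding sketch (assumed v0.7.3 V0 engine arguments; the model path,
# num_speculative_tokens and ngram_prompt_lookup_max values are placeholders).
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder target model
    speculative_model="[ngram]",        # draft via prompt n-gram lookup, no extra draft model
    num_speculative_tokens=5,           # draft tokens proposed per step
    ngram_prompt_lookup_max=4,          # max n-gram window for the lookup
    enforce_eager=True,
)

sampling_params = SamplingParams(temperature=0, max_tokens=64)
print(llm.generate(["The future of AI is"], sampling_params)[0].outputs[0].text)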
Automatic Prefix Caching test passed with nnal 8.1.RC1.

test script:

import time
from vllm import LLM, SamplingParams
# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
LONG_PROMPT = "You are a helpful assistant in recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + """
| ID | Name | Age | Occupation | Country | Email | Phone Number | Address |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1 | John Doe | 29 | Engineer | USA | john.doe@example.com | 555-1234 | 123 Elm St, Springfield, IL |
| 2 | Jane Smith | 34 | Doctor | Canada | jane.smith@example.com | 555-5678 | 456 Oak St, Toronto, ON |
| 3 | Alice Johnson | 27 | Teacher | UK | alice.j@example.com | 555-8765 | 789 Pine St, London, UK |
| 4 | Bob Brown | 45 | Artist | Australia | bob.b@example.com | 555-4321 | 321 Maple St, Sydney, NSW |
| 5 | Carol White | 31 | Scientist | New Zealand | carol.w@example.com | 555-6789 | 654 Birch St, Wellington, NZ |
| 6 | Dave Green | 28 | Lawyer | Ireland | dave.g@example.com | 555-3456 | 987 Cedar St, Dublin, IE |
| 7 | Emma Black | 40 | Musician | USA | emma.b@example.com | 555-1111 | 246 Ash St, New York, NY |
| 8 | Frank Blue | 37 | Chef | Canada | frank.b@example.com | 555-2222 | 135 Spruce St, Vancouver, BC |
| 9 | Grace Yellow | 50 | Engineer | UK | grace.y@example.com | 555-3333 | 864 Fir St, Manchester, UK |
| 10 | Henry Violet | 32 | Artist | Australia | henry.v@example.com | 555-4444 | 753 Willow St, Melbourne, VIC|
| 11 | Irene Orange | 26 | Scientist | New Zealand | irene.o@example.com | 555-5555 | 912 Poplar St, Auckland, NZ |
| 12 | Jack Indigo | 38 | Teacher | Ireland | jack.i@example.com | 555-6666 | 159 Elm St, Cork, IE |
| 13 | Karen Red | 41 | Lawyer | USA | karen.r@example.com | 555-7777 | 357 Cedar St, Boston, MA |
| 14 | Leo Brown | 30 | Chef | Canada | leo.b@example.com | 555-8888 | 246 Oak St, Calgary, AB |
| 15 | Mia Green | 33 | Musician | UK | mia.g@example.com | 555-9999 | 975 Pine St, Edinburgh, UK |
| 16 | Noah Yellow | 29 | Doctor | Australia | noah.y@example.com | 555-0000 | 864 Birch St, Brisbane, QLD |
| 17 | Olivia Blue | 35 | Engineer | New Zealand | olivia.b@example.com | 555-1212 | 753 Maple St, Hamilton, NZ |
| 18 | Peter Black | 42 | Artist | Ireland | peter.b@example.com | 555-3434 | 912 Fir St, Limerick, IE |
| 19 | Quinn White | 28 | Scientist | USA | quinn.w@example.com | 555-5656 | 159 Willow St, Seattle, WA |
| 20 | Rachel Red | 31 | Teacher | Canada | rachel.r@example.com | 555-7878 | 357 Poplar St, Ottawa, ON |
| 21 | Steve Green | 44 | Lawyer | UK | steve.g@example.com | 555-9090 | 753 Elm St, Birmingham, UK |
| 22 | Tina Blue | 36 | Musician | Australia | tina.b@example.com | 555-1213 | 864 Cedar St, Perth, WA |
| 23 | Umar Black | 39 | Chef | New Zealand | umar.b@example.com | 555-3435 | 975 Spruce St, Christchurch, NZ|
| 24 | Victor Yellow | 43 | Engineer | Ireland | victor.y@example.com | 555-5657 | 246 Willow St, Galway, IE |
| 25 | Wendy Orange | 27 | Artist | USA | wendy.o@example.com | 555-7879 | 135 Elm St, Denver, CO |
| 26 | Xavier Green | 34 | Scientist | Canada | xavier.g@example.com | 555-9091 | 357 Oak St, Montreal, QC |
| 27 | Yara Red | 41 | Teacher | UK | yara.r@example.com | 555-1214 | 975 Pine St, Leeds, UK |
| 28 | Zack Blue | 30 | Lawyer | Australia | zack.b@example.com | 555-3436 | 135 Birch St, Adelaide, SA |
| 29 | Amy White | 33 | Musician | New Zealand | amy.w@example.com | 555-5658 | 159 Maple St, Wellington, NZ |
| 30 | Ben Black | 38 | Chef | Ireland | ben.b@example.com | 555-7870 | 246 Fir St, Waterford, IE |
"""
def get_generation_time(llm, sampling_params, prompts):
# time the generation
start_time = time.time()
output = llm.generate(prompts, sampling_params=sampling_params)
end_time = time.time()
# print the output and generation time
print(f"Output: {output[0].outputs[0].text}")
print(f"Generation time: {end_time - start_time} seconds.")
# set enable_prefix_caching=True to enable APC
llm = LLM(
model='lmsys/longchat-13b-16k',
enable_prefix_caching=True
)
sampling_params = SamplingParams(temperature=0, max_tokens=100)
# Querying the age of John Doe
get_generation_time(
llm,
sampling_params,
LONG_PROMPT + "Question: what is the age of John Doe? Your answer: The age of John Doe is ",
)
# Querying the age of Zack Blue
# This query will be faster since vllm avoids computing the KV cache of LONG_PROMPT again.
get_generation_time(
llm,
sampling_params,
LONG_PROMPT + "Question: what is the age of Zack Blue? Your answer: The age of Zack Blue is ",
)

result:

root@8a75c4e375f8:/# python apc_demo.py
INFO 04-30 07:28:25 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 07:28:25 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 07:28:25 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 07:28:25 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:28:25 __init__.py:44] plugin ascend loaded.
INFO 04-30 07:28:25 __init__.py:198] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-30 07:28:25 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 07:28:25 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 07:28:25 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 07:28:25 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:28:25 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 07:28:25 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 07:28:25 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 07:28:25 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 07:28:25 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 07:28:25 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 07:28:25 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-30 07:28:37 config.py:549] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
INFO 04-30 07:28:38 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='lmsys/longchat-13b-16k', speculative_config=None, tokenizer='lmsys/longchat-13b-16k', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=lmsys/longchat-13b-16k, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
ERROR 04-30 07:28:39 camem.py:69] Failed to import vllm_ascend_C:No module named 'vllm_ascend.vllm_ascend_C'
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
WARNING 04-30 07:28:39 utils.py:2262] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd12f1d0c0>
INFO 04-30 07:28:45 model_runner.py:902] Starting to load model lmsys/longchat-13b-16k...
INFO 04-30 07:28:47 weight_utils.py:254] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading pt checkpoint shards: 33% Completed | 1/3 [00:08<00:16, 8.38s/it]
Loading pt checkpoint shards: 67% Completed | 2/3 [00:19<00:10, 10.23s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:30<00:00, 10.61s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:30<00:00, 10.33s/it]
INFO 04-30 07:29:18 model_runner.py:907] Loading model weights took 24.2871 GB
INFO 04-30 07:29:24 executor_base.py:111] # npu blocks: 283, # CPU blocks: 40
INFO 04-30 07:29:24 executor_base.py:116] Maximum concurrency for 16384 tokens per request: 2.21x
INFO 04-30 07:29:24 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 5.87 seconds
Processed prompts: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.29it/s, est. speed input: 4260.85 toks/s, output: 9.18 toks/s]
Output: 29.
Generation time: 0.4563312530517578 seconds.
Processed prompts: 100%|████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.95it/s, est. speed input: 9219.28 toks/s, output: 19.85 toks/s]
Output: 30.
Generation time: 0.2074754238128662 seconds. |
v0 test result

root@8a75c4e375f8:/workspace/vllm# vim examples/offline_inference/vision_language.py
root@8a75c4e375f8:/workspace/vllm# python examples/offline_inference/vision_language.py
INFO 04-30 07:42:04 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 07:42:04 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 07:42:04 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 07:42:04 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:42:04 __init__.py:44] plugin ascend loaded.
INFO 04-30 07:42:04 __init__.py:198] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-30 07:42:04 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 07:42:04 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 07:42:04 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 07:42:04 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:42:04 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 07:42:04 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 07:42:04 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 07:42:04 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 07:42:04 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 07:42:04 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 07:42:04 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-30 07:42:15 config.py:549] This model supports multiple tasks: {'reward', 'embed', 'generate', 'score', 'classify'}. Defaulting to 'generate'.
INFO 04-30 07:42:16 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct', speculative_config=None, tokenizer='/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs={'min_pixels': 784, 'max_pixels': 1003520, 'fps': 1}, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[8,4,2,1],"max_capture_size":8}, use_cached_outputs=False,
ERROR 04-30 07:42:17 camem.py:69] Failed to import vllm_ascend_C:No module named 'vllm_ascend.vllm_ascend_C'
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
WARNING 04-30 07:42:17 utils.py:2262] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd2a4bb010>
INFO 04-30 07:42:23 model_runner.py:902] Starting to load model /root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct...
WARNING 04-30 07:42:23 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 04-30 07:42:23 config.py:3054] cudagraph sizes specified by model runner [1, 2, 4, 8] is overridden by config [8, 1, 2, 4]
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:03<00:03, 3.35s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:07<00:00, 3.94s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:07<00:00, 3.85s/it]
INFO 04-30 07:42:32 model_runner.py:907] Loading model weights took 9.8590 GB
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
UserWorkspaceSize0
INFO 04-30 07:42:40 executor_base.py:111] # npu blocks: 10204, # CPU blocks: 910
INFO 04-30 07:42:40 executor_base.py:116] Maximum concurrency for 4096 tokens per request: 318.88x
INFO 04-30 07:42:41 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 8.60 seconds
WARNING 04-30 07:42:44 utils.py:1445] The following intended overrides are not keyword-only args and and will be dropped: {'fps', 'min_pixels', 'max_pixels'}
WARNING 04-30 07:42:44 utils.py:1445] The following intended overrides are not keyword-only args and and will be dropped: {'fps', 'min_pixels', 'max_pixels'}
WARNING 04-30 07:42:44 utils.py:1445] The following intended overrides are not keyword-only args and and will be dropped: {'fps', 'min_pixels', 'max_pixels'}
WARNING 04-30 07:42:44 utils.py:1445] The following intended overrides are not keyword-only args and and will be dropped: {'fps', 'min_pixels', 'max_pixels'}
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]UserWorkspaceSize0
UserWorkspaceSize0
Processed prompts: 25%|█████████▎ | 1/4 [00:03<00:09, 3.30s/it, est. speed input: 386.64 toks/s, output: 18.48 toks/s]UserWorkspaceSize0
Processed prompts: 100%|████████████████████████████████████| 4/4 [00:03<00:00, 1.17it/s, est. speed input: 1492.89 toks/s, output: 74.00 toks/s]
The image depicts a stunning view of the Tokyo Skytree, a tall broadcasting tower located in the Sumida Ward of Tokyo, Japan. The photo is taken from a low angle, looking up towards the tower, which is surrounded by cherry blossom trees in full bloom. The cherry blossoms are in full bloom, with pink
The image depicts the Tokyo Skytree, a tall broadcasting tower located in Sumida, Tokyo, Japan. The photo is taken during cherry blossom season, with pink cherry blossoms framing the tower against a clear blue sky. The cherry blossoms are in full bloom, creating a beautiful and serene atmosphere.
The image depicts a tall, cylindrical tower surrounded by cherry blossom trees. The cherry blossoms are in full bloom, with pink flowers covering the branches. The sky is clear and blue, creating a vibrant and picturesque scene. The tower appears to be a significant landmark, possibly a television tower or a similar structure, given its
The image depicts a tall, cylindrical tower with a lattice-like structure, surrounded by cherry blossom trees in full bloom. The cherry blossoms are pink and create a beautiful contrast against the clear blue sky. The tower appears to be a significant landmark, possibly a television tower or a similar structure, given its height and design.

v1 result

root@8a75c4e375f8:/workspace/vllm# VLLM_USE_V1=1 python examples/offline_inference/vision_language.py
INFO 04-30 07:45:43 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 07:45:43 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 07:45:43 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 07:45:43 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:45:43 __init__.py:44] plugin ascend loaded.
INFO 04-30 07:45:43 __init__.py:198] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-30 07:45:45 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 07:45:45 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 07:45:45 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 07:45:45 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:45:45 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 07:45:45 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 07:45:45 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 07:45:45 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 07:45:45 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 07:45:45 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 07:45:45 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-30 07:45:45 arg_utils.py:1385] Setting max_num_batched_tokens to 8192 for LLM_CLASS usage context.
INFO 04-30 07:45:56 config.py:549] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 04-30 07:45:56 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-30 07:45:56 platform.py:110] Compilation level 3 is not supported on NPU now, forcing compilation level to NO_COMPILATION
WARNING 04-30 07:45:56 platform.py:142] Prefix caching is now supported for V1 on NPU, but it is still experimental and there may be issues with accuracy.
INFO 04-30 07:45:57 core.py:50] Initializing a V1 LLM engine (v0.7.3) with config: model='/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct', speculative_config=None, tokenizer='/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs={'min_pixels': 784, 'max_pixels': 1003520, 'fps': 1}, pooler_config=None, compilation_config={"level":0,"custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
ERROR 04-30 07:45:57 camem.py:69] Failed to import vllm_ascend_C:No module named 'vllm_ascend.vllm_ascend_C'
WARNING 04-30 07:45:57 utils.py:2262] Methods add_lora,cache_config,determine_available_memory,determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm_ascend.worker.worker_v1.NPUWorker object at 0xfffd3f81dcc0>
WARNING 04-30 07:46:04 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 04-30 07:46:05 model_runner_v1.py:810] Starting to load model /root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct...
INFO 04-30 07:46:05 config.py:3054] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
WARNING 04-30 07:46:05 platform.py:110] Compilation level 3 is not supported on NPU now, forcing compilation level to NO_COMPILATION
WARNING 04-30 07:46:05 platform.py:142] Prefix caching is now supported for V1 on NPU, but it is still experimental and there may be issues with accuracy.
WARNING 04-30 07:46:05 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 04-30 07:46:05 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 2.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.32it/s]
WARNING 04-30 07:46:07 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 04-30 07:46:07 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
WARNING 04-30 07:46:07 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 04-30 07:46:07 model_runner_v1.py:820] Loading model weights took 9.9076 GB
INFO 04-30 07:46:07 model_runner_v1.py:654] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 4 video items of the maximum feature size.
It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
ERROR 04-30 07:46:19 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/engine/core.py", line 283, in run_engine_core
ERROR 04-30 07:46:19 core.py:291] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/engine/core.py", line 238, in __init__
ERROR 04-30 07:46:19 core.py:291] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/engine/core.py", line 59, in __init__
ERROR 04-30 07:46:19 core.py:291] num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/engine/core.py", line 99, in _initialize_kv_caches
ERROR 04-30 07:46:19 core.py:291] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/executor/abstract.py", line 61, in determine_available_memory
ERROR 04-30 07:46:19 core.py:291] output = self.collective_rpc("determine_available_memory")
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-30 07:46:19 core.py:291] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/utils.py", line 2196, in run_method
ERROR 04-30 07:46:19 core.py:291] return func(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 192, in determine_available_memory
ERROR 04-30 07:46:19 core.py:291] self.model_runner.profile_run()
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 742, in profile_run
ERROR 04-30 07:46:19 core.py:291] self._profile_multimodal()
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 686, in _profile_multimodal
ERROR 04-30 07:46:19 core.py:291] dummy_encoder_outputs = self.model.get_multimodal_embeddings(
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 969, in get_multimodal_embeddings
ERROR 04-30 07:46:19 core.py:291] video_embeddings = self._process_video_input(video_input)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 925, in _process_video_input
ERROR 04-30 07:46:19 core.py:291] video_embeds = self.visual(pixel_values_videos, grid_thw=grid_thw)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-30 07:46:19 core.py:291] return self._call_impl(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-30 07:46:19 core.py:291] return forward_call(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/models/qwen2_5_vl.py", line 344, in forward
ERROR 04-30 07:46:19 core.py:291] x = blk(x, cu_seqlens=cu_seqlens_now, cos=cos, sin=sin)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-30 07:46:19 core.py:291] return self._call_impl(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-30 07:46:19 core.py:291] return forward_call(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/models/qwen2_5_vl.py", line 143, in forward
ERROR 04-30 07:46:19 core.py:291] x = x + self.mlp(self.norm2(x))
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-30 07:46:19 core.py:291] return self._call_impl(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-30 07:46:19 core.py:291] return forward_call(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 194, in forward
ERROR 04-30 07:46:19 core.py:291] x_down, _ = self.down_proj(x_gate * x_up)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-30 07:46:19 core.py:291] return self._call_impl(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-30 07:46:19 core.py:291] return forward_call(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/layers/linear.py", line 1149, in forward
ERROR 04-30 07:46:19 core.py:291] output_parallel = self.quant_method.apply(self,
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/layers/linear.py", line 142, in apply
ERROR 04-30 07:46:19 core.py:291] return F.linear(x, layer.weight, bias)
ERROR 04-30 07:46:19 core.py:291] RuntimeError: NPU out of memory. Tried to allocate 670.00 MiB (NPU 0; 60.97 GiB total capacity; 52.11 GiB already allocated; 52.11 GiB current active; 681.14 MiB free; 59.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
ERROR 04-30 07:46:19 core.py:291]
CRITICAL 04-30 07:46:19 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed |
Compilation and sleep mode run normally on V0 with nnal 8.1.RC1. |
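A minimal sleep-mode sketch is shown below, assuming the v0.7.3 offline API (`enable_sleep_mode`, `llm.sleep()`, `llm.wake_up()`); the model path is a placeholder, not the model used in this verification.

from vllm import LLM, SamplingParams

# Sleep-mode sketch (assumption: v0.7.3 offline API; model path is a placeholder).
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
sampling_params = SamplingParams(temperature=0, max_tokens=32)

print(llm.generate(["vLLM is"], sampling_params)[0].outputs[0].text)

# Level-1 sleep offloads model weights and discards the KV cache to free device memory;
# wake_up() restores the engine so generation can continue afterwards.
llm.sleep(level=1)
llm.wake_up()
print(llm.generate(["vLLM is"], sampling_params)[0].outputs[0].text)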
DeepSeek-V2-Lite pass with V0 Engine (atb)

(base) xxx@xxx-docker:~/code/vllm-ascend$ python examples/offline_inference_npu.py
INFO 04-30 09:21:13 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 09:21:13 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 09:21:13 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 09:21:13 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:21:13 __init__.py:44] plugin ascend loaded.
INFO 04-30 09:21:13 __init__.py:198] Platform plugin ascend is activated
INFO 04-30 09:21:13 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 09:21:13 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 09:21:13 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 09:21:13 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:21:13 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 09:21:13 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 09:21:13 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 09:21:13 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 09:21:13 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 09:21:13 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 09:21:13 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-30 09:21:13 config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 04-30 09:21:26 config.py:549] This model supports multiple tasks: {'score', 'reward', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 04-30 09:21:26 config.py:3329] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 04-30 09:21:26 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
WARNING 04-30 09:21:28 utils.py:2262] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd304d05e0>
INFO 04-30 09:21:29 model_runner.py:902] Starting to load model /home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:04, 1.43s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:03<00:03, 1.66s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:04<00:01, 1.49s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00, 1.54s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00, 1.54s/it]
INFO 04-30 09:21:37 model_runner.py:907] Loading model weights took 29.3007 GB
[rank0]:[W430 09:21:46.506578722 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
INFO 04-30 09:21:47 executor_base.py:111] # npu blocks: 6238, # CPU blocks: 1078
INFO 04-30 09:21:47 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 779.75x
INFO 04-30 09:21:48 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 11.06 seconds
Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:08<00:00, 2.05s/it, est. speed input: 3.17 toks/s, output: 45.78 toks/s]
Prompt: 'Hello, my name is', Generated text: '***** am a computer expert. My goal is to provide you with the best experience possible.\nlingerie.com is a website that sells lingerie. It is a legitimate business.\nIf you are having trouble with the website, please provide more information about the issue you are experiencing.\nI hope this helps! Let me know if you have any other questions.'
Prompt: 'The president of the United States is', Generated text: ' the head of state and the head of government of the United States. The president leads the executive branch of the federal government and is the commander-in-chief of the armed forces. The president is also an ex officio member of the U.S. Senate, but has no vote, except in the case of a tie.\n\nThe president is directly elected every four years by the people of the United States through the United States Electoral College. The current president is Joe Biden, who took'
Prompt: 'The capital of France is', Generated text: ' Paris.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France'
Prompt: 'The future of AI is', Generated text: ' bright, and it’s going to be a game-changer in the world of business. AI is already being used in a variety of ways, from automating tasks to providing insights and recommendations. As AI technology continues to evolve, it will become even more integrated into our daily lives and businesses.\n\nIn the business world, AI can be used to automate routine tasks, freeing up time for employees to focus on more important tasks. It can also be used to analyze data and provide insights that'
DeepSeek-v2-lite failed with V1Engine(atb)
(base) xxx@xxx-docker:~/code/vllm-ascend$ VLLM_USE_V1=1 python examples/offline_inference_npu.py
INFO 04-30 09:26:01 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 09:26:01 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 09:26:01 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 09:26:01 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:26:01 __init__.py:44] plugin ascend loaded.
INFO 04-30 09:26:01 __init__.py:198] Platform plugin ascend is activated
INFO 04-30 09:26:02 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 09:26:02 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 09:26:02 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 09:26:02 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:26:02 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 09:26:02 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 09:26:02 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 09:26:02 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 09:26:02 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 09:26:02 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 09:26:02 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-30 09:26:02 arg_utils.py:1385] Setting max_num_batched_tokens to 8192 for LLM_CLASS usage context.
INFO 04-30 09:26:02 config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 04-30 09:26:15 config.py:549] This model supports multiple tasks: {'score', 'embed', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 04-30 09:26:15 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-30 09:26:15 platform.py:110] Compilation level 3 is not supported on NPU now, forcing compilation level to NO_COMPILATION
WARNING 04-30 09:26:15 platform.py:142] Prefix caching is now supported for V1 on NPU, but it is still experimental and there may be issues with accuracy.
INFO 04-30 09:26:15 config.py:3329] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 04-30 09:26:16 core.py:50] Initializing a V1 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-30 09:26:16 utils.py:2262] Methods add_lora,cache_config,determine_available_memory,determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm_ascend.worker.worker_v1.NPUWorker object at 0xfffd1bd258d0>
INFO 04-30 09:26:18 model_runner_v1.py:810] Starting to load model /home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat...
ERROR 04-30 09:26:18 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/v1/engine/core.py", line 283, in run_engine_core
ERROR 04-30 09:26:18 core.py:291] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/v1/engine/core.py", line 238, in __init__
ERROR 04-30 09:26:18 core.py:291] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/v1/engine/core.py", line 56, in __init__
ERROR 04-30 09:26:18 core.py:291] self.model_executor = executor_class(vllm_config)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-30 09:26:18 core.py:291] self._init_executor()
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 04-30 09:26:18 core.py:291] self.collective_rpc("load_model")
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-30 09:26:18 core.py:291] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/utils.py", line 2196, in run_method
ERROR 04-30 09:26:18 core.py:291] return func(*args, **kwargs)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 235, in load_model
ERROR 04-30 09:26:18 core.py:291] self.model_runner.load_model()
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 813, in load_model
ERROR 04-30 09:26:18 core.py:291] self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-30 09:26:18 core.py:291] return loader.load_model(vllm_config=vllm_config)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/model_loader/loader.py", line 406, in load_model
ERROR 04-30 09:26:18 core.py:291] model = _initialize_model(vllm_config=vllm_config)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model
ERROR 04-30 09:26:18 core.py:291] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/models/deepseek_v2.py", line 271, in __init__
ERROR 04-30 09:26:18 core.py:291] self.model = CustomDeepseekV2Model(vllm_config=vllm_config,
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/models/deepseek_v2.py", line 199, in __init__
ERROR 04-30 09:26:18 core.py:291] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
ERROR 04-30 09:26:18 core.py:291] [PPMissingLayer() for _ in range(start_layer)] + [
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/models/utils.py", line 558, in <listcomp>
ERROR 04-30 09:26:18 core.py:291] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/models/deepseek_v2.py", line 201, in <lambda>
ERROR 04-30 09:26:18 core.py:291] lambda prefix: CustomDeepseekV2DecoderLayer(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/models/deepseek_v2.py", line 135, in __init__
ERROR 04-30 09:26:18 core.py:291] self.self_attn = attn_cls(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/models/deepseek_v2.py", line 417, in __init__
ERROR 04-30 09:26:18 core.py:291] self.rotary_emb = get_rope(qk_rope_head_dim,
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 1099, in get_rope
ERROR 04-30 09:26:18 core.py:291] rotary_emb = DeepseekScalingRotaryEmbedding(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 649, in __init__
ERROR 04-30 09:26:18 core.py:291] super().__init__(head_size, rotary_dim, max_position_embeddings, base,
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 98, in __init__
ERROR 04-30 09:26:18 core.py:291] cache = self._compute_cos_sin_cache()
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 671, in _compute_cos_sin_cache
ERROR 04-30 09:26:18 core.py:291] inv_freq = self._compute_inv_freq(self.scaling_factor)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 653, in _compute_inv_freq
ERROR 04-30 09:26:18 core.py:291] pos_freqs = self.base**(torch.arange(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch-2.5.1-py3.10-linux-aarch64.egg/torch/utils/_device.py", line 106, in __torch_function__
ERROR 04-30 09:26:18 core.py:291] return func(*args, **kwargs)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch-2.5.1-py3.10-linux-aarch64.egg/torch/cuda/__init__.py", line 310, in _lazy_init
ERROR 04-30 09:26:18 core.py:291] raise AssertionError("Torch not compiled with CUDA enabled")
ERROR 04-30 09:26:18 core.py:291] AssertionError: Torch not compiled with CUDA enabled
ERROR 04-30 09:26:18 core.py:291]
CRITICAL 04-30 09:26:18 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed |
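The V1 failure above does not come from DeepSeek-specific model code: the traceback dies in `DeepseekScalingRotaryEmbedding._compute_inv_freq`, which apparently builds its frequency tensor on a hardcoded "cuda" device; the `transfer_to_npu` patching that covers this on the V0 path is not in effect in the V1 worker. A minimal sketch of the failure mode and a device-agnostic alternative (the helper name below is chosen only for illustration):

```python
import torch

# On an NPU-only (CUDA-less) build, this is the line that raises
# "AssertionError: Torch not compiled with CUDA enabled":
#   torch.arange(0, 64, 2, dtype=torch.float32, device="cuda")

def compute_inv_freq(base: float, rotary_dim: int, device: torch.device) -> torch.Tensor:
    """Illustrative device-agnostic version: take the device from the caller
    (e.g. the current platform) instead of hardcoding "cuda"."""
    exponents = torch.arange(0, rotary_dim, 2, dtype=torch.float32, device=device)
    return 1.0 / (base ** (exponents / rotary_dim))

print(compute_inv_freq(10000.0, 64, torch.device("cpu")))
```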
The same failure occurs with mindie-turbo.
Qwen/Qwen2.5-7B-Instruct pass with V0Engine(atb)
(base) xxx@xxx-docker:~/code/vllm-ascend$ python examples/offline_inference_npu.py
INFO 04-30 09:33:41 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 09:33:41 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 09:33:41 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 09:33:41 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:33:41 __init__.py:44] plugin ascend loaded.
INFO 04-30 09:33:41 __init__.py:198] Platform plugin ascend is activated
INFO 04-30 09:33:41 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 09:33:41 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 09:33:41 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 09:33:41 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:33:41 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 09:33:41 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 09:33:41 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 09:33:41 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 09:33:41 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 09:33:41 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 09:33:42 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-30 09:33:55 config.py:549] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 04-30 09:33:55 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
WARNING 04-30 09:33:56 utils.py:2262] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd0241eb30>
INFO 04-30 09:33:58 model_runner.py:902] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:03<00:11, 3.74s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:07<00:07, 3.87s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:11<00:03, 3.79s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:15<00:00, 3.85s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:15<00:00, 3.83s/it]
INFO 04-30 09:34:15 model_runner.py:907] Loading model weights took 14.2488 GB
INFO 04-30 09:34:23 executor_base.py:111] # npu blocks: 4988, # CPU blocks: 585
INFO 04-30 09:34:23 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 19.48x
INFO 04-30 09:34:23 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 8.85 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.21it/s, est. speed input: 6.65 toks/s, output: 120.91 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Dr. David M. Kline, and I am a board-certified orthopedic surgeon. I am a member of the American Academy of Orthopedic Surgeons, the American Association of Hip and Knee Surgeons, and the American Association of Arthroscopy and Sports Medicine. I am also a member of the American College of Surgeons.\nI am a native of the San Francisco Bay Area and received my undergraduate degree from the University of California, Berkeley. I received my medical degree from the'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces. The president is further empowered to appoint federal judges, including members of the Supreme Court, subject to Senate approval. The president is also responsible for the enforcement of federal law and may grant federal pardons and reprieves. The president is further empowered to make treaties, subject to Senate ratification, and to receive foreign ambassadors'
Prompt: 'The capital of France is', Generated text: " Paris. Which of the following statements is true?\nA. Paris is the capital of France.\nB. Paris is not the capital of France.\nC. Paris is the capital of Germany.\nD. Paris is the capital of Italy.\nTo determine which statement is true, let's analyze each option step by step:\n\nA. Paris is the capital of France.\n- This statement is true. Paris is indeed the capital of France.\n\nB. Paris is not the capital of France.\n- This statement is"
Prompt: 'The future of AI is', Generated text: ' here. It’s not just a buzzword or a concept anymore. It’s a reality that’s transforming the way we live, work, and interact with technology. From self-driving cars to virtual assistants, AI is becoming an integral part of our daily lives. But what exactly is AI, and how is it changing the world? In this article, we’ll explore the basics of AI, its applications, and its impact on society.\nWhat is AI?\nArtificial Intelligence (AI) is a branch'
Qwen/Qwen2.5-7B-Instruct pass with V1Engine(atb)
(base) xxx@xxx-docker:~/code/vllm-ascend$ VLLM_USE_V1=1 python examples/offline_inference_npu.py
INFO 04-30 09:36:29 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 09:36:29 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 09:36:29 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 09:36:29 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:36:29 __init__.py:44] plugin ascend loaded.
INFO 04-30 09:36:29 __init__.py:198] Platform plugin ascend is activated
INFO 04-30 09:36:30 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 09:36:30 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 09:36:30 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 09:36:30 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:36:30 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 09:36:30 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 09:36:30 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 09:36:30 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 09:36:30 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 09:36:30 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 09:36:30 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-30 09:36:30 arg_utils.py:1385] Setting max_num_batched_tokens to 8192 for LLM_CLASS usage context.
INFO 04-30 09:36:44 config.py:549] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
INFO 04-30 09:36:44 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-30 09:36:44 platform.py:110] Compilation level 3 is not supported on NPU now, forcing compilation level to NO_COMPILATION
WARNING 04-30 09:36:44 platform.py:142] Prefix caching is now supported for V1 on NPU, but it is still experimental and there may be issues with accuracy.
INFO 04-30 09:36:44 core.py:50] Initializing a V1 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-30 09:36:45 utils.py:2262] Methods add_lora,cache_config,determine_available_memory,determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm_ascend.worker.worker_v1.NPUWorker object at 0xfffd2ac6dea0>
INFO 04-30 09:36:47 model_runner_v1.py:810] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct...
WARNING 04-30 09:36:47 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 04-30 09:36:47 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 1.64it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.36it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.36it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.35it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.37it/s]
WARNING 04-30 09:36:50 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 04-30 09:36:50 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
WARNING 04-30 09:36:50 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 04-30 09:36:51 model_runner_v1.py:820] Loading model weights took 14.2488 GB
INFO 04-30 09:36:58 worker_v1.py:212] Available memory: 41658666188.8, total memory: 65464696832
INFO 04-30 09:36:58 kv_cache_utils.py:522] # GPU blocks: 5675
INFO 04-30 09:36:58 kv_cache_utils.py:525] Maximum concurrency for 32768 tokens per request: 22.17x
WARNING 04-30 09:36:58 worker_v1.py:239] Graph capture is not supported on NPU.
INFO 04-30 09:36:58 core.py:116] init engine (profile, create kv cache, warmup model) took 7.07 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.25it/s, est. speed input: 6.87 toks/s, output: 124.98 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Dr. David M. Kline, and I am a board-certified orthopedic surgeon. I am a member of the American Academy of Orthopedic Surgeons, the American Association of Hip and Knee Surgeons, and the American Association of Arthroscopy and Sports Medicine. I am also a member of the American College of Surgeons.\nI am a native of the San Francisco Bay Area and received my undergraduate degree from the University of California, Berkeley. I received my medical degree from the'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces. The president is further empowered to appoint federal judges, including members of the Supreme Court, subject to Senate approval. The president is also responsible for the enforcement of federal law and may grant federal pardons and reprieves. The president is further empowered to make treaties, subject to Senate ratification, and to receive foreign ambassadors'
Prompt: 'The capital of France is', Generated text: " Paris. Which of the following statements is true?\nA. Paris is the capital of France.\nB. Paris is not the capital of France.\nC. Paris is the capital of Germany.\nD. Paris is the capital of Italy.\nTo determine which statement is true, let's analyze each option step by step:\n\nA. Paris is the capital of France.\n- This statement is true. Paris is indeed the capital of France.\n\nB. Paris is not the capital of France.\n- This statement is"
Prompt: 'The future of AI is', Generated text: ' here. It’s not just a buzzword or a concept anymore. It’s a reality that’s transforming the way we live, work, and interact with technology. From self-driving cars to virtual assistants, AI is becoming an integral part of our daily lives. But what exactly is AI, and how is it changing the world? In this article, we’ll explore the basics of AI, its applications, and its impact on society.\nWhat is AI?\nArtificial Intelligence (AI) is a branch' |
Thanks for the work on the 0.7.3 release. Let's close this issue now. |
This issue tracks the checklist for the official v0.7.3 release.
Code development
[v0.7.3][Build] Upgrade torch-npu to 2.5.1 #662
[0.7.3] Optimize apply_penalties & topKtopP for both V0/V1 Engine #525 @linfeng-yuan
[Doc] Update v0.7.3 faqs #695
[ModelRunnerV1] Adapt kv_cache quant in v1. #685
[Misc] Add v0.7.3 benchmark #678
[0.7.3] optimize qwen2_vl and qwen2_5_vl #702
Add LoRA & Multi-LoRA support for V0.7.3 dev by Cherry Pick #700
[Doc] Add release note for 0.7.3 #735
[0.7.3] patch from_seq_group to clear finished seq in seq_id_to_seq_group #691
Document enhancement
Installation @MengqingCao
[Build][0.7.3] Integrate MindIE Turbo into vLLM Ascend #708
User Guide
[Build][0.7.3] Integrate MindIE Turbo into vLLM Ascend #708
[Doc] Update v0.7.3 faqs #695
[v0.7.3][Doc] Add notes for OOM in FAQs (#786) #795
[Build][0.7.3] Integrate MindIE Turbo into vLLM Ascend #708
Add an index page once the report exists.
Developer Guide
Function and Model Test
If a feature's usage differs from its original usage in vllm, we need to add one for vllm-ascend[mindie-turbo].
Relies on CANN 8.1 nnal.
[Guide]: Sleep mode feature guide #733
[Build][0.7.3] Integrate MindIE Turbo into vLLM Ascend #708
Release artifacts @wangxiyuan
Need to generate the report by hand.