[Release]: vLLM Ascend v0.7.3 release checklist #644

Comments
Also update the feature support doc, like #650; for v0.7.3 we should add:

@ZhengJun9 Please help cherry-pick the LoRA support PR. Thanks.
- Guided Decoding:
- V1 Engine:
- Distribution:
Pooling model tests pass with `test_scoring` and `test_embedding` on V0; a minimal sketch of the corresponding offline API calls follows.
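As a reference for what those tests cover, here is a minimal sketch of scoring and embedding calls through the offline `LLM` API. The model names below are placeholders, not the models used in the actual tests:

```python
from vllm import LLM

# Scoring (reranker-style) task; the model name here is only an example.
score_llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
scores = score_llm.score(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Berlin is the capital of Germany."],
)
print([out.outputs.score for out in scores])

# Embedding task; again, the model name is only an example.
embed_llm = LLM(model="BAAI/bge-m3", task="embed")
embeddings = embed_llm.embed(["Hello, my name is", "The capital of France is"])
print([len(out.outputs.embedding) for out in embeddings])
```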
### What this PR does / why we need it?
According to this [RFC](#396) and [this](#448), we submit the relevant code to support (1) Multi-LoRA and (2) Multi-LoRA dynamic serving. LoRA reference: [LoRA reference](https://docs.vllm.ai/en/latest/features/lora.html)

### Does this PR introduce _any_ user-facing change?
The following OpenAI HTTP APIs will be supported:
- /v1/load_lora_adapter
- /v1/unload_lora_adapter

### How was this patch tested?
git clone https://github.com/vllm-project/vllm.git
cd vllm/examples/offline_inference/ && python3 multilora_inference.py

> [[Release]: vLLM Ascend v0.7.3 release checklist](#644 (comment))

---------

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu12 <507435917@qq.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
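For reference, a rough sketch of how those two endpoints can be exercised once the server is started with LoRA enabled and runtime LoRA updates allowed (e.g. `--enable-lora` and `VLLM_ALLOW_RUNTIME_LORA_UPDATING=True`). The base URL, adapter name, and adapter path below are placeholders, not values from this PR:

```python
import requests

BASE = "http://localhost:8000"  # placeholder: a locally running vllm serve instance

# Load an adapter at runtime; "sql-lora" and the path are placeholders.
resp = requests.post(
    f"{BASE}/v1/load_lora_adapter",
    json={"lora_name": "sql-lora", "lora_path": "/path/to/sql-lora-adapter"},
)
print(resp.status_code, resp.text)

# The loaded adapter can then be targeted by name via the "model" field of
# /v1/completions or /v1/chat/completions requests.

# Unload it again when no longer needed.
resp = requests.post(f"{BASE}/v1/unload_lora_adapter", json={"lora_name": "sql-lora"})
print(resp.status_code, resp.text)
```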
Multi-step test passes with `tests/singlecard/multi_step/test_correctness_llm.py`; a minimal reproduction sketch follows the log.

/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /home/xxx/miniconda3/envs/atb/bin/python
cachedir: .pytest_cache
rootdir: /home/xxx/code/vllm-ascend
configfile: pytest.ini
plugins: shard-0.1.2, markdown-docs-0.9.0, rerunfailures-15.0, md-0.2.0, asyncio-0.25.3, anyio-4.8.0, mock-3.14.0, forked-1.6.0, typeguard-4.3.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 1 item
Running 1 items in this shard: tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] INFO 04-29 08:04:33 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-29 08:04:33 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-29 08:04:33 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-29 08:04:33 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-29 08:04:33 __init__.py:44] plugin ascend loaded.
INFO 04-29 08:04:33 __init__.py:198] Platform plugin ascend is activated
INFO 04-29 08:04:34 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-29 08:04:34 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-29 08:04:34 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-29 08:04:34 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-29 08:04:34 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-29 08:04:34 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-29 08:04:34 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-29 08:04:34 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-29 08:04:34 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-29 08:04:34 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-29 08:04:34 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-29 08:04:47 config.py:549] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
INFO 04-29 08:04:47 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=8, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:04:48 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.multi_step_worker.MultiStepWorker object at 0xfffd2df56c50>
WARNING 04-29 08:04:48 registry.py:335] `mm_limits` has already been set for model=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, and will be overwritten by the new values.
INFO 04-29 08:04:50 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.54it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.53it/s]
INFO 04-29 08:04:51 model_runner.py:827] Loading model weights took 0.9277 GB
INFO 04-29 08:04:56 executor_base.py:111] # npu blocks: 215234, # CPU blocks: 21845
INFO 04-29 08:04:56 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 3363.03x
INFO 04-29 08:04:56 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 5.29 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 38.20it/s, est. speed input: 760.49 toks/s, output: 191.06 toks/s]
INFO 04-29 08:05:13 config.py:549] This model supports multiple tasks: {'generate', 'embed', 'classify', 'reward', 'score'}. Defaulting to 'generate'.
INFO 04-29 08:05:13 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:05:14 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffa0403e530>
INFO 04-29 08:05:14 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 4.50it/s]
INFO 04-29 08:05:14 model_runner.py:827] Loading model weights took 0.9232 GB
INFO 04-29 08:05:15 executor_base.py:111] # npu blocks: 217132, # CPU blocks: 21845
INFO 04-29 08:05:15 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 3392.69x
INFO 04-29 08:05:15 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 0.87 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 41.00it/s, est. speed input: 816.15 toks/s, output: 205.05 toks/s]
PASSED
============================================================================= warnings summary =============================================================================
tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
tests/singlecard/multi_step/test_correctness_llm.py::test_multi_step_llm_w_prompt_logprobs[5-5-10-8-True-5-1-bfloat16-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/code/vllm-cpu/vllm/vllm/executor/uniproc_executor.py:29: ResourceWarning: unclosed <socket.socket fd=14, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('172.20.0.2', 54388), raddr=('8.8.8.8', 80)>
get_ip(), get_open_port())
Enable tracemalloc to get traceback where the object was allocated.
See https://docs.pytest.org/en/stable/how-to/capture-warnings.html#resource-warnings for more info.
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
====================================================================== 1 passed, 3 warnings in 56.76s ======================================================================

(Same with mindie-turbo.)
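For completeness, a minimal sketch of the multi-step configuration exercised by the log above (`num_scheduler_steps=8`, eager mode, V0 engine). The model path is a placeholder:

```python
from vllm import LLM, SamplingParams

# Multi-step scheduling: the engine runs 8 decode steps per scheduler invocation,
# matching num_scheduler_steps=8 in the engine config above. Model is a placeholder.
llm = LLM(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    num_scheduler_steps=8,
    enforce_eager=True,
    max_model_len=1024,
)

outputs = llm.generate(
    ["vLLM is a high-throughput and memory-efficient inference engine for"],
    SamplingParams(temperature=0.0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```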
Update: the output of chunked prefill does not align with transformers on either CANN 8.0.0.beta1 or CANN 8.1.rc1.beta1.

Failed test:

import os
import pytest
from tests.model_utils import check_logprobs_close, check_outputs_equal
MODELS = [
# "facebook/opt-125m",
"/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct"
# "meta-llama/Llama-3.2-1B-Instruct",
]
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("chunked_prefill_token_size", [1,
4, 16
])
@pytest.mark.parametrize("enforce_eager", [True])
# NOTE: Increasing this in this suite will fail CI because we currently cannot
# reset distributed env properly. Use a value > 1 just when you test.
@pytest.mark.parametrize("tensor_parallel_size", [1])
def test_models(
    hf_runner,
    vllm_runner,
    example_prompts,
    model: str,
    dtype: str,
    max_tokens: int,
    chunked_prefill_token_size: int,
    enforce_eager: bool,
    tensor_parallel_size: int,
) -> None:
    """
    Checks exact match decode between huggingface model and vllm runner with
    chunked prefill.
    """
    max_num_seqs = chunked_prefill_token_size
    max_num_batched_tokens = chunked_prefill_token_size
    with hf_runner(model, dtype=dtype) as hf_model:
        hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)
    with vllm_runner(
            model,
            dtype=dtype,
            max_num_batched_tokens=max_num_batched_tokens,
            enable_chunked_prefill=True,
            tensor_parallel_size=tensor_parallel_size,
            enforce_eager=enforce_eager,
            max_num_seqs=max_num_seqs,
    ) as vllm_model:
        vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)
    print(hf_outputs)
    print(100*"*")
    print(vllm_outputs)
    check_outputs_equal(
        outputs_0_lst=hf_outputs,
        outputs_1_lst=vllm_outputs,
        name_0="hf",
        name_1="vllm",
    )

The output of chunked prefill could not align to transformers.

/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/pytest_asyncio/plugin.py:207: PytestDeprecationWarning: The configuration option "asyncio_default_fixture_loop_scope" is unset.
The event loop scope for asynchronous fixtures will default to the fixture caching scope. Future versions of pytest-asyncio will default the loop scope for asynchronous fixtures to function scope. Set the default fixture loop scope explicitly in order to avoid unexpected behavior in the future. Valid fixture loop scopes are: "function", "class", "module", "package", "session"
warnings.warn(PytestDeprecationWarning(_DEFAULT_FIXTURE_LOOP_SCOPE_UNSET))
=========================================================================== test session starts ============================================================================
platform linux -- Python 3.10.16, pytest-8.3.4, pluggy-1.5.0 -- /home/xxx/miniconda3/envs/atb/bin/python
cachedir: .pytest_cache
rootdir: /home/xxx/code/vllm-ascend
configfile: pytest.ini
plugins: shard-0.1.2, markdown-docs-0.9.0, rerunfailures-15.0, md-0.2.0, asyncio-0.25.3, anyio-4.8.0, mock-3.14.0, forked-1.6.0, typeguard-4.3.0
asyncio: mode=strict, asyncio_default_fixture_loop_scope=None
collected 3 items
Running 3 items in this shard: tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct], tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct], tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] INFO 04-29 08:10:02 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-29 08:10:02 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-29 08:10:02 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-29 08:10:02 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-29 08:10:02 __init__.py:44] plugin ascend loaded.
INFO 04-29 08:10:02 __init__.py:198] Platform plugin ascend is activated
WARNING 04-29 08:10:02 config.py:2448] Casting torch.bfloat16 to torch.float16.
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
INFO 04-29 08:10:20 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-29 08:10:20 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-29 08:10:20 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-29 08:10:20 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-29 08:10:20 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-29 08:10:20 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:CustomQwen2VLForConditionalGeneration.
WARNING 04-29 08:10:20 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-29 08:10:20 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-29 08:10:20 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-29 08:10:20 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-29 08:10:20 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-29 08:10:20 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 04-29 08:10:33 config.py:549] This model supports multiple tasks: {'embed', 'score', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 04-29 08:10:33 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=1.
INFO 04-29 08:10:33 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:10:35 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffcb00f6830>
INFO 04-29 08:10:35 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.85it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.84it/s]
INFO 04-29 08:10:36 model_runner.py:827] Loading model weights took 0.9277 GB
INFO 04-29 08:10:39 executor_base.py:111] # npu blocks: 293888, # CPU blocks: 21845
INFO 04-29 08:10:39 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 4592.00x
INFO 04-29 08:10:40 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 4.11 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:11<00:00, 1.39s/it, est. speed input: 13.82 toks/s, output: 22.97 toks/s]
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 6147, 3847, 311, 1598, 44378, 389, 3460, 12934, 4119, 304, 15279, 11, 1393, 1083, 8241, 11050, 2473, 9691, 13, 1084, 11554, 2176, 4237, 323, 2205, 11320, 11, 448, 279, 5726, 311], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 58194, 21392, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 6009, 448, 279, 4124, 975, 389, 5662, 6832, 25185, 304, 279, 5099, 12, 17, 15, 339], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of Artificial Intelligence (AI) has been a long and complex process that began with the early work on machine learning algorithms in the mid-20th'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 21392, 320, 15469, 8, 323, 11097, 21392, 320, 23913, 8, 525, 1378, 12460, 18940, 429, 7512, 2155, 13566, 315, 279, 8109, 594, 8692, 16928, 13, 5692, 374, 264, 12313, 323], "Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial Intelligence (AI) and Human Intelligence (HI) are two distinct concepts that describe different aspects of the brain's processing capabilities. Here is a comparison and"), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 17167, 315, 13617, 315, 82316, 7798, 476, 33213, 13, 576, 6770, 6813, 315, 264, 29728, 3922, 2924, 1447, 16, 13, 5571], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that consists of layers of interconnected nodes or neurons. The basic components of a neural network include:\n\n1. Input'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 1260, 572, 264, 47394, 323, 7988, 5662, 448, 264, 9906, 2518, 27263, 323, 264, 2613, 11, 4778], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. He was a sleek and powerful machine with a bright red exterior and a small, round'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 1376, 5510, 304, 892, 432, 702, 11495, 1493, 5671], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some key ways in which it has affected these areas'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 429, 702, 6427, 54686, 279, 1879, 369, 23631, 13, 1084, 572, 23983, 1948, 220, 16, 20, 15, 18, 323, 220, 16], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci that has captivated the world for centuries. It was painted between 1503 and 1'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 13, 576, 4124, 11958, 37834, 279, 34211, 304, 8453, 624, 33, 13, 220, 99391, 86117, 28195, 60726, 19655, 102176, 29412, 125232, 128687, 8997, 34, 13, 220, 99391, 86117, 28195, 60726, 19655], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA. The early bird catches the worm in Chinese.\nB. 早番が先に虫を食べられる。\nC. 早番が先に")]
****************************************************************************************************
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 374, 6188, 311, 387, 1483, 304, 264, 8045, 315, 8357, 11, 2670, 5810, 4128, 8692, 11, 6366, 11129, 11, 323, 8806, 17843, 13, 576, 4712, 374, 5798, 389, 1909, 315, 279], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 20443, 11229, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 702, 4429, 1992, 916, 3807, 10793, 13, 5692, 525, 279, 3598, 68276, 304, 279, 4401, 315], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of artificial intelligence (AI) has been a long and complex process that has taken place over several decades. Here are the major milestones in the development of'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 11229, 320, 15469, 8, 323, 3738, 11229, 525, 1378, 12460, 18940, 429, 525, 3545, 1483, 51263, 2845, 11, 714, 807, 525, 537, 279, 1852, 3166, 13, 15235, 19257, 311, 279], 'Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial intelligence (AI) and human intelligence are two distinct concepts that are often used interchangeably, but they are not the same thing. AI refers to the'), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 374, 1483, 311, 2736, 9079, 1741, 438, 2168, 17843, 11, 5810, 4128, 8692, 11, 323, 8806, 17843, 13, 1084, 17167, 315], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that is used to perform tasks such as image recognition, natural language processing, and speech recognition. It consists of'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 431, 17, 9420, 17, 572, 264, 47394, 323, 47394, 12305, 429, 1030, 1012, 6188, 311, 387, 279], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. R2-D2 was a sleek and sleek robot that had been designed to be the'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 315, 279, 1376, 5510, 304, 892, 279, 27422, 702], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some of the key ways in which the pandemic has'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 11, 3465, 304, 279, 4124, 220, 16, 21, 339, 9294, 13, 1084, 374, 825, 315, 279, 1429, 11245, 35592, 304, 279], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci, created in the early 16th century. It is one of the most famous paintings in the'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 33, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 34, 25, 220, 99391, 86117, 15322], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA: 早番は早番に勝ちます。\nB: 早番は早番に勝ちます。\nC: 早番は")]
FAILED
tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] WARNING 04-29 08:11:10 config.py:2448] Casting torch.bfloat16 to torch.float16.
WARNING 04-29 08:11:24 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 04-29 08:11:24 config.py:549] This model supports multiple tasks: {'embed', 'score', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 04-29 08:11:24 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=4.
INFO 04-29 08:11:24 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:11:25 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfff5d95ce3e0>
INFO 04-29 08:11:25 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.41it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.41it/s]
INFO 04-29 08:11:26 model_runner.py:827] Loading model weights took 0.9241 GB
INFO 04-29 08:11:27 executor_base.py:111] # npu blocks: 294756, # CPU blocks: 21845
INFO 04-29 08:11:27 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 4605.56x
INFO 04-29 08:11:27 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 1.05 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 8/8 [00:03<00:00, 2.13it/s, est. speed input: 41.00 toks/s, output: 68.15 toks/s]
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 6147, 3847, 311, 1598, 44378, 389, 3460, 12934, 4119, 304, 15279, 11, 1393, 1083, 8241, 11050, 2473, 9691, 13, 1084, 11554, 2176, 4237, 323, 2205, 11320, 11, 448, 279, 5726, 311], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 58194, 21392, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 6009, 448, 279, 4124, 975, 389, 5662, 6832, 25185, 304, 279, 5099, 12, 17, 15, 339], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of Artificial Intelligence (AI) has been a long and complex process that began with the early work on machine learning algorithms in the mid-20th'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 21392, 320, 15469, 8, 323, 11097, 21392, 320, 23913, 8, 525, 1378, 12460, 18940, 429, 7512, 2155, 13566, 315, 279, 8109, 594, 8692, 16928, 13, 5692, 374, 264, 12313, 323], "Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial Intelligence (AI) and Human Intelligence (HI) are two distinct concepts that describe different aspects of the brain's processing capabilities. Here is a comparison and"), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 17167, 315, 13617, 315, 82316, 7798, 476, 33213, 13, 576, 6770, 6813, 315, 264, 29728, 3922, 2924, 1447, 16, 13, 5571], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that consists of layers of interconnected nodes or neurons. The basic components of a neural network include:\n\n1. Input'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 1260, 572, 264, 47394, 323, 7988, 5662, 448, 264, 9906, 2518, 27263, 323, 264, 2613, 11, 4778], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. He was a sleek and powerful machine with a bright red exterior and a small, round'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 1376, 5510, 304, 892, 432, 702, 11495, 1493, 5671], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some key ways in which it has affected these areas'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 429, 702, 6427, 54686, 279, 1879, 369, 23631, 13, 1084, 572, 23983, 1948, 220, 16, 20, 15, 18, 323, 220, 16], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci that has captivated the world for centuries. It was painted between 1503 and 1'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 13, 576, 4124, 11958, 37834, 279, 34211, 304, 8453, 624, 33, 13, 220, 99391, 86117, 28195, 60726, 19655, 102176, 29412, 125232, 128687, 8997, 34, 13, 220, 99391, 86117, 28195, 60726, 19655], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA. The early bird catches the worm in Chinese.\nB. 早番が先に虫を食べられる。\nC. 早番が先に")]
****************************************************************************************************
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 374, 6188, 311, 387, 1483, 304, 264, 8045, 315, 8357, 11, 2670, 5810, 4128, 8692, 11, 6366, 11129, 11, 323, 8806, 17843, 13, 576, 4712, 374, 5798, 389, 1909, 315, 279], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 20443, 11229, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 702, 4429, 1992, 916, 3807, 10793, 13, 5692, 525, 279, 3598, 68276, 304, 279, 4401, 315], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of artificial intelligence (AI) has been a long and complex process that has taken place over several decades. Here are the major milestones in the development of'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 11229, 320, 15469, 8, 323, 3738, 11229, 525, 1378, 12460, 18940, 429, 525, 3545, 1483, 51263, 2845, 11, 714, 807, 525, 537, 279, 1852, 3166, 13, 15235, 19257, 311, 279], 'Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial intelligence (AI) and human intelligence are two distinct concepts that are often used interchangeably, but they are not the same thing. AI refers to the'), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 374, 1483, 311, 2736, 9079, 1741, 438, 2168, 17843, 11, 5810, 4128, 8692, 11, 323, 8806, 17843, 13, 1084, 17167, 315], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that is used to perform tasks such as image recognition, natural language processing, and speech recognition. It consists of'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 431, 17, 9420, 17, 572, 264, 47394, 323, 47394, 12305, 429, 1030, 1012, 6188, 311, 387, 279], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. R2-D2 was a sleek and sleek robot that had been designed to be the'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 315, 279, 1376, 5510, 304, 892, 279, 27422, 702], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some of the key ways in which the pandemic has'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 11, 3465, 304, 279, 4124, 220, 16, 21, 339, 9294, 13, 1084, 374, 6509, 825, 315, 279, 1429, 11245, 35592, 304], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci, created in the early 16th century. It is considered one of the most famous paintings in'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 33, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 34, 25, 220, 99391, 86117, 15322], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA: 早番は早番に勝ちます。\nB: 早番は早番に勝ちます。\nC: 早番は")]
FAILED
tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] WARNING 04-29 08:11:51 config.py:2448] Casting torch.bfloat16 to torch.float16.
WARNING 04-29 08:12:06 config.py:2448] Casting torch.bfloat16 to torch.float16.
INFO 04-29 08:12:06 config.py:549] This model supports multiple tasks: {'embed', 'score', 'classify', 'reward', 'generate'}. Defaulting to 'generate'.
INFO 04-29 08:12:06 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=16.
INFO 04-29 08:12:06 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
WARNING 04-29 08:12:07 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfff5c21567d0>
INFO 04-29 08:12:07 model_runner.py:822] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.55it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 2.55it/s]
INFO 04-29 08:12:08 model_runner.py:827] Loading model weights took 0.9241 GB
INFO 04-29 08:12:08 executor_base.py:111] # npu blocks: 294212, # CPU blocks: 21845
INFO 04-29 08:12:08 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 4597.06x
INFO 04-29 08:12:08 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 0.73 seconds
Processed prompts: 100%|██████████████████████████████████████████████████████████████| 8/8 [00:01<00:00, 5.88it/s, est. speed input: 113.12 toks/s, output: 188.03 toks/s]
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 6147, 3847, 311, 1598, 44378, 389, 3460, 12934, 4119, 304, 15279, 11, 1393, 1083, 8241, 11050, 2473, 9691, 13, 1084, 11554, 2176, 4237, 323, 2205, 11320, 11, 448, 279, 5726, 311], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 58194, 21392, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 6009, 448, 279, 4124, 975, 389, 5662, 6832, 25185, 304, 279, 5099, 12, 17, 15, 339], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of Artificial Intelligence (AI) has been a long and complex process that began with the early work on machine learning algorithms in the mid-20th'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 21392, 320, 15469, 8, 323, 11097, 21392, 320, 23913, 8, 525, 1378, 12460, 18940, 429, 7512, 2155, 13566, 315, 279, 8109, 594, 8692, 16928, 13, 5692, 374, 264, 12313, 323], "Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial Intelligence (AI) and Human Intelligence (HI) are two distinct concepts that describe different aspects of the brain's processing capabilities. Here is a comparison and"), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 17167, 315, 13617, 315, 82316, 7798, 476, 33213, 13, 576, 6770, 6813, 315, 264, 29728, 3922, 2924, 1447, 16, 13, 5571], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that consists of layers of interconnected nodes or neurons. The basic components of a neural network include:\n\n1. Input'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 1260, 572, 264, 47394, 323, 7988, 5662, 448, 264, 9906, 2518, 27263, 323, 264, 2613, 11, 4778], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. He was a sleek and powerful machine with a bright red exterior and a small, round'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 1376, 5510, 304, 892, 432, 702, 11495, 1493, 5671], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some key ways in which it has affected these areas'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 429, 702, 6427, 54686, 279, 1879, 369, 23631, 13, 1084, 572, 23983, 1948, 220, 16, 20, 15, 18, 323, 220, 16], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci that has captivated the world for centuries. It was painted between 1503 and 1'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 13, 576, 4124, 11958, 37834, 279, 34211, 304, 8453, 624, 33, 13, 220, 99391, 86117, 28195, 60726, 19655, 102176, 29412, 125232, 128687, 8997, 34, 13, 220, 99391, 86117, 28195, 60726, 19655], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA. The early bird catches the worm in Chinese.\nB. 早番が先に虫を食べられる。\nC. 早番が先に")]
****************************************************************************************************
[([85, 4086, 44, 374, 264, 1550, 42747, 628, 323, 4938, 72816, 44378, 323, 13480, 4712, 369, 444, 10994, 82, 624, 2132, 374, 6188, 311, 387, 1483, 304, 264, 8045, 315, 8357, 11, 2670, 5810, 4128, 8692, 11, 6366, 11129, 11, 323, 8806, 17843, 13, 576, 4712, 374, 5798, 389, 1909, 315, 279], 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'), ([85984, 398, 7512, 279, 3598, 68276, 304, 279, 4401, 315, 20443, 11229, 504, 220, 16, 24, 20, 15, 311, 220, 17, 15, 17, 15, 624, 785, 4401, 315, 20443, 11229, 320, 15469, 8, 702, 1012, 264, 1293, 323, 6351, 1882, 429, 702, 4429, 1992, 916, 3807, 10793, 13, 5692, 525, 279, 3598, 68276, 304, 279, 4401, 315], 'Briefly describe the major milestones in the development of artificial intelligence from 1950 to 2020.\nThe development of artificial intelligence (AI) has been a long and complex process that has taken place over several decades. Here are the major milestones in the development of'), ([27374, 323, 12872, 20443, 11229, 448, 3738, 11229, 304, 3793, 315, 8692, 1995, 624, 9286, 16488, 11229, 320, 15469, 8, 323, 3738, 11229, 525, 1378, 12460, 18940, 429, 525, 3545, 1483, 51263, 2845, 11, 714, 807, 525, 537, 279, 1852, 3166, 13, 15235, 19257, 311, 279], 'Compare and contrast artificial intelligence with human intelligence in terms of processing information.\nArtificial intelligence (AI) and human intelligence are two distinct concepts that are often used interchangeably, but they are not the same thing. AI refers to the'), ([74785, 279, 6770, 6813, 315, 264, 29728, 3922, 323, 1246, 432, 646, 387, 16176, 624, 32, 29728, 3922, 374, 264, 943, 315, 5662, 6832, 1614, 429, 374, 1483, 311, 2736, 9079, 1741, 438, 2168, 17843, 11, 5810, 4128, 8692, 11, 323, 8806, 17843, 13, 1084, 17167, 315], 'Describe the basic components of a neural network and how it can be trained.\nA neural network is a type of machine learning model that is used to perform tasks such as image recognition, natural language processing, and speech recognition. It consists of'), ([7985, 264, 2805, 3364, 911, 264, 12305, 429, 18707, 369, 279, 1156, 882, 624, 12522, 5193, 264, 882, 11, 1052, 572, 264, 12305, 6941, 431, 17, 9420, 17, 13, 431, 17, 9420, 17, 572, 264, 47394, 323, 47394, 12305, 429, 1030, 1012, 6188, 311, 387, 279], 'Write a short story about a robot that dreams for the first time.\nOnce upon a time, there was a robot named R2-D2. R2-D2 was a sleek and sleek robot that had been designed to be the'), ([2082, 55856, 279, 5421, 315, 279, 19966, 12, 16, 24, 27422, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 624, 785, 19966, 12, 16, 24, 27422, 702, 1030, 264, 27155, 5421, 389, 3644, 6955, 14389, 323, 3853, 2562, 4119, 13, 5692, 525, 1045, 315, 279, 1376, 5510, 304, 892, 279, 27422, 702], 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\nThe COVID-19 pandemic has had a profound impact on global economic structures and future business models. 
Here are some of the key ways in which the pandemic has'), ([840, 20772, 279, 12752, 25361, 315, 279, 98783, 28556, 18824, 11, 323, 1246, 1181, 20431, 2578, 13289, 304, 10867, 19041, 18028, 33675, 624, 785, 98783, 28556, 374, 264, 11245, 18824, 553, 65386, 2994, 96766, 11, 3465, 304, 279, 4124, 220, 16, 21, 339, 9294, 13, 1084, 374, 6509, 825, 315, 279, 1429, 11245, 35592, 304], 'Explain the cultural significance of the Mona Lisa painting, and how its perception might vary in Western versus Eastern societies.\nThe Mona Lisa is a famous painting by Leonardo da Vinci, created in the early 16th century. It is considered one of the most famous paintings in'), ([27473, 279, 2701, 6364, 11652, 1119, 10769, 11, 8585, 11, 323, 4492, 1466, 3921, 25, 364, 785, 4124, 11958, 37834, 279, 34211, 23421, 32, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 33, 25, 220, 99391, 86117, 15322, 99391, 86117, 19655, 109434, 42414, 32441, 8997, 34, 25, 220, 99391, 86117, 15322], "Translate the following English sentence into Japanese, French, and Swahili: 'The early bird catches the worm.'\nA: 早番は早番に勝ちます。\nB: 早番は早番に勝ちます。\nC: 早番は")]
FAILED
================================================================================= FAILURES =================================================================================
______________________________________ test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] ______________________________________
hf_runner = <class 'tests.conftest.HfRunner'>, vllm_runner = <class 'tests.conftest.VllmRunner'>
example_prompts = ['vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n', 'Briefly describe the majo...me.\n', 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\n', ...]
model = '/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', dtype = 'half', max_tokens = 32, chunked_prefill_token_size = 1, enforce_eager = True
tensor_parallel_size = 1
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("chunked_prefill_token_size", [1,
4, 16
])
@pytest.mark.parametrize("enforce_eager", [True])
# NOTE: Increasing this in this suite will fail CI because we currently cannot
# reset distributed env properly. Use a value > 1 just when you test.
@pytest.mark.parametrize("tensor_parallel_size", [1])
def test_models(
hf_runner,
vllm_runner,
example_prompts,
model: str,
dtype: str,
max_tokens: int,
chunked_prefill_token_size: int,
enforce_eager: bool,
tensor_parallel_size: int,
) -> None:
"""
Checks exact match decode between huggingface model and vllm runner with
chunked prefill.
"""
max_num_seqs = chunked_prefill_token_size
max_num_batched_tokens = chunked_prefill_token_size
with hf_runner(model, dtype=dtype) as hf_model:
hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)
with vllm_runner(
model,
dtype=dtype,
max_num_batched_tokens=max_num_batched_tokens,
enable_chunked_prefill=True,
tensor_parallel_size=tensor_parallel_size,
enforce_eager=enforce_eager,
max_num_seqs=max_num_seqs,
) as vllm_model:
vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)
print(hf_outputs)
print(100*"*")
print(vllm_outputs)
> check_outputs_equal(
outputs_0_lst=hf_outputs,
outputs_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
)
tests/test_chunk_prefill.py:70:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def check_outputs_equal(
*,
outputs_0_lst: Sequence[TokensText],
outputs_1_lst: Sequence[TokensText],
name_0: str,
name_1: str,
):
"""
Compare the two sequences generated by different models,
which should be equal.
"""
assert len(outputs_0_lst) == len(outputs_1_lst)
for prompt_idx, (outputs_0,
outputs_1) in enumerate(zip(outputs_0_lst,
outputs_1_lst)):
output_ids_0, output_str_0 = outputs_0
output_ids_1, output_str_1 = outputs_1
# The text and token outputs should exactly match
fail_msg = (f"Test{prompt_idx}:"
f"\n{name_0}:\t{output_str_0!r}"
f"\n{name_1}:\t{output_str_1!r}")
> assert output_str_0 == output_str_1, fail_msg
E AssertionError: Test0:
E hf: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'
E vllm: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'
tests/model_utils.py:55: AssertionError
______________________________________ test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] ______________________________________
hf_runner = <class 'tests.conftest.HfRunner'>, vllm_runner = <class 'tests.conftest.VllmRunner'>
example_prompts = ['vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n', 'Briefly describe the majo...me.\n', 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\n', ...]
model = '/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', dtype = 'half', max_tokens = 32, chunked_prefill_token_size = 4, enforce_eager = True
tensor_parallel_size = 1
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("chunked_prefill_token_size", [1,
4, 16
])
@pytest.mark.parametrize("enforce_eager", [True])
# NOTE: Increasing this in this suite will fail CI because we currently cannot
# reset distributed env properly. Use a value > 1 just when you test.
@pytest.mark.parametrize("tensor_parallel_size", [1])
def test_models(
hf_runner,
vllm_runner,
example_prompts,
model: str,
dtype: str,
max_tokens: int,
chunked_prefill_token_size: int,
enforce_eager: bool,
tensor_parallel_size: int,
) -> None:
"""
Checks exact match decode between huggingface model and vllm runner with
chunked prefill.
"""
max_num_seqs = chunked_prefill_token_size
max_num_batched_tokens = chunked_prefill_token_size
with hf_runner(model, dtype=dtype) as hf_model:
hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)
with vllm_runner(
model,
dtype=dtype,
max_num_batched_tokens=max_num_batched_tokens,
enable_chunked_prefill=True,
tensor_parallel_size=tensor_parallel_size,
enforce_eager=enforce_eager,
max_num_seqs=max_num_seqs,
) as vllm_model:
vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)
print(hf_outputs)
print(100*"*")
print(vllm_outputs)
> check_outputs_equal(
outputs_0_lst=hf_outputs,
outputs_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
)
tests/test_chunk_prefill.py:70:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def check_outputs_equal(
*,
outputs_0_lst: Sequence[TokensText],
outputs_1_lst: Sequence[TokensText],
name_0: str,
name_1: str,
):
"""
Compare the two sequences generated by different models,
which should be equal.
"""
assert len(outputs_0_lst) == len(outputs_1_lst)
for prompt_idx, (outputs_0,
outputs_1) in enumerate(zip(outputs_0_lst,
outputs_1_lst)):
output_ids_0, output_str_0 = outputs_0
output_ids_1, output_str_1 = outputs_1
# The text and token outputs should exactly match
fail_msg = (f"Test{prompt_idx}:"
f"\n{name_0}:\t{output_str_0!r}"
f"\n{name_1}:\t{output_str_1!r}")
> assert output_str_0 == output_str_1, fail_msg
E AssertionError: Test0:
E hf: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'
E vllm: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'
tests/model_utils.py:55: AssertionError
_____________________________________ test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] ______________________________________
hf_runner = <class 'tests.conftest.HfRunner'>, vllm_runner = <class 'tests.conftest.VllmRunner'>
example_prompts = ['vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\n', 'Briefly describe the majo...me.\n', 'Analyze the impact of the COVID-19 pandemic on global economic structures and future business models.\n', ...]
model = '/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct', dtype = 'half', max_tokens = 32, chunked_prefill_token_size = 16, enforce_eager = True
tensor_parallel_size = 1
@pytest.mark.parametrize("model", MODELS)
@pytest.mark.parametrize("dtype", ["half"])
@pytest.mark.parametrize("max_tokens", [32])
@pytest.mark.parametrize("chunked_prefill_token_size", [1,
4, 16
])
@pytest.mark.parametrize("enforce_eager", [True])
# NOTE: Increasing this in this suite will fail CI because we currently cannot
# reset distributed env properly. Use a value > 1 just when you test.
@pytest.mark.parametrize("tensor_parallel_size", [1])
def test_models(
hf_runner,
vllm_runner,
example_prompts,
model: str,
dtype: str,
max_tokens: int,
chunked_prefill_token_size: int,
enforce_eager: bool,
tensor_parallel_size: int,
) -> None:
"""
Checks exact match decode between huggingface model and vllm runner with
chunked prefill.
"""
max_num_seqs = chunked_prefill_token_size
max_num_batched_tokens = chunked_prefill_token_size
with hf_runner(model, dtype=dtype) as hf_model:
hf_outputs = hf_model.generate_greedy(example_prompts, max_tokens)
with vllm_runner(
model,
dtype=dtype,
max_num_batched_tokens=max_num_batched_tokens,
enable_chunked_prefill=True,
tensor_parallel_size=tensor_parallel_size,
enforce_eager=enforce_eager,
max_num_seqs=max_num_seqs,
) as vllm_model:
vllm_outputs = vllm_model.generate_greedy(example_prompts, max_tokens)
print(hf_outputs)
print(100*"*")
print(vllm_outputs)
> check_outputs_equal(
outputs_0_lst=hf_outputs,
outputs_1_lst=vllm_outputs,
name_0="hf",
name_1="vllm",
)
tests/test_chunk_prefill.py:70:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
def check_outputs_equal(
*,
outputs_0_lst: Sequence[TokensText],
outputs_1_lst: Sequence[TokensText],
name_0: str,
name_1: str,
):
"""
Compare the two sequences generated by different models,
which should be equal.
"""
assert len(outputs_0_lst) == len(outputs_1_lst)
for prompt_idx, (outputs_0,
outputs_1) in enumerate(zip(outputs_0_lst,
outputs_1_lst)):
output_ids_0, output_str_0 = outputs_0
output_ids_1, output_str_1 = outputs_1
# The text and token outputs should exactly match
fail_msg = (f"Test{prompt_idx}:"
f"\n{name_0}:\t{output_str_0!r}"
f"\n{name_1}:\t{output_str_1!r}")
> assert output_str_0 == output_str_1, fail_msg
E AssertionError: Test0:
E hf: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt allows users to run inference on large-scale models in parallel, while also providing efficient service delivery. It supports both distributed and local execution, with the ability to'
E vllm: 'vLLM is a high-throughput and memory-efficient inference and serving engine for LLMs.\nIt is designed to be used in a variety of applications, including natural language processing, computer vision, and speech recognition. The engine is built on top of the'
tests/model_utils.py:55: AssertionError
============================================================================= warnings summary =============================================================================
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:631: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.7` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
warnings.warn(
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:636: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.8` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
warnings.warn(
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:653: UserWarning: `do_sample` is set to `False`. However, `top_k` is set to `20` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_k`.
warnings.warn(
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct]
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================= short test summary info ==========================================================================
FAILED tests/test_chunk_prefill.py::test_models[1-True-1-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] - AssertionError: Test0:
FAILED tests/test_chunk_prefill.py::test_models[1-True-4-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] - AssertionError: Test0:
FAILED tests/test_chunk_prefill.py::test_models[1-True-16-32-half-/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-0___5B-Instruct] - AssertionError: Test0:
================================================================ 3 failed, 11 warnings in 145.73s (0:02:25) ================================================================ (same with mindie-turbo) |
Speculative decode and MTP tests pass in local testing. There are some limits on spec decode and MTP on vllm-ascend; we will update them in the doc. |
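For reference, a minimal offline speculative-decoding sketch with the V0 engine is shown below, using ngram drafting; the model path, draft length, and lookup window are placeholder assumptions, not the configuration used in the verification above.

from vllm import LLM, SamplingParams

# Speculative-decoding sketch (assumed v0.7.3 V0 engine arguments; the model path,
# num_speculative_tokens and ngram_prompt_lookup_max values are placeholders).
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",   # placeholder target model
    speculative_model="[ngram]",        # draft via prompt n-gram lookup, no extra draft model
    num_speculative_tokens=5,           # draft tokens proposed per step
    ngram_prompt_lookup_max=4,          # max n-gram window for the lookup
    enforce_eager=True,
)

sampling_params = SamplingParams(temperature=0, max_tokens=64)
print(llm.generate(["The future of AI is"], sampling_params)[0].outputs[0].text)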
Automatic Prefix Caching test passed with nnal 8.1.RC1.

test script:

import time
from vllm import LLM, SamplingParams
# A prompt containing a large markdown table. The table is randomly generated by GPT-4.
LONG_PROMPT = "You are a helpful assistant in recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + """
| ID | Name | Age | Occupation | Country | Email | Phone Number | Address |
|-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
| 1 | John Doe | 29 | Engineer | USA | john.doe@example.com | 555-1234 | 123 Elm St, Springfield, IL |
| 2 | Jane Smith | 34 | Doctor | Canada | jane.smith@example.com | 555-5678 | 456 Oak St, Toronto, ON |
| 3 | Alice Johnson | 27 | Teacher | UK | alice.j@example.com | 555-8765 | 789 Pine St, London, UK |
| 4 | Bob Brown | 45 | Artist | Australia | bob.b@example.com | 555-4321 | 321 Maple St, Sydney, NSW |
| 5 | Carol White | 31 | Scientist | New Zealand | carol.w@example.com | 555-6789 | 654 Birch St, Wellington, NZ |
| 6 | Dave Green | 28 | Lawyer | Ireland | dave.g@example.com | 555-3456 | 987 Cedar St, Dublin, IE |
| 7 | Emma Black | 40 | Musician | USA | emma.b@example.com | 555-1111 | 246 Ash St, New York, NY |
| 8 | Frank Blue | 37 | Chef | Canada | frank.b@example.com | 555-2222 | 135 Spruce St, Vancouver, BC |
| 9 | Grace Yellow | 50 | Engineer | UK | grace.y@example.com | 555-3333 | 864 Fir St, Manchester, UK |
| 10 | Henry Violet | 32 | Artist | Australia | henry.v@example.com | 555-4444 | 753 Willow St, Melbourne, VIC|
| 11 | Irene Orange | 26 | Scientist | New Zealand | irene.o@example.com | 555-5555 | 912 Poplar St, Auckland, NZ |
| 12 | Jack Indigo | 38 | Teacher | Ireland | jack.i@example.com | 555-6666 | 159 Elm St, Cork, IE |
| 13 | Karen Red | 41 | Lawyer | USA | karen.r@example.com | 555-7777 | 357 Cedar St, Boston, MA |
| 14 | Leo Brown | 30 | Chef | Canada | leo.b@example.com | 555-8888 | 246 Oak St, Calgary, AB |
| 15 | Mia Green | 33 | Musician | UK | mia.g@example.com | 555-9999 | 975 Pine St, Edinburgh, UK |
| 16 | Noah Yellow | 29 | Doctor | Australia | noah.y@example.com | 555-0000 | 864 Birch St, Brisbane, QLD |
| 17 | Olivia Blue | 35 | Engineer | New Zealand | olivia.b@example.com | 555-1212 | 753 Maple St, Hamilton, NZ |
| 18 | Peter Black | 42 | Artist | Ireland | peter.b@example.com | 555-3434 | 912 Fir St, Limerick, IE |
| 19 | Quinn White | 28 | Scientist | USA | quinn.w@example.com | 555-5656 | 159 Willow St, Seattle, WA |
| 20 | Rachel Red | 31 | Teacher | Canada | rachel.r@example.com | 555-7878 | 357 Poplar St, Ottawa, ON |
| 21 | Steve Green | 44 | Lawyer | UK | steve.g@example.com | 555-9090 | 753 Elm St, Birmingham, UK |
| 22 | Tina Blue | 36 | Musician | Australia | tina.b@example.com | 555-1213 | 864 Cedar St, Perth, WA |
| 23 | Umar Black | 39 | Chef | New Zealand | umar.b@example.com | 555-3435 | 975 Spruce St, Christchurch, NZ|
| 24 | Victor Yellow | 43 | Engineer | Ireland | victor.y@example.com | 555-5657 | 246 Willow St, Galway, IE |
| 25 | Wendy Orange | 27 | Artist | USA | wendy.o@example.com | 555-7879 | 135 Elm St, Denver, CO |
| 26 | Xavier Green | 34 | Scientist | Canada | xavier.g@example.com | 555-9091 | 357 Oak St, Montreal, QC |
| 27 | Yara Red | 41 | Teacher | UK | yara.r@example.com | 555-1214 | 975 Pine St, Leeds, UK |
| 28 | Zack Blue | 30 | Lawyer | Australia | zack.b@example.com | 555-3436 | 135 Birch St, Adelaide, SA |
| 29 | Amy White | 33 | Musician | New Zealand | amy.w@example.com | 555-5658 | 159 Maple St, Wellington, NZ |
| 30 | Ben Black | 38 | Chef | Ireland | ben.b@example.com | 555-7870 | 246 Fir St, Waterford, IE |
"""
def get_generation_time(llm, sampling_params, prompts):
# time the generation
start_time = time.time()
output = llm.generate(prompts, sampling_params=sampling_params)
end_time = time.time()
# print the output and generation time
print(f"Output: {output[0].outputs[0].text}")
print(f"Generation time: {end_time - start_time} seconds.")
# set enable_prefix_caching=True to enable APC
llm = LLM(
model='lmsys/longchat-13b-16k',
enable_prefix_caching=True
)
sampling_params = SamplingParams(temperature=0, max_tokens=100)
# Querying the age of John Doe
get_generation_time(
llm,
sampling_params,
LONG_PROMPT + "Question: what is the age of John Doe? Your answer: The age of John Doe is ",
)
# Querying the age of Zack Blue
# This query will be faster since vllm avoids computing the KV cache of LONG_PROMPT again.
get_generation_time(
llm,
sampling_params,
LONG_PROMPT + "Question: what is the age of Zack Blue? Your answer: The age of Zack Blue is ",
)

result:

root@8a75c4e375f8:/# python apc_demo.py
INFO 04-30 07:28:25 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 07:28:25 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 07:28:25 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 07:28:25 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:28:25 __init__.py:44] plugin ascend loaded.
INFO 04-30 07:28:25 __init__.py:198] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-30 07:28:25 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 07:28:25 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 07:28:25 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 07:28:25 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:28:25 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 07:28:25 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 07:28:25 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 07:28:25 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 07:28:25 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 07:28:25 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 07:28:25 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-30 07:28:37 config.py:549] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed', 'score'}. Defaulting to 'generate'.
INFO 04-30 07:28:38 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='lmsys/longchat-13b-16k', speculative_config=None, tokenizer='lmsys/longchat-13b-16k', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=16384, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=lmsys/longchat-13b-16k, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
ERROR 04-30 07:28:39 camem.py:69] Failed to import vllm_ascend_C:No module named 'vllm_ascend.vllm_ascend_C'
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
WARNING 04-30 07:28:39 utils.py:2262] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd12f1d0c0>
INFO 04-30 07:28:45 model_runner.py:902] Starting to load model lmsys/longchat-13b-16k...
INFO 04-30 07:28:47 weight_utils.py:254] Using model weights format ['*.bin']
Loading pt checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading pt checkpoint shards: 33% Completed | 1/3 [00:08<00:16, 8.38s/it]
Loading pt checkpoint shards: 67% Completed | 2/3 [00:19<00:10, 10.23s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:30<00:00, 10.61s/it]
Loading pt checkpoint shards: 100% Completed | 3/3 [00:30<00:00, 10.33s/it]
INFO 04-30 07:29:18 model_runner.py:907] Loading model weights took 24.2871 GB
INFO 04-30 07:29:24 executor_base.py:111] # npu blocks: 283, # CPU blocks: 40
INFO 04-30 07:29:24 executor_base.py:116] Maximum concurrency for 16384 tokens per request: 2.21x
INFO 04-30 07:29:24 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 5.87 seconds
Processed prompts: 100%|█████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.29it/s, est. speed input: 4260.85 toks/s, output: 9.18 toks/s]
Output: 29.
Generation time: 0.4563312530517578 seconds.
Processed prompts: 100%|████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 4.95it/s, est. speed input: 9219.28 toks/s, output: 19.85 toks/s]
Output: 30.
Generation time: 0.2074754238128662 seconds. |
v0 test result

root@8a75c4e375f8:/workspace/vllm# vim examples/offline_inference/vision_language.py
root@8a75c4e375f8:/workspace/vllm# python examples/offline_inference/vision_language.py
INFO 04-30 07:42:04 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 07:42:04 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 07:42:04 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 07:42:04 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:42:04 __init__.py:44] plugin ascend loaded.
INFO 04-30 07:42:04 __init__.py:198] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-30 07:42:04 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 07:42:04 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 07:42:04 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 07:42:04 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:42:04 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 07:42:04 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 07:42:04 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 07:42:04 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 07:42:04 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 07:42:04 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 07:42:04 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-30 07:42:15 config.py:549] This model supports multiple tasks: {'reward', 'embed', 'generate', 'score', 'classify'}. Defaulting to 'generate'.
INFO 04-30 07:42:16 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct', speculative_config=None, tokenizer='/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs={'min_pixels': 784, 'max_pixels': 1003520, 'fps': 1}, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[8,4,2,1],"max_capture_size":8}, use_cached_outputs=False,
ERROR 04-30 07:42:17 camem.py:69] Failed to import vllm_ascend_C:No module named 'vllm_ascend.vllm_ascend_C'
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/usr/local/python3.10.17/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
WARNING 04-30 07:42:17 utils.py:2262] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd2a4bb010>
INFO 04-30 07:42:23 model_runner.py:902] Starting to load model /root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct...
WARNING 04-30 07:42:23 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
INFO 04-30 07:42:23 config.py:3054] cudagraph sizes specified by model runner [1, 2, 4, 8] is overridden by config [8, 1, 2, 4]
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:03<00:03, 3.35s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:07<00:00, 3.94s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:07<00:00, 3.85s/it]
INFO 04-30 07:42:32 model_runner.py:907] Loading model weights took 9.8590 GB
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
UserWorkspaceSize0
INFO 04-30 07:42:40 executor_base.py:111] # npu blocks: 10204, # CPU blocks: 910
INFO 04-30 07:42:40 executor_base.py:116] Maximum concurrency for 4096 tokens per request: 318.88x
INFO 04-30 07:42:41 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 8.60 seconds
WARNING 04-30 07:42:44 utils.py:1445] The following intended overrides are not keyword-only args and and will be dropped: {'fps', 'min_pixels', 'max_pixels'}
WARNING 04-30 07:42:44 utils.py:1445] The following intended overrides are not keyword-only args and and will be dropped: {'fps', 'min_pixels', 'max_pixels'}
WARNING 04-30 07:42:44 utils.py:1445] The following intended overrides are not keyword-only args and and will be dropped: {'fps', 'min_pixels', 'max_pixels'}
WARNING 04-30 07:42:44 utils.py:1445] The following intended overrides are not keyword-only args and and will be dropped: {'fps', 'min_pixels', 'max_pixels'}
Processed prompts: 0%| | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]UserWorkspaceSize0
UserWorkspaceSize0
Processed prompts: 25%|█████████▎ | 1/4 [00:03<00:09, 3.30s/it, est. speed input: 386.64 toks/s, output: 18.48 toks/s]UserWorkspaceSize0
Processed prompts: 100%|████████████████████████████████████| 4/4 [00:03<00:00, 1.17it/s, est. speed input: 1492.89 toks/s, output: 74.00 toks/s]
The image depicts a stunning view of the Tokyo Skytree, a tall broadcasting tower located in the Sumida Ward of Tokyo, Japan. The photo is taken from a low angle, looking up towards the tower, which is surrounded by cherry blossom trees in full bloom. The cherry blossoms are in full bloom, with pink
The image depicts the Tokyo Skytree, a tall broadcasting tower located in Sumida, Tokyo, Japan. The photo is taken during cherry blossom season, with pink cherry blossoms framing the tower against a clear blue sky. The cherry blossoms are in full bloom, creating a beautiful and serene atmosphere.
The image depicts a tall, cylindrical tower surrounded by cherry blossom trees. The cherry blossoms are in full bloom, with pink flowers covering the branches. The sky is clear and blue, creating a vibrant and picturesque scene. The tower appears to be a significant landmark, possibly a television tower or a similar structure, given its
The image depicts a tall, cylindrical tower with a lattice-like structure, surrounded by cherry blossom trees in full bloom. The cherry blossoms are pink and create a beautiful contrast against the clear blue sky. The tower appears to be a significant landmark, possibly a television tower or a similar structure, given its height and design.

v1 result

root@8a75c4e375f8:/workspace/vllm# VLLM_USE_V1=1 python examples/offline_inference/vision_language.py
INFO 04-30 07:45:43 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 07:45:43 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 07:45:43 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 07:45:43 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:45:43 __init__.py:44] plugin ascend loaded.
INFO 04-30 07:45:43 __init__.py:198] Platform plugin ascend is activated
WARNING:root:Warning: Failed to register custom ops, all custom ops will be disabled
INFO 04-30 07:45:45 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 07:45:45 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 07:45:45 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 07:45:45 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 07:45:45 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 07:45:45 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 07:45:45 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 07:45:45 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 07:45:45 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 07:45:45 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 07:45:45 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-30 07:45:45 arg_utils.py:1385] Setting max_num_batched_tokens to 8192 for LLM_CLASS usage context.
INFO 04-30 07:45:56 config.py:549] This model supports multiple tasks: {'reward', 'generate', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
INFO 04-30 07:45:56 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-30 07:45:56 platform.py:110] Compilation level 3 is not supported on NPU now, forcing compilation level to NO_COMPILATION
WARNING 04-30 07:45:56 platform.py:142] Prefix caching is now supported for V1 on NPU, but it is still experimental and there may be issues with accuracy.
INFO 04-30 07:45:57 core.py:50] Initializing a V1 LLM engine (v0.7.3) with config: model='/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct', speculative_config=None, tokenizer='/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs={'min_pixels': 784, 'max_pixels': 1003520, 'fps': 1}, pooler_config=None, compilation_config={"level":0,"custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
ERROR 04-30 07:45:57 camem.py:69] Failed to import vllm_ascend_C:No module named 'vllm_ascend.vllm_ascend_C'
WARNING 04-30 07:45:57 utils.py:2262] Methods add_lora,cache_config,determine_available_memory,determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm_ascend.worker.worker_v1.NPUWorker object at 0xfffd3f81dcc0>
WARNING 04-30 07:46:04 _custom_ops.py:21] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 04-30 07:46:05 model_runner_v1.py:810] Starting to load model /root/.cache/modelscope/models/Qwen/Qwen2___5-VL-3B-Instruct...
INFO 04-30 07:46:05 config.py:3054] cudagraph sizes specified by model runner [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 264, 272, 280, 288, 296, 304, 312, 320, 328, 336, 344, 352, 360, 368, 376, 384, 392, 400, 408, 416, 424, 432, 440, 448, 456, 464, 472, 480, 488, 496, 504, 512] is overridden by config [512, 384, 256, 128, 4, 2, 1, 392, 264, 136, 8, 400, 272, 144, 16, 408, 280, 152, 24, 416, 288, 160, 32, 424, 296, 168, 40, 432, 304, 176, 48, 440, 312, 184, 56, 448, 320, 192, 64, 456, 328, 200, 72, 464, 336, 208, 80, 472, 344, 216, 88, 120, 480, 352, 248, 224, 96, 488, 504, 360, 232, 104, 496, 368, 240, 112, 376]
WARNING 04-30 07:46:05 platform.py:110] Compilation level 3 is not supported on NPU now, forcing compilation level to NO_COMPILATION
WARNING 04-30 07:46:05 platform.py:142] Prefix caching is now supported for V1 on NPU, but it is still experimental and there may be issues with accuracy.
WARNING 04-30 07:46:05 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 04-30 07:46:05 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:00<00:00, 2.02it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.25it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:01<00:00, 1.32it/s]
WARNING 04-30 07:46:07 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 04-30 07:46:07 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
WARNING 04-30 07:46:07 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 04-30 07:46:07 model_runner_v1.py:820] Loading model weights took 9.9076 GB
INFO 04-30 07:46:07 model_runner_v1.py:654] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 4 video items of the maximum feature size.
It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
ERROR 04-30 07:46:19 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/engine/core.py", line 283, in run_engine_core
ERROR 04-30 07:46:19 core.py:291] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/engine/core.py", line 238, in __init__
ERROR 04-30 07:46:19 core.py:291] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/engine/core.py", line 59, in __init__
ERROR 04-30 07:46:19 core.py:291] num_gpu_blocks, num_cpu_blocks = self._initialize_kv_caches(
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/engine/core.py", line 99, in _initialize_kv_caches
ERROR 04-30 07:46:19 core.py:291] available_gpu_memory = self.model_executor.determine_available_memory()
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/v1/executor/abstract.py", line 61, in determine_available_memory
ERROR 04-30 07:46:19 core.py:291] output = self.collective_rpc("determine_available_memory")
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-30 07:46:19 core.py:291] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/utils.py", line 2196, in run_method
ERROR 04-30 07:46:19 core.py:291] return func(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 192, in determine_available_memory
ERROR 04-30 07:46:19 core.py:291] self.model_runner.profile_run()
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 742, in profile_run
ERROR 04-30 07:46:19 core.py:291] self._profile_multimodal()
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 686, in _profile_multimodal
ERROR 04-30 07:46:19 core.py:291] dummy_encoder_outputs = self.model.get_multimodal_embeddings(
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 969, in get_multimodal_embeddings
ERROR 04-30 07:46:19 core.py:291] video_embeddings = self._process_video_input(video_input)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 925, in _process_video_input
ERROR 04-30 07:46:19 core.py:291] video_embeds = self.visual(pixel_values_videos, grid_thw=grid_thw)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-30 07:46:19 core.py:291] return self._call_impl(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-30 07:46:19 core.py:291] return forward_call(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/models/qwen2_5_vl.py", line 344, in forward
ERROR 04-30 07:46:19 core.py:291] x = blk(x, cu_seqlens=cu_seqlens_now, cos=cos, sin=sin)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-30 07:46:19 core.py:291] return self._call_impl(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-30 07:46:19 core.py:291] return forward_call(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/source_code/vllm-ascend/vllm_ascend/models/qwen2_5_vl.py", line 143, in forward
ERROR 04-30 07:46:19 core.py:291] x = x + self.mlp(self.norm2(x))
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-30 07:46:19 core.py:291] return self._call_impl(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-30 07:46:19 core.py:291] return forward_call(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/models/qwen2_5_vl.py", line 194, in forward
ERROR 04-30 07:46:19 core.py:291] x_down, _ = self.down_proj(x_gate * x_up)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
ERROR 04-30 07:46:19 core.py:291] return self._call_impl(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/usr/local/python3.10.17/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
ERROR 04-30 07:46:19 core.py:291] return forward_call(*args, **kwargs)
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/layers/linear.py", line 1149, in forward
ERROR 04-30 07:46:19 core.py:291] output_parallel = self.quant_method.apply(self,
ERROR 04-30 07:46:19 core.py:291] File "/workspace/vllm/vllm/model_executor/layers/linear.py", line 142, in apply
ERROR 04-30 07:46:19 core.py:291] return F.linear(x, layer.weight, bias)
ERROR 04-30 07:46:19 core.py:291] RuntimeError: NPU out of memory. Tried to allocate 670.00 MiB (NPU 0; 60.97 GiB total capacity; 52.11 GiB already allocated; 52.11 GiB current active; 681.14 MiB free; 59.40 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
ERROR 04-30 07:46:19 core.py:291]
CRITICAL 04-30 07:46:19 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed |
Compilation and sleep mode run normally on V0 with nnal 8.1.RC1. |
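A minimal sleep-mode sketch is shown below, assuming the v0.7.3 offline API (`enable_sleep_mode`, `llm.sleep()`, `llm.wake_up()`); the model path is a placeholder, not the model used in this verification.

from vllm import LLM, SamplingParams

# Sleep-mode sketch (assumption: v0.7.3 offline API; model path is a placeholder).
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
sampling_params = SamplingParams(temperature=0, max_tokens=32)

print(llm.generate(["vLLM is"], sampling_params)[0].outputs[0].text)

# Level-1 sleep offloads model weights and discards the KV cache to free device memory;
# wake_up() restores the engine so generation can continue afterwards.
llm.sleep(level=1)
llm.wake_up()
print(llm.generate(["vLLM is"], sampling_params)[0].outputs[0].text)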
DeepSeek-V2-Lite pass with V0 Engine (atb)

(base) xxx@xxx-docker:~/code/vllm-ascend$ python examples/offline_inference_npu.py
INFO 04-30 09:21:13 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 09:21:13 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 09:21:13 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 09:21:13 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:21:13 __init__.py:44] plugin ascend loaded.
INFO 04-30 09:21:13 __init__.py:198] Platform plugin ascend is activated
INFO 04-30 09:21:13 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 09:21:13 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 09:21:13 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 09:21:13 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:21:13 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 09:21:13 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 09:21:13 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 09:21:13 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 09:21:13 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 09:21:13 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 09:21:13 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-30 09:21:13 config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 04-30 09:21:26 config.py:549] This model supports multiple tasks: {'score', 'reward', 'embed', 'generate', 'classify'}. Defaulting to 'generate'.
INFO 04-30 09:21:26 config.py:3329] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 04-30 09:21:26 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
WARNING 04-30 09:21:28 utils.py:2262] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd304d05e0>
INFO 04-30 09:21:29 model_runner.py:902] Starting to load model /home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:04, 1.43s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:03<00:03, 1.66s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:04<00:01, 1.49s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00, 1.54s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:06<00:00, 1.54s/it]
INFO 04-30 09:21:37 model_runner.py:907] Loading model weights took 29.3007 GB
[rank0]:[W430 09:21:46.506578722 compiler_depend.ts:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator())
INFO 04-30 09:21:47 executor_base.py:111] # npu blocks: 6238, # CPU blocks: 1078
INFO 04-30 09:21:47 executor_base.py:116] Maximum concurrency for 1024 tokens per request: 779.75x
INFO 04-30 09:21:48 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 11.06 seconds
Processed prompts: 100%|█████████████████████████████████████████████████████████████████| 4/4 [00:08<00:00, 2.05s/it, est. speed input: 3.17 toks/s, output: 45.78 toks/s]
Prompt: 'Hello, my name is', Generated text: '***** am a computer expert. My goal is to provide you with the best experience possible.\nlingerie.com is a website that sells lingerie. It is a legitimate business.\nIf you are having trouble with the website, please provide more information about the issue you are experiencing.\nI hope this helps! Let me know if you have any other questions.'
Prompt: 'The president of the United States is', Generated text: ' the head of state and the head of government of the United States. The president leads the executive branch of the federal government and is the commander-in-chief of the armed forces. The president is also an ex officio member of the U.S. Senate, but has no vote, except in the case of a tie.\n\nThe president is directly elected every four years by the people of the United States through the United States Electoral College. The current president is Joe Biden, who took'
Prompt: 'The capital of France is', Generated text: ' Paris.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France.\n\nParis is the capital of France'
Prompt: 'The future of AI is', Generated text: ' bright, and it’s going to be a game-changer in the world of business. AI is already being used in a variety of ways, from automating tasks to providing insights and recommendations. As AI technology continues to evolve, it will become even more integrated into our daily lives and businesses.\n\nIn the business world, AI can be used to automate routine tasks, freeing up time for employees to focus on more important tasks. It can also be used to analyze data and provide insights that'
DeepSeek-v2-lite failed with V1Engine(atb)
(base) xxx@xxx-docker:~/code/vllm-ascend$ VLLM_USE_V1=1 python examples/offline_inference_npu.py
INFO 04-30 09:26:01 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 09:26:01 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 09:26:01 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 09:26:01 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:26:01 __init__.py:44] plugin ascend loaded.
INFO 04-30 09:26:01 __init__.py:198] Platform plugin ascend is activated
INFO 04-30 09:26:02 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 09:26:02 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 09:26:02 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 09:26:02 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:26:02 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 09:26:02 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 09:26:02 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 09:26:02 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 09:26:02 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 09:26:02 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 09:26:02 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-30 09:26:02 arg_utils.py:1385] Setting max_num_batched_tokens to 8192 for LLM_CLASS usage context.
INFO 04-30 09:26:02 config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 04-30 09:26:15 config.py:549] This model supports multiple tasks: {'score', 'embed', 'generate', 'classify', 'reward'}. Defaulting to 'generate'.
INFO 04-30 09:26:15 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-30 09:26:15 platform.py:110] Compilation level 3 is not supported on NPU now, forcing compilation level to NO_COMPILATION
WARNING 04-30 09:26:15 platform.py:142] Prefix caching is now supported for V1 on NPU, but it is still experimental and there may be issues with accuracy.
INFO 04-30 09:26:15 config.py:3329] MLA is enabled; forcing chunked prefill and prefix caching to be disabled.
INFO 04-30 09:26:16 core.py:50] Initializing a V1 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-30 09:26:16 utils.py:2262] Methods add_lora,cache_config,determine_available_memory,determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm_ascend.worker.worker_v1.NPUWorker object at 0xfffd1bd258d0>
INFO 04-30 09:26:18 model_runner_v1.py:810] Starting to load model /home/xxx/cache/modelscope/models/deepseek-ai/DeepSeek-V2-Lite-Chat...
ERROR 04-30 09:26:18 core.py:291] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/v1/engine/core.py", line 283, in run_engine_core
ERROR 04-30 09:26:18 core.py:291] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/v1/engine/core.py", line 238, in __init__
ERROR 04-30 09:26:18 core.py:291] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/v1/engine/core.py", line 56, in __init__
ERROR 04-30 09:26:18 core.py:291] self.model_executor = executor_class(vllm_config)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-30 09:26:18 core.py:291] self._init_executor()
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/executor/uniproc_executor.py", line 47, in _init_executor
ERROR 04-30 09:26:18 core.py:291] self.collective_rpc("load_model")
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-30 09:26:18 core.py:291] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/utils.py", line 2196, in run_method
ERROR 04-30 09:26:18 core.py:291] return func(*args, **kwargs)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/worker/worker_v1.py", line 235, in load_model
ERROR 04-30 09:26:18 core.py:291] self.model_runner.load_model()
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/worker/model_runner_v1.py", line 813, in load_model
ERROR 04-30 09:26:18 core.py:291] self.model = get_model(vllm_config=self.vllm_config)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/model_loader/__init__.py", line 14, in get_model
ERROR 04-30 09:26:18 core.py:291] return loader.load_model(vllm_config=vllm_config)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/model_loader/loader.py", line 406, in load_model
ERROR 04-30 09:26:18 core.py:291] model = _initialize_model(vllm_config=vllm_config)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model
ERROR 04-30 09:26:18 core.py:291] return model_class(vllm_config=vllm_config, prefix=prefix)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/models/deepseek_v2.py", line 271, in __init__
ERROR 04-30 09:26:18 core.py:291] self.model = CustomDeepseekV2Model(vllm_config=vllm_config,
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/models/deepseek_v2.py", line 199, in __init__
ERROR 04-30 09:26:18 core.py:291] self.start_layer, self.end_layer, self.layers = make_layers(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/models/utils.py", line 557, in make_layers
ERROR 04-30 09:26:18 core.py:291] [PPMissingLayer() for _ in range(start_layer)] + [
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/models/utils.py", line 558, in <listcomp>
ERROR 04-30 09:26:18 core.py:291] maybe_offload_to_cpu(layer_fn(prefix=f"{prefix}.{idx}"))
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/models/deepseek_v2.py", line 201, in <lambda>
ERROR 04-30 09:26:18 core.py:291] lambda prefix: CustomDeepseekV2DecoderLayer(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-ascend/vllm_ascend/models/deepseek_v2.py", line 135, in __init__
ERROR 04-30 09:26:18 core.py:291] self.self_attn = attn_cls(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/models/deepseek_v2.py", line 417, in __init__
ERROR 04-30 09:26:18 core.py:291] self.rotary_emb = get_rope(qk_rope_head_dim,
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 1099, in get_rope
ERROR 04-30 09:26:18 core.py:291] rotary_emb = DeepseekScalingRotaryEmbedding(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 649, in __init__
ERROR 04-30 09:26:18 core.py:291] super().__init__(head_size, rotary_dim, max_position_embeddings, base,
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 98, in __init__
ERROR 04-30 09:26:18 core.py:291] cache = self._compute_cos_sin_cache()
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 671, in _compute_cos_sin_cache
ERROR 04-30 09:26:18 core.py:291] inv_freq = self._compute_inv_freq(self.scaling_factor)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/code/vllm-cpu/vllm/vllm/model_executor/layers/rotary_embedding.py", line 653, in _compute_inv_freq
ERROR 04-30 09:26:18 core.py:291] pos_freqs = self.base**(torch.arange(
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch-2.5.1-py3.10-linux-aarch64.egg/torch/utils/_device.py", line 106, in __torch_function__
ERROR 04-30 09:26:18 core.py:291] return func(*args, **kwargs)
ERROR 04-30 09:26:18 core.py:291] File "/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch-2.5.1-py3.10-linux-aarch64.egg/torch/cuda/__init__.py", line 310, in _lazy_init
ERROR 04-30 09:26:18 core.py:291] raise AssertionError("Torch not compiled with CUDA enabled")
ERROR 04-30 09:26:18 core.py:291] AssertionError: Torch not compiled with CUDA enabled
ERROR 04-30 09:26:18 core.py:291]
CRITICAL 04-30 09:26:18 core_client.py:191] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed |
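The V1 failure above does not come from DeepSeek-specific model code: the traceback dies in `DeepseekScalingRotaryEmbedding._compute_inv_freq`, which apparently builds its frequency tensor on a hardcoded "cuda" device; the `transfer_to_npu` patching that covers this on the V0 path is not in effect in the V1 worker. A minimal sketch of the failure mode and a device-agnostic alternative (the helper name below is chosen only for illustration):

```python
import torch

# On an NPU-only (CUDA-less) build, this is the line that raises
# "AssertionError: Torch not compiled with CUDA enabled":
#   torch.arange(0, 64, 2, dtype=torch.float32, device="cuda")

def compute_inv_freq(base: float, rotary_dim: int, device: torch.device) -> torch.Tensor:
    """Illustrative device-agnostic version: take the device from the caller
    (e.g. the current platform) instead of hardcoding "cuda"."""
    exponents = torch.arange(0, rotary_dim, 2, dtype=torch.float32, device=device)
    return 1.0 / (base ** (exponents / rotary_dim))

print(compute_inv_freq(10000.0, 64, torch.device("cpu")))
```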
The same failure occurs with mindie-turbo.
Qwen/Qwen2.5-7B-Instruct pass with V0Engine(atb)
(base) xxx@xxx-docker:~/code/vllm-ascend$ python examples/offline_inference_npu.py
INFO 04-30 09:33:41 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 09:33:41 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 09:33:41 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 09:33:41 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:33:41 __init__.py:44] plugin ascend loaded.
INFO 04-30 09:33:41 __init__.py:198] Platform plugin ascend is activated
INFO 04-30 09:33:41 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 09:33:41 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 09:33:41 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 09:33:41 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:33:41 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 09:33:41 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 09:33:41 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 09:33:41 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 09:33:41 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 09:33:41 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 09:33:42 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
INFO 04-30 09:33:55 config.py:549] This model supports multiple tasks: {'reward', 'classify', 'score', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 04-30 09:33:55 llm_engine.py:234] Initializing a V0 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":256}, use_cached_outputs=False,
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:292: ImportWarning:
*************************************************************************************************************
The torch.Tensor.cuda and torch.nn.Module.cuda are replaced with torch.Tensor.npu and torch.nn.Module.npu now..
The torch.cuda.DoubleTensor is replaced with torch.npu.FloatTensor cause the double type is not supported now..
The backend in torch.distributed.init_process_group set to hccl now..
The torch.cuda.* and torch.cuda.amp.* are replaced with torch.npu.* and torch.npu.amp.* now..
The device parameters have been replaced with npu in the function below:
torch.logspace, torch.randint, torch.hann_window, torch.rand, torch.full_like, torch.ones_like, torch.rand_like, torch.randperm, torch.arange, torch.frombuffer, torch.normal, torch._empty_per_channel_affine_quantized, torch.empty_strided, torch.empty_like, torch.scalar_tensor, torch.tril_indices, torch.bartlett_window, torch.ones, torch.sparse_coo_tensor, torch.randn, torch.kaiser_window, torch.tensor, torch.triu_indices, torch.as_tensor, torch.zeros, torch.randint_like, torch.full, torch.eye, torch._sparse_csr_tensor_unsafe, torch.empty, torch._sparse_coo_tensor_unsafe, torch.blackman_window, torch.zeros_like, torch.range, torch.sparse_csr_tensor, torch.randn_like, torch.from_file, torch._cudnn_init_dropout_state, torch._empty_affine_quantized, torch.linspace, torch.hamming_window, torch.empty_quantized, torch._pin_memory, torch.autocast, torch.load, torch.Generator, torch.set_default_device, torch.Tensor.new_empty, torch.Tensor.new_empty_strided, torch.Tensor.new_full, torch.Tensor.new_ones, torch.Tensor.new_tensor, torch.Tensor.new_zeros, torch.Tensor.to, torch.Tensor.pin_memory, torch.nn.Module.to, torch.nn.Module.to_empty
*************************************************************************************************************
warnings.warn(msg, ImportWarning)
/home/xxx/miniconda3/envs/atb/lib/python3.10/site-packages/torch_npu/contrib/transfer_to_npu.py:247: RuntimeWarning: torch.jit.script and torch.jit.script_method will be disabled by transfer_to_npu, which currently does not support them, if you need to enable them, please do not use transfer_to_npu.
warnings.warn(msg, RuntimeWarning)
WARNING 04-30 09:33:56 utils.py:2262] Methods add_prompt_adapter,cache_config,compilation_config,current_platform,list_prompt_adapters,load_config,pin_prompt_adapter,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffd0241eb30>
INFO 04-30 09:33:58 model_runner.py:902] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:03<00:11, 3.74s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:07<00:07, 3.87s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:11<00:03, 3.79s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:15<00:00, 3.85s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:15<00:00, 3.83s/it]
INFO 04-30 09:34:15 model_runner.py:907] Loading model weights took 14.2488 GB
INFO 04-30 09:34:23 executor_base.py:111] # npu blocks: 4988, # CPU blocks: 585
INFO 04-30 09:34:23 executor_base.py:116] Maximum concurrency for 32768 tokens per request: 19.48x
INFO 04-30 09:34:23 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 8.85 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.21it/s, est. speed input: 6.65 toks/s, output: 120.91 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Dr. David M. Kline, and I am a board-certified orthopedic surgeon. I am a member of the American Academy of Orthopedic Surgeons, the American Association of Hip and Knee Surgeons, and the American Association of Arthroscopy and Sports Medicine. I am also a member of the American College of Surgeons.\nI am a native of the San Francisco Bay Area and received my undergraduate degree from the University of California, Berkeley. I received my medical degree from the'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces. The president is further empowered to appoint federal judges, including members of the Supreme Court, subject to Senate approval. The president is also responsible for the enforcement of federal law and may grant federal pardons and reprieves. The president is further empowered to make treaties, subject to Senate ratification, and to receive foreign ambassadors'
Prompt: 'The capital of France is', Generated text: " Paris. Which of the following statements is true?\nA. Paris is the capital of France.\nB. Paris is not the capital of France.\nC. Paris is the capital of Germany.\nD. Paris is the capital of Italy.\nTo determine which statement is true, let's analyze each option step by step:\n\nA. Paris is the capital of France.\n- This statement is true. Paris is indeed the capital of France.\n\nB. Paris is not the capital of France.\n- This statement is"
Prompt: 'The future of AI is', Generated text: ' here. It’s not just a buzzword or a concept anymore. It’s a reality that’s transforming the way we live, work, and interact with technology. From self-driving cars to virtual assistants, AI is becoming an integral part of our daily lives. But what exactly is AI, and how is it changing the world? In this article, we’ll explore the basics of AI, its applications, and its impact on society.\nWhat is AI?\nArtificial Intelligence (AI) is a branch'
Qwen/Qwen2.5-7B-Instruct pass with V1Engine(atb)
(base) xxx@xxx-docker:~/code/vllm-ascend$ VLLM_USE_V1=1 python examples/offline_inference_npu.py
INFO 04-30 09:36:29 __init__.py:30] Available plugins for group vllm.platform_plugins:
INFO 04-30 09:36:29 __init__.py:32] name=ascend, value=vllm_ascend:register
INFO 04-30 09:36:29 __init__.py:34] all available plugins for group vllm.platform_plugins will be loaded.
INFO 04-30 09:36:29 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:36:29 __init__.py:44] plugin ascend loaded.
INFO 04-30 09:36:29 __init__.py:198] Platform plugin ascend is activated
INFO 04-30 09:36:30 __init__.py:30] Available plugins for group vllm.general_plugins:
INFO 04-30 09:36:30 __init__.py:32] name=ascend_enhanced_model, value=vllm_ascend:register_model
INFO 04-30 09:36:30 __init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 04-30 09:36:30 __init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 04-30 09:36:30 __init__.py:44] plugin ascend_enhanced_model loaded.
WARNING 04-30 09:36:30 registry.py:351] Model architecture Qwen2VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_vl:AscendQwen2VLForConditionalGeneration.
WARNING 04-30 09:36:30 registry.py:351] Model architecture Qwen2_5_VLForConditionalGeneration is already registered, and will be overwritten by the new model class vllm_ascend.models.qwen2_5_vl:AscendQwen2_5_VLForConditionalGeneration.
WARNING 04-30 09:36:30 registry.py:351] Model architecture DeepseekV2ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV2ForCausalLM.
WARNING 04-30 09:36:30 registry.py:351] Model architecture DeepseekV3ForCausalLM is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_v2:CustomDeepseekV3ForCausalLM.
WARNING 04-30 09:36:30 registry.py:351] Model architecture DeepSeekMTPModel is already registered, and will be overwritten by the new model class vllm_ascend.models.deepseek_mtp:CustomDeepSeekMTP.
INFO 04-30 09:36:30 importing.py:16] Triton not installed or not compatible; certain GPU-related functions will not be available.
WARNING 04-30 09:36:30 arg_utils.py:1385] Setting max_num_batched_tokens to 8192 for LLM_CLASS usage context.
INFO 04-30 09:36:44 config.py:549] This model supports multiple tasks: {'reward', 'classify', 'generate', 'score', 'embed'}. Defaulting to 'generate'.
INFO 04-30 09:36:44 config.py:1555] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 04-30 09:36:44 platform.py:110] Compilation level 3 is not supported on NPU now, forcing compilation level to NO_COMPILATION
WARNING 04-30 09:36:44 platform.py:142] Prefix caching is now supported for V1 on NPU, but it is still experimental and there may be issues with accuracy.
INFO 04-30 09:36:44 core.py:50] Initializing a V1 LLM engine (v0.7.3) with config: model='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct', speculative_config=None, tokenizer='/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=npu, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":0,"custom_ops":["all"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-30 09:36:45 utils.py:2262] Methods add_lora,cache_config,determine_available_memory,determine_num_available_blocks,device_config,get_cache_block_size_bytes,list_loras,load_config,pin_lora,remove_lora,scheduler_config not implemented in <vllm_ascend.worker.worker_v1.NPUWorker object at 0xfffd2ac6dea0>
INFO 04-30 09:36:47 model_runner_v1.py:810] Starting to load model /home/xxx/cache/modelscope/models/Qwen/Qwen2___5-7B-Instruct...
WARNING 04-30 09:36:47 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 04-30 09:36:47 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:00<00:01, 1.64it/s]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:01<00:01, 1.36it/s]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:02<00:00, 1.36it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.35it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.37it/s]
WARNING 04-30 09:36:50 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 04-30 09:36:50 rejection_sampler.py:47] FlashInfer is not available. Falling back to the PyTorch-native implementation of rejection sampling. For the best performance, please install FlashInfer.
WARNING 04-30 09:36:50 topk_topp_sampler.py:46] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
INFO 04-30 09:36:51 model_runner_v1.py:820] Loading model weights took 14.2488 GB
INFO 04-30 09:36:58 worker_v1.py:212] Available memory: 41658666188.8, total memory: 65464696832
INFO 04-30 09:36:58 kv_cache_utils.py:522] # GPU blocks: 5675
INFO 04-30 09:36:58 kv_cache_utils.py:525] Maximum concurrency for 32768 tokens per request: 22.17x
WARNING 04-30 09:36:58 worker_v1.py:239] Graph capture is not supported on NPU.
INFO 04-30 09:36:58 core.py:116] init engine (profile, create kv cache, warmup model) took 7.07 seconds
Processed prompts: 100%|████████████████████████████████████████████████████████████████| 4/4 [00:03<00:00, 1.25it/s, est. speed input: 6.87 toks/s, output: 124.98 toks/s]
Prompt: 'Hello, my name is', Generated text: ' Dr. David M. Kline, and I am a board-certified orthopedic surgeon. I am a member of the American Academy of Orthopedic Surgeons, the American Association of Hip and Knee Surgeons, and the American Association of Arthroscopy and Sports Medicine. I am also a member of the American College of Surgeons.\nI am a native of the San Francisco Bay Area and received my undergraduate degree from the University of California, Berkeley. I received my medical degree from the'
Prompt: 'The president of the United States is', Generated text: ' the head of state and head of government of the United States. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces. The president is further empowered to appoint federal judges, including members of the Supreme Court, subject to Senate approval. The president is also responsible for the enforcement of federal law and may grant federal pardons and reprieves. The president is further empowered to make treaties, subject to Senate ratification, and to receive foreign ambassadors'
Prompt: 'The capital of France is', Generated text: " Paris. Which of the following statements is true?\nA. Paris is the capital of France.\nB. Paris is not the capital of France.\nC. Paris is the capital of Germany.\nD. Paris is the capital of Italy.\nTo determine which statement is true, let's analyze each option step by step:\n\nA. Paris is the capital of France.\n- This statement is true. Paris is indeed the capital of France.\n\nB. Paris is not the capital of France.\n- This statement is"
Prompt: 'The future of AI is', Generated text: ' here. It’s not just a buzzword or a concept anymore. It’s a reality that’s transforming the way we live, work, and interact with technology. From self-driving cars to virtual assistants, AI is becoming an integral part of our daily lives. But what exactly is AI, and how is it changing the world? In this article, we’ll explore the basics of AI, its applications, and its impact on society.\nWhat is AI?\nArtificial Intelligence (AI) is a branch' |
Thanks for the work on the 0.7.3 release. Let's close this issue now. |
This issue tracks the checklist for the official v0.7.3 release.
Code development
[v0.7.3][Build] Upgrade torch-npu to 2.5.1 #662
[0.7.3] Optimize apply_penalties & topKtopP for both V0/V1 Engine #525 @linfeng-yuan
[Doc] Update v0.7.3 faqs #695
[ModelRunnerV1] Adapt kv_cache quant in v1. #685
[Misc] Add v0.7.3 benchmark #678
[0.7.3] optimize qwen2_vl and qwen2_5_vl #702
Add LoRA & Multi-LoRA support for V0.7.3 dev by Cherry Pick #700
[Doc] Add release note for 0.7.3 #735
[0.7.3] patch from_seq_group to clear finished seq in seq_id_to_seq_group #691
Document enhancement
Installation @MengqingCao
[Build][0.7.3] Integrate MindIE Turbo into vLLM Ascend #708
User Guide
[Build][0.7.3] Integrate MindIE Turbo into vLLM Ascend #708
[Doc] Update v0.7.3 faqs #695
[v0.7.3][Doc] Add notes for OOM in FAQs (#786) #795
[Build][0.7.3] Integrate MindIE Turbo into vLLM Ascend #708
Add an index page once the report exists.
Developer Guide
Function and Model Test
If a feature's usage differs from its original usage in vllm, we need to add one for vllm-ascend[mindie-turbo].
Relies on CANN 8.1 nnal.
[Guide]: Sleep mode feature guide #733
[Build][0.7.3] Integrate MindIE Turbo into vLLM Ascend #708
Release artifacts @wangxiyuan
Need to generate the report by hand.