
[Bug]: Triton Error in multiproc_executor.py when running llama4 on ROCm #18088


Closed
tjtanaa opened this issue May 13, 2025 · 5 comments
Labels
bug Something isn't working rocm Related to AMD ROCm

Comments

@tjtanaa
Contributor

tjtanaa commented May 13, 2025

Your current environment

The output of python collect_env.py
PyTorch version: 2.7.0+rocm6.4.1.lw.git6fd40786
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.4.43483-e0d58c107

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.26.4
Libc version: glibc-2.35

Python version: 3.10.12 (main, Feb  4 2025, 14:57:36) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-116-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: AMD Instinct MI300X (gfx942:sramecc+:xnack-)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.4.43483
MIOpen runtime version: 3.4.0
Is XNNPACK available: True


Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pytorch-triton-rocm==3.3.0+rocm6.4.1.gitef7e8700
[pip3] pyzmq==26.4.0
[pip3] torch==2.7.0+rocm6.4.1.lw.git6fd40786
[pip3] torchao==0.10.0
[pip3] torchaudio==2.7.0+rocm6.4.1.git654fee8f
[pip3] torchvision==0.22.0+rocm6.4.1.git9eb57cd5
[pip3] transformers==4.51.3
[pip3] triton==3.3.0
[conda] Could not collect
ROCM Version: 6.4.43483-e0d58c107
Neuron SDK Version: N/A
vLLM Version: 0.1.dev6387+g74b8ddf.d20250513 (git sha: 74b8ddf, date: 20250513)
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Enabled; Neuron: Disabled


TORCHINDUCTOR_MAX_AUTOTUNE_POINTWISE=1
NCCL_MIN_NCHANNELS=112
TORCHINDUCTOR_MAX_AUTOTUNE=1
PYTORCH_ROCM_ARCH=gfx90a;gfx942
TORCH_BLAS_PREFER_HIPBLASLT=1
LD_LIBRARY_PATH=/opt/rocm/aotriton/lib:/opt/rocm/lib:
VLLM_USE_TRITON_FLASH_ATTN=0
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

🐛 Describe the bug

When running the following script

```python
# SPDX-License-Identifier: Apache-2.0
import argparse
import dataclasses

# from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.engine.arg_utils import EngineArgs
from vllm.utils import FlexibleArgumentParser


def main(args: argparse.Namespace):
    print(args)

    engine_args = EngineArgs.from_cli_args(args)

    # NOTE(woosuk): If the request cannot be processed in a single batch,
    # the engine will automatically process the request in multiple batches.
    llm = LLM(**dataclasses.asdict(engine_args))

    sampling_params = SamplingParams(
        n=args.n,
        temperature=0,
        top_p=1.0,
        ignore_eos=True,
        max_tokens=args.output_len,
    )
    print(sampling_params)

    # tokenizer = AutoTokenizer.from_pretrained(engine_args.model)
    # inputs = tokenizer('Hello, world!', return_tensors='pt').input_ids
    inputs = [
        "Hello, my name is",
        "The president of the United States is",
        ("1 + " * 50) + " 1 = ",  # Longer prompt.
        "The capital of France is",
    ]
    # Prompt 0: 'Hello, my name is',
    # Generated text: ' John and I am a 30-year-old man from the United States. I am a software engineer by profession and I have been working in the tech industry for about 5 years now. I am married to a wonderful woman named Sarah, and we have two beautiful children together. We live in a cozy little house in the suburbs, and we love spending time outdoors and exploring new places.\n\nI am a bit of a introvert and I enjoy spending time alone, reading books, watching movies, and playing video games. I am also a bit of a foodie and I love trying out new recipes and experimenting with different cuisines. I'   # noqa: E501
    # Prompt 1: 'The president of the United States is',
    # Generated text: ' the head of state and head of government of the United States. The president directs the executive branch of the federal government and is the commander-in-chief of the United States Armed Forces.\nThe president is elected by the people through the Electoral College to a four-year term, and is one of only two nationally elected federal officers, the other being the Vice President of the United States. The Twenty-second Amendment to the United States Constitution prohibits anyone from being elected to the presidency more than twice.\nThe president is both the head of state and head of government of the United States, and is the leader of the executive branch of the federal government. The president'   # noqa: E501
    # Prompt 2: '1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 +  1 = ',   # noqa: E501
    # Generated text: "50\nThe answer is 50.<|start_header_id|>assistant<|end_header_id|>\n\nThat's correct!\n\nYou added 50 ones together, and the result is indeed 50. Well done!\n\nWould you like to try another math problem?<|start_header_id|>assistant<|end_header_id|>\n\nI can generate a new problem for you. Here it is:\n\n2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 = ?\n\nCan you add up all the"   # noqa: E501
    # Prompt 3: 'The capital of France is',
    # Generated text: " a city of love, art, fashion, and cuisine. Paris, the City of Light, is a must-visit destination for anyone who appreciates beauty, history, and culture. From the iconic Eiffel Tower to the world-famous Louvre Museum, there's no shortage of things to see and do in this incredible city.\nHere are some of the top attractions and experiences to add to your Parisian itinerary:\n1. The Eiffel Tower: This iconic iron lattice tower is a symbol of Paris and one of the most recognizable landmarks in the world. Take the elevator to the top for breathtaking views of the city.\n2"   # noqa: E501

    outputs = llm.generate(inputs, sampling_params)
    for i, output in enumerate(outputs):
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt {i}: {prompt!r}, Generated text: {generated_text!r}")
    # print(tokenizer.decode(outputs[0]))


if __name__ == '__main__':
    parser = FlexibleArgumentParser(
        description='Benchmark the latency of processing a single batch of '
        'requests till completion.')
    parser.add_argument('--input-len', type=int, default=32)
    parser.add_argument('--output-len', type=int, default=128)
    parser.add_argument('--batch-size', type=int, default=8)
    parser.add_argument('--n',
                        type=int,
                        default=1,
                        help='Number of generated sequences per prompt.')
    parser.add_argument('--use-beam-search', action='store_true')
    parser.add_argument('--num-iters-warmup',
                        type=int,
                        default=10,
                        help='Number of iterations to run for warmup.')
    parser.add_argument('--num-iters',
                        type=int,
                        default=30,
                        help='Number of iterations to run.')
    parser.add_argument(
        '--profile',
        action='store_true',
        help='profile the generation process of a single batch')
    parser.add_argument(
        '--profile-result-dir',
        type=str,
        default=None,
        help=('path to save the pytorch profiler output. Can be visualized '
              'with ui.perfetto.dev or Tensorboard.'))
    parser.add_argument(
        '--output-json',
        type=str,
        default=None,
        help='Path to save the latency results in JSON format.')

    parser = EngineArgs.add_cli_args(parser)
    args = parser.parse_args()
    main(args)
```

launched with the following command:

```bash
echo "meta-llama/Llama-4-Scout-17B-16E-Instruct aiter v1"
VLLM_USE_V1=1 \
VLLM_USE_TRITON_FLASH_ATTN=0 \
VLLM_ROCM_USE_AITER=1 \
VLLM_RPC_TIMEOUT=18000 \
VLLM_ROCM_USE_AITER_RMSNORM=0 \
SAFETENSORS_FAST_GPU=1 \
python3 test_accuracy.py \
--model meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 2 --gpu_memory_utilization 0.95 \
--max-model-len 10000 --no-enable-chunked-prefill \
--input-len 32 --output-len 128 --batch-size 8 --n 1 --num-iters-warmup 10 --num-iters 10
```


It throws the following error:
```console
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] Traceback (most recent call last):
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 34, in wrapper
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return fn(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/triton/language/core.py", line 1451, in arange
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return semantic.arange(start, end, _builder)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/triton/language/semantic.py", line 623, in arange
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     raise ValueError("arange's range must be a power of 2")
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] ValueError: arange's range must be a power of 2
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] 
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] The above exception was the direct cause of the following exception:
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] 
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] Traceback (most recent call last):
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/v1/executor/multiproc_executor.py", line 517, in worker_busy_loop
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     output = func(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return func(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/v1/worker/gpu_worker.py", line 276, in execute_model
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     output = self.model_runner.execute_model(scheduler_output,
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return func(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/v1/worker/gpu_model_runner.py", line 1156, in execute_model
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     model_output = self.model(
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/model_executor/models/mllama4.py", line 768, in forward
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return self.language_model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/model_executor/models/llama.py", line 558, in forward
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     model_output = self.model(input_ids, positions, intermediate_tensors,
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/compilation/decorators.py", line 245, in __call__
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     model_output = self.forward(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/model_executor/models/llama.py", line 345, in forward
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     def forward(
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/_dynamo/eval_frame.py", line 838, in _fn
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return fn(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 406, in __call__
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     raise e
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 393, in __call__
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "<eval_with_key>.98", line 446, in forward
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     submod_1 = self.submod_1(getitem, s0, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 830, in call_wrapped
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return self._wrapped_call(self, *args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 406, in __call__
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     raise e
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/fx/graph_module.py", line 393, in __call__
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1751, in _wrapped_call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return self._call_impl(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1762, in _call_impl
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return forward_call(*args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "<eval_with_key>.2", line 5, in forward
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query_2, key_2, value, output_1, 'language_model.model.layers.0.self_attn.attn');  query_2 = key_2 = value = output_1 = unified_attention_with_output = None
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/torch/_ops.py", line 1158, in __call__
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return self._op(*args, **(kwargs or {}))
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/attention/layer.py", line 425, in unified_attention_with_output
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     self.impl.forward(self,
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/v1/attention/backends/triton_attn.py", line 201, in forward
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     unified_attention(
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/app/aiter-ck-moe-2-stage/vllm/attention/ops/triton_unified_attention.py", line 294, in unified_attention
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     kernel_unified_attention_2d[(
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 348, in <lambda>
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 569, in run
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     kernel = self.compile(src, target=target, options=options.__dict__)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 278, in compile
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     module = src.make_ir(options, codegen_fns, module_map, context)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]   File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 81, in make_ir
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] triton.compiler.errors.CompilationError: at 67:13:
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     q_block_local_idx = q_block_global_idx - q_block_start_idx
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] 
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     cur_batch_in_all_start_index = tl.load(query_start_len_ptr + seq_idx)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     cur_batch_in_all_stop_index = tl.load(query_start_len_ptr + seq_idx + 1)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] 
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     cur_batch_query_len = cur_batch_in_all_stop_index \
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]         - cur_batch_in_all_start_index
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] 
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     if q_block_local_idx * BLOCK_Q >= cur_batch_query_len:
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]         return
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522] 
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]     offs_m = tl.arange(0, BLOCK_Q * num_queries_per_kv)
(VllmWorker rank=1 pid=239571) ERROR 05-13 16:11:49 [multiproc_executor.py:522]              ^
(VllmWorker rank=0 pid=239570) ERROR 05-13 16:11:49 [multiproc_executor.py:522] WorkerProc hit an exception.
[... worker rank=0 (pid=239570) then printed the same traceback and CompilationError as rank=1 above ...]
ERROR 05-13 16:11:49 [dump_input.py:68] Dumping input data
ERROR 05-13 16:11:49 [dump_input.py:70] V1 LLM engine (v0.1.dev6387+g74b8ddf.d20250513) with config: model='meta-llama/Llama-4-Scout-17B-16E-Instruct', speculative_config=None, tokenizer='meta-llama/Llama-4-Scout-17B-16E-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=meta-llama/Llama-4-Scout-17B-16E-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level": 3, "custom_ops": ["none"], "splitting_ops": ["vllm.unified_attention", "vllm.unified_attention_with_output"], "compile_sizes": [], "inductor_compile_config": {"enable_auto_functionalized_v2": false}, "use_cudagraph": true, "cudagraph_num_of_warmups": 1, "cudagraph_capture_sizes": [512, 504, 496, 488, 480, 472, 464, 456, 448, 440, 432, 424, 416, 408, 400, 392, 384, 376, 368, 360, 352, 344, 336, 328, 320, 312, 304, 296, 288, 280, 272, 264, 256, 248, 240, 232, 224, 216, 208, 200, 192, 184, 176, 168, 160, 152, 144, 136, 128, 120, 112, 104, 96, 88, 80, 72, 64, 56, 48, 40, 32, 24, 16, 8, 4, 2, 1], "max_capture_size": 512}, 
ERROR 05-13 16:11:49 [dump_input.py:78] Dumping scheduler output for model execution:
ERROR 05-13 16:11:49 [dump_input.py:79] SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=0,prompt_token_ids_len=6,mm_inputs=[],mm_hashes=[],mm_positions=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=True, max_tokens=128, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None),block_ids=[1],num_computed_tokens=0,lora_request=None)],scheduled_cached_reqs=[],num_scheduled_tokens={0: 6},total_num_scheduled_tokens=6,scheduled_spec_decode_tokens={},scheduled_encoder_inputs={},num_common_prefix_blocks=1,finished_req_ids=[],free_encoder_input_ids=[],structured_output_request_ids={},grammar_bitmask=null,kv_connector_metadata=null)
ERROR 05-13 16:11:49 [core.py:432] EngineCore encountered a fatal error.
ERROR 05-13 16:11:49 [core.py:432] Traceback (most recent call last):
ERROR 05-13 16:11:49 [core.py:432]   File "/app/aiter-ck-moe-2-stage/vllm/v1/engine/core.py", line 423, in run_engine_core
ERROR 05-13 16:11:49 [core.py:432]     engine_core.run_busy_loop()
ERROR 05-13 16:11:49 [core.py:432]   File "/app/aiter-ck-moe-2-stage/vllm/v1/engine/core.py", line 447, in run_busy_loop
ERROR 05-13 16:11:49 [core.py:432]     self._process_engine_step()
ERROR 05-13 16:11:49 [core.py:432]   File "/app/aiter-ck-moe-2-stage/vllm/v1/engine/core.py", line 472, in _process_engine_step
ERROR 05-13 16:11:49 [core.py:432]     outputs = self.step_fn()
ERROR 05-13 16:11:49 [core.py:432]   File "/app/aiter-ck-moe-2-stage/vllm/v1/engine/core.py", line 226, in step
ERROR 05-13 16:11:49 [core.py:432]     model_output = self.execute_model(scheduler_output)
ERROR 05-13 16:11:49 [core.py:432]   File "/app/aiter-ck-moe-2-stage/vllm/v1/engine/core.py", line 213, in execute_model
ERROR 05-13 16:11:49 [core.py:432]     raise err
ERROR 05-13 16:11:49 [core.py:432]   File "/app/aiter-ck-moe-2-stage/vllm/v1/engine/core.py", line 207, in execute_model
ERROR 05-13 16:11:49 [core.py:432]     return self.model_executor.execute_model(scheduler_output)
ERROR 05-13 16:11:49 [core.py:432]   File "/app/aiter-ck-moe-2-stage/vllm/v1/executor/multiproc_executor.py", line 158, in execute_model
ERROR 05-13 16:11:49 [core.py:432]     (output, ) = self.collective_rpc("execute_model",
ERROR 05-13 16:11:49 [core.py:432]   File "/app/aiter-ck-moe-2-stage/vllm/v1/executor/multiproc_executor.py", line 215, in collective_rpc
ERROR 05-13 16:11:49 [core.py:432]     result = get_response(w, dequeue_timeout)
ERROR 05-13 16:11:49 [core.py:432]   File "/app/aiter-ck-moe-2-stage/vllm/v1/executor/multiproc_executor.py", line 202, in get_response
ERROR 05-13 16:11:49 [core.py:432]     raise RuntimeError(
ERROR 05-13 16:11:49 [core.py:432] RuntimeError: Worker failed with error 'at 67:13:
ERROR 05-13 16:11:49 [core.py:432]     q_block_local_idx = q_block_global_idx - q_block_start_idx
ERROR 05-13 16:11:49 [core.py:432] 
ERROR 05-13 16:11:49 [core.py:432]     cur_batch_in_all_start_index = tl.load(query_start_len_ptr + seq_idx)
ERROR 05-13 16:11:49 [core.py:432]     cur_batch_in_all_stop_index = tl.load(query_start_len_ptr + seq_idx + 1)
ERROR 05-13 16:11:49 [core.py:432] 
ERROR 05-13 16:11:49 [core.py:432]     cur_batch_query_len = cur_batch_in_all_stop_index \
ERROR 05-13 16:11:49 [core.py:432]         - cur_batch_in_all_start_index
ERROR 05-13 16:11:49 [core.py:432] 
ERROR 05-13 16:11:49 [core.py:432]     if q_block_local_idx * BLOCK_Q >= cur_batch_query_len:
ERROR 05-13 16:11:49 [core.py:432]         return
ERROR 05-13 16:11:49 [core.py:432] 
ERROR 05-13 16:11:49 [core.py:432]     offs_m = tl.arange(0, BLOCK_Q * num_queries_per_kv)
ERROR 05-13 16:11:49 [core.py:432]              ^', please check the stack trace above for the root cause
Traceback (most recent call last):
  File "/app/vllmtests/test_accuracy.py", line 92, in <module>
    main(args)
  File "/app/vllmtests/test_accuracy.py", line 46, in main
    outputs = llm.generate(inputs, sampling_params)
  File "/app/aiter-ck-moe-2-stage/vllm/utils.py", line 1207, in inner
    return fn(*args, **kwargs)
  File "/app/aiter-ck-moe-2-stage/vllm/entrypoints/llm.py", line 479, in generate
    outputs = self._run_engine(use_tqdm=use_tqdm)
  File "/app/aiter-ck-moe-2-stage/vllm/entrypoints/llm.py", line 1464, in _run_engine
    step_outputs = self.llm_engine.step()
  File "/app/aiter-ck-moe-2-stage/vllm/v1/engine/llm_engine.py", line 223, in step
    outputs = self.engine_core.get_output()
  File "/app/aiter-ck-moe-2-stage/vllm/v1/engine/core_client.py", line 575, in get_output
    raise self._format_exception(outputs) from None
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
```
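The failing line is offs_m = tl.arange(0, BLOCK_Q * num_queries_per_kv): Triton requires the range passed to tl.arange to be a compile-time power of 2. A minimal, hypothetical repro of that constraint (standalone snippet, not vLLM code; the value 80 is only an example of a non-power-of-2 tile size):

```python
# Hypothetical standalone repro: tl.arange(0, N) only accepts a power-of-2 N,
# so compiling the kernel with a non-power-of-2 N fails with the same
# "arange's range must be a power of 2" error seen in the log above.
import torch
import triton
import triton.language as tl


@triton.jit
def arange_kernel(out_ptr, N: tl.constexpr):
    offs = tl.arange(0, N)          # N must be a power of 2
    tl.store(out_ptr + offs, offs)


out = torch.empty(128, dtype=torch.int32, device="cuda")
arange_kernel[(1,)](out, N=64)      # compiles and runs: 64 is a power of 2
arange_kernel[(1,)](out, N=80)      # raises triton.compiler.errors.CompilationError,
                                    # caused by ValueError: arange's range must be a power of 2
```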

@tjtanaa tjtanaa added the bug Something isn't working label May 13, 2025
@hongxiayang
Collaborator

Triton 3.3 is OK. Check the Triton shipped in this nightly image: rocm/vllm-dev:nightly_main_20250512

Name: triton
Version: 3.3.0+git981e987e
Summary: A language and compiler for custom Deep Learning operations
Home-page: https://github.com/triton-lang/triton/
Author: Philippe Tillet
Author-email: phil@openai.com
License: 
Location: /usr/local/lib/python3.12/dist-packages
Requires: setuptools
Required-by: xgrammar
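
For a quick comparison against that image, the installed Triton build can be checked directly from Python (a trivial helper, included only for convenience):

```python
# Print the Triton build in the current environment; the reporter's environment
# and the nightly image above both show a 3.3.0-based build.
import triton

print(triton.__version__)
```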

@hongxiayang hongxiayang added the rocm Related to AMD ROCm label May 13, 2025
@hongxiayang
Collaborator

Actually, it is a padding issue; see #16828.
It may need something like triton.next_power_of_2(BLOCK_Q * num_queries_per_kv) in triton_unified_attention.py.
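
A rough sketch of that padding idea on a toy kernel (hypothetical code, not the actual triton_unified_attention.py change): pad the logical tile size to triton.next_power_of_2(...) so tl.arange is legal, and mask off the padded lanes.

```python
# Hypothetical illustration of the suggested fix: tl.arange needs a power-of-2
# range, so the logical tile (standing in for BLOCK_Q * num_queries_per_kv) is
# padded to the next power of 2 and the extra lanes are masked out.
import torch
import triton
import triton.language as tl


@triton.jit
def copy_tile_kernel(x_ptr, y_ptr, n, TILE: tl.constexpr, TILE_PAD: tl.constexpr):
    pid = tl.program_id(0)
    lane = tl.arange(0, TILE_PAD)           # TILE_PAD is a power of 2
    offs = pid * TILE + lane
    mask = (lane < TILE) & (offs < n)       # "offset mask" hiding the padding
    vals = tl.load(x_ptr + offs, mask=mask)
    tl.store(y_ptr + offs, vals, mask=mask)


def copy_with_padded_tile(x: torch.Tensor, tile: int = 80) -> torch.Tensor:
    # tile=80 stands in for a non-power-of-2 BLOCK_Q * num_queries_per_kv.
    y = torch.empty_like(x)
    tile_pad = triton.next_power_of_2(tile)             # 80 -> 128
    grid = (triton.cdiv(x.numel(), tile),)
    copy_tile_kernel[grid](x, y, x.numel(), TILE=tile, TILE_PAD=tile_pad)
    return y
```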

@hongxiayang
Collaborator

cc @tdoublep

@tjtanaa
Contributor Author

tjtanaa commented May 15, 2025

@hongxiayang @tdoublep

We compared three approaches using the following configuration:

meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 4 --max-model-len 32768 --max_seq_len_to_capture 32768 --no-enable-prefix-caching --max-num-batched-tokens 32768

[Screenshot of the comparison is attached in the original comment.]

| # | Approach | Reference |
| --- | --- | --- |
| 1 | Pad `BLOCK_Q * num_queries_per_kv` with an offset mask | https://github.com/EmbeddedLLM/vllm/tree/fix-unified-attention-triton |
| 2 | Pad `BLOCK_Q * num_queries_per_kv` without an offset mask | PR #18100 |
| 3 | Fall back to the previously used kernel `chunked_prefill_paged_decode` | PR #18093 |

The best solution is to fall back to the previously used kernel (approach 3).

The correctness of all three approaches has been validated by running lm_eval on GSM8K with both the Llama4 and Mixtral models.
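
For context, the GSM8K check can be reproduced with lm-evaluation-harness's vLLM backend. A sketch along these lines (assumed API usage; the exact arguments used for the validation above are not recorded in this issue):

```python
# Assumed lm-evaluation-harness usage for a GSM8K accuracy check with vLLM;
# the actual command/arguments behind the validation above are not shown here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct,"
        "tensor_parallel_size=4,max_model_len=32768"
    ),
    tasks=["gsm8k"],
)
print(results["results"]["gsm8k"])
```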

@tjtanaa
Contributor Author

tjtanaa commented May 16, 2025

Resolved by #18093

@tjtanaa tjtanaa closed this as completed May 16, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in Llama-4 Issues & Bugs May 16, 2025