
[Bugfix][ROCm] Use chunked_prefill_paged_decode as fallback for V1 attention on ROCm #18093


Merged
2 commits merged into vllm-project:main on May 16, 2025

Conversation

@kliuae (Contributor) commented May 13, 2025

On ROCm, vLLM’s V1 engine uses the unified attention kernel as its sole attention backend. At the moment, however, this kernel fails for models where the ratio of query heads to key-value heads is not a power of two. This makes models like Llama-4-Scout, whose num_queries_per_kv evaluates to an odd number, fail to run on ROCm with the following error:

offs_m = tl.arange(0, BLOCK_Q * num_queries_per_kv)
         ^
ValueError: arange's range must be a power of 2

This PR addresses this issue by adding back the chunked_prefill_paged_decode kernel as a fallback for cases where the input tensor shapes are incompatible with the unified attention kernel.
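To make the dispatch concrete, here is a minimal sketch of the fallback selection described above. The helper and the returned kernel names are illustrative rather than vLLM's actual backend-selection API, and the sketch assumes BLOCK_Q is itself a power of two, so the tl.arange range is a power of two exactly when num_queries_per_kv is.

```python
def _is_power_of_two(n: int) -> bool:
    # Power-of-two check via the classic bit trick.
    return n > 0 and (n & (n - 1)) == 0


def select_v1_attention_kernel(num_query_heads: int, num_kv_heads: int) -> str:
    """Illustrative only: pick a kernel based on the query/KV head ratio."""
    num_queries_per_kv = num_query_heads // num_kv_heads
    if _is_power_of_two(num_queries_per_kv):
        # tl.arange(0, BLOCK_Q * num_queries_per_kv) is legal here (BLOCK_Q is a power of two).
        return "unified_attention"
    # Non-power-of-two ratios (e.g. 40 query heads / 8 KV heads -> 5) hit the
    # ValueError above, so fall back to chunked_prefill_paged_decode.
    return "chunked_prefill_paged_decode"


assert select_v1_attention_kernel(40, 8) == "chunked_prefill_paged_decode"
assert select_v1_attention_kernel(32, 8) == "unified_attention"
```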

kliuae added 2 commits May 13, 2025 16:24
Signed-off-by: kf <kuanfu.liu@embeddedllm.com>
Signed-off-by: kf <kuanfu.liu@embeddedllm.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of tests to quickly catch errors. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the v1 label May 13, 2025
@houseroad houseroad added the rocm label May 14, 2025
@tjtanaa (Contributor) commented May 14, 2025

@hongxiayang

This is another quick solution to issue #18088. Falling back to chunked_prefill_paged_decode could be a quick fix, as it has been used extensively in previous Llama4 use cases.

@hongxiayang (Collaborator) commented:

cc @tdoublep

@hongxiayang (Collaborator) commented:

> @hongxiayang
>
> This is another quick solution to issue #18088. Falling back to chunked_prefill_paged_decode could be a quick fix, as it has been used extensively in previous Llama4 use cases.

Thanks. If chunked_prefill_paged_decode performs better than the unified Triton attention kernel on the ROCm side, we can go with this.

@tjtanaa (Contributor) commented May 15, 2025

@hongxiayang @tdoublep

Benchmark configuration: meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 4 --max-model-len 32768 --max_seq_len_to_capture 32768 --no-enable-prefix-caching --max-num-batched-tokens 32768

[Image: benchmark results comparing the three approaches]

| # | Approach | Reference |
|---|----------|-----------|
| 1 | Pad BLOCK_Q * num_queries_per_kv with an offset mask | https://github.com/EmbeddedLLM/vllm/tree/fix-unified-attention-triton |
| 2 | Pad BLOCK_Q * num_queries_per_kv without an offset mask | PR #18100 |
| 3 | Fall back to the previously used kernel chunked_prefill_paged_decode | PR #18093 (this PR) |

The best solution is to fall back (approach 3).

The correctness of all three approaches has been validated by running lm_eval on GSM8K with both Llama4 and Mixtral models.
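For context on approaches 1 and 2 in the table above, the host-side padding idea can be sketched roughly as below; the function name is hypothetical, and only triton.next_power_of_2 is a real Triton helper. The kernel would then discard the padded query rows, either with an explicit offset mask (approach 1) or via its existing bounds masks (approach 2).

```python
import triton


def queries_per_kv_with_padding(num_query_heads: int, num_kv_heads: int) -> tuple[int, int]:
    """Return (actual, padded) num_queries_per_kv so tl.arange gets a power-of-two range."""
    num_queries_per_kv = num_query_heads // num_kv_heads
    # Pad up so that tl.arange(0, BLOCK_Q * padded) is legal inside the kernel.
    padded = triton.next_power_of_2(num_queries_per_kv)
    return num_queries_per_kv, padded


print(queries_per_kv_with_padding(40, 8))  # (5, 8): 3 of the 8 padded rows must be masked out
```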

@tjtanaa (Contributor) commented May 15, 2025

lm_eval results for this branch:

[2025-05-15 08:15:06] INFO evaluation_tracker.py:272: Output path not provided, skipping saving results aggregated
vllm (pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct,tensor_parallel_size=8,max_model_len=10000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|-------|---------|--------|--------|--------|-------|--------|
| gsm8k | 3 | flexible-extract | 5 | exact_match | 0.9189 | ± 0.0075 |
| gsm8k | 3 | strict-match | 5 | exact_match | 0.9014 | ± 0.0082 |
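For reference, the numbers above could roughly be reproduced with lm-evaluation-harness; the sketch below uses its Python simple_evaluate entry point with the same model_args string shown in this comment (exact argument names may vary across lm_eval versions).

```python
import lm_eval

# Evaluate GSM8K (5-shot) on the vLLM backend, mirroring the configuration above.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct,"
        "tensor_parallel_size=8,max_model_len=10000,trust_remote_code=True"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])
```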

@hongxiayang (Collaborator) commented:

I am fine with having both PRs address the original issue with the relatively new unified Triton attention: (1) one for the completeness of the unified Triton attention kernel, addressing the power-of-two edge cases (#18100), and (2) another to address the performance regression on ROCm (this PR).

@DarkLight1337 DarkLight1337 enabled auto-merge (squash) May 15, 2025 16:05
@github-actions github-actions bot added the ready label May 15, 2025
@DarkLight1337 DarkLight1337 added this to the v0.9.0 milestone May 15, 2025
@ProExpertProg (Contributor) commented:

Can we disable auto-merge until we confirm that performance is better on Llama4 after the unified_attention fix that landed this morning? #18161

@hongxiayang (Collaborator) commented:

@ProExpertProg The updated benchmarking results were posted to the Slack chat (in summary, the fallback option still performs better than the updated unified Triton attention fix); here is the screenshot:
[Image: updated benchmark results]

https://files.slack.com/files-pri/T07QH46AC91-F08SXGHN0TT/image.png

@tdoublep (Member) commented:

I'm OK with merging this. Will try to figure out why the unified kernel is not performant in this case.

@vllm-bot vllm-bot merged commit ee659e3 into vllm-project:main May 16, 2025
86 of 90 checks passed
Labels
ready, rocm, v1

8 participants