[Bugfix][ROCm] Use chunked_prefill_paged_decode as fallback for V1 attention on ROCm #18093
Conversation
Signed-off-by: kf <kuanfu.liu@embeddedllm.com>
This is another quick solution to issue #18088.
cc @tdoublep
Thanks. If chunked_prefill_paged_decode gives better performance than the unified triton attention kernel on the ROCm side, we can go with this.
Benchmark configuration: meta-llama/Llama-4-Scout-17B-16E-Instruct -tp 4 --max-model-len 32768 --max_seq_len_to_capture 32768 --no-enable-prefix-caching --max-num-batched-tokens 32768
The best solution is the fallback approach. The correctness of all three approaches has been validated by running lm_eval on GSM8K on both the Llama4 and Mixtral models.
lm_eval of this branch: [2025-05-15 08:15:06] INFO evaluation_tracker.py:272: Output path not provided, skipping saving results aggregated
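For reference, a GSM8K check like the one mentioned above can be run through the lm-evaluation-harness Python API with its vLLM backend; this is only a sketch, and the model path and tensor-parallel size are assumptions taken from the benchmark configuration quoted above, not from the PR itself:

```python
# Sketch of a GSM8K accuracy check with lm-evaluation-harness using its vLLM
# backend. The model name and tensor_parallel_size are assumptions based on
# the benchmark configuration above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct,"
        "tensor_parallel_size=4"
    ),
    tasks=["gsm8k"],
)

# Print the GSM8K metrics (e.g. exact-match accuracy) for comparison
# between the fallback kernel and the unified triton attention kernel.
print(results["results"]["gsm8k"])
```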
I am fine with having both PRs to address the original issue related to the relatively new unified triton attention: (1) one for the completeness of the unified triton attention, addressing the power-of-two edge cases (#18100), and (2) another to address the performance regression on ROCm (this PR).
Can we disable auto-merge until we confirm that the performance is better on Llama4 after the updated unified triton attention fix?
@ProExpertProg the updated benchmarking result was posted to the Slack chat (in summary, the fallback option still performs better than the updated unified triton attention fix); here is the screenshot: https://files.slack.com/files-pri/T07QH46AC91-F08SXGHN0TT/image.png
I'm OK with merging this. Will try to figure out why the unified kernel is not performant in this case.
On ROCm, vLLM's V1 engine uses the unified attention kernel as its sole attention backend. However, this kernel currently fails for models where the ratio of query heads to key-value heads is not a power of two. Models like Llama-4-Scout, whose num_queries_per_kv evaluates to an odd number, therefore fail to run and error out on ROCm. This PR addresses the issue by adding back the chunked_prefill_paged_decode kernel as a fallback for cases where the input tensor shapes are incompatible with the unified attention kernel.
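For readers unfamiliar with the dispatch involved, here is a minimal sketch of the fallback decision this PR describes; the function names are illustrative placeholders rather than the actual vLLM identifiers, and the head counts in the example are assumptions:

```python
# Minimal sketch of the kernel-selection logic described in this PR.
# Names (select_attention_kernel, is_power_of_two) are illustrative
# placeholders, not the real vLLM identifiers.

def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0

def select_attention_kernel(num_query_heads: int, num_kv_heads: int) -> str:
    # GQA ratio: how many query heads share each key-value head.
    num_queries_per_kv = num_query_heads // num_kv_heads
    if is_power_of_two(num_queries_per_kv):
        return "unified_attention"            # regular V1 path on ROCm
    return "chunked_prefill_paged_decode"     # fallback restored by this PR

# Example: a GQA configuration with 40 query heads and 8 KV heads (assumed,
# Llama-4-Scout-like) gives num_queries_per_kv = 5, which is not a power of
# two, so the fallback kernel is selected.
print(select_attention_kernel(40, 8))  # -> chunked_prefill_paged_decode
```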