[Bugfix] [ROCm]: Remove assertion logic when using AITER fused moe in unquantizedMethod to reenable LLama4 BF16 #18205
Conversation
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Resolved by: #18161
@LucasWilkinson PR #18161 is about a performance regression. It does not resolve the issue addressed in this PR.
@LucasWilkinson Please take a look at this error log. I am using the latest vLLM commit, which contains the PR you shared, and the error still occurs with Llama4.
Apologies! Misread the PR, that is my bad!
LGTM, sorry for the confusion! Got my threads crossed
I will make a note to share error traces upfront to avoid confusion in future PRs. 😄
Ideally, can you please add the gsm8k (or, even better, mmlu) accuracy results using the AITER backend here? Just to confirm the weights are being applied correctly.
@LucasWilkinson vllm (pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct,tensor_parallel_size=8,max_model_len=10000,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
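For reference, the configuration line above looks like an lm-evaluation-harness run; below is a minimal sketch of reproducing that accuracy check via the harness's Python API. The exact invocation (lm_eval.simple_evaluate, the gsm8k task, and the model_args string) is an assumption inferred from the settings quoted above, not something stated in this thread, and the accuracy numbers themselves are not reproduced here.

```python
# Hedged sketch: reproducing the accuracy check with lm-evaluation-harness.
# The model_args string mirrors the settings quoted above; the task choice
# (gsm8k) follows the reviewer's request and is an assumption here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=meta-llama/Llama-4-Scout-17B-16E-Instruct,"
        "tensor_parallel_size=8,max_model_len=10000,trust_remote_code=True"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["gsm8k"])  # per-metric accuracy for the AITER-backed run
```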
… unquantizedMethod to reenable LLama4 BF16 (vllm-project#18205) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: Gregory Shtrasberg <Gregory.Shtrasberg@amd.com>
Remove the assert introduced in PR #15956 to unblock running the Llama4 BF16 model with AITER fused MoE.
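For illustration only, here is a self-contained sketch of the kind of guard being dropped. The function shape is hypothetical rather than the actual vLLM UnquantizedFusedMoEMethod code, and rocm_aiter_fused_moe below is a stub stand-in for the real AITER kernel.

```python
import torch


def rocm_aiter_fused_moe(hidden_states: torch.Tensor) -> torch.Tensor:
    # Stand-in stub for the AITER fused MoE kernel (not the real implementation).
    return hidden_states


def forward_unquantized(hidden_states: torch.Tensor, use_aiter: bool) -> torch.Tensor:
    if use_aiter:
        # Before this PR, an assertion here (introduced in #15956) rejected
        # configurations such as Llama4 BF16; removing it lets the call
        # dispatch straight to the AITER fused MoE path.
        return rocm_aiter_fused_moe(hidden_states)
    # Otherwise fall back to the default fused-experts path.
    return hidden_states


out = forward_unquantized(torch.randn(4, 8, dtype=torch.bfloat16), use_aiter=True)
```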
@bnellnm May we know the intention behind adding the assertions on the rocm_aiter_fused_moe code path?

Error when running Llama4: