Fix Whisper crash caused by invalid max_num_batched_tokens config #17853
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. 🚀
This looks reasonable to me, but pinging @NickLucche @ywang96 just to be sure.
Signed-off-by: inkcherry <mingzhi.liu@intel.com>
Thanks for looking into this! Left one comment.
I am not sure this is what is happening with the bug though, as max-model-len (448) * max-num-seqs (2) is still below the default max_num_batched_tokens (5120).
One other important thing: Whisper's max-model-len refers to the decoder transcription length.
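For reference, here is the arithmetic behind that observation, using the values quoted in this thread (a quick illustrative check, not code from the PR):

# Whisper values quoted above (illustrative only).
max_model_len = 448                     # decoder transcription length
max_num_seqs = 2
default_max_num_batched_tokens = 5120

# max_model_len * max_num_seqs = 896, which is indeed below the default budget of 5120.
assert max_model_len * max_num_seqs == 896
assert max_model_len * max_num_seqs < default_max_num_batched_tokens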
# Ensure max_num_batched_tokens does not exceed model limit.
# Some models (e.g., Whisper) have embeddings tied to max length.
self.max_num_batched_tokens = min(
    self.max_num_seqs * self.max_model_len,
    self.max_num_batched_tokens)
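As a quick usage note (my own illustration, not part of the diff): with the Whisper values quoted earlier in this thread, the clamp lowers the default budget like this:

# min(max_num_seqs * max_model_len, max_num_batched_tokens) with the values from this thread.
min(2 * 448, 5120)   # -> 896, so max_num_batched_tokens drops from 5120 to 896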
I feel like we should only warn the user rather than silently set max_num_batched_tokens.
Also, checking the limit below right after it was upper-bounded here seems wasteful, doesn't it?
Thanks for the review!
The crash occurs during the memory profiling stage: the model performs an execution using max_num_batched_tokens / max_num_seqs tokens per sequence, but this length may exceed the embedding position limit, see https://github.com/vllm-project/vllm/blob/376786fac1fc50e8d788a39a91fa28d1709ad48b/vllm/model_executor/models/whisper.py#L416C7-L416C59. Therefore, we should ensure that max_num_batched_tokens <= max_num_seqs * max_model_len.
- For default settings, we take the minimum value to ensure safety. (Note: triggering this clipping typically requires both max_num_seqs and max_model_len to be small, so it does not affect the vast majority of use cases.)
- For user-defined settings, I've replaced the error check with a warning instead.
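For readers following the thread, here is a minimal sketch of the behaviour described above, assuming a hypothetical helper name (resolve_max_num_batched_tokens is mine, not vLLM's API); the actual change lives in the scheduler config:

import logging
from typing import Optional

logger = logging.getLogger(__name__)

def resolve_max_num_batched_tokens(user_value: Optional[int],
                                   max_num_seqs: int,
                                   max_model_len: int,
                                   default: int = 5120) -> int:
    """Hypothetical helper mirroring the behaviour described above.

    Memory profiling executes roughly max_num_batched_tokens / max_num_seqs
    tokens per sequence, so the budget should not exceed
    max_num_seqs * max_model_len for models (e.g. Whisper) whose position
    embeddings are tied to max_model_len.
    """
    limit = max_num_seqs * max_model_len
    if user_value is None:
        # Default settings: silently take the safe minimum.
        return min(default, limit)
    if user_value > limit:
        # User-defined settings: warn instead of raising an error.
        logger.warning(
            "max_num_batched_tokens (%d) exceeds max_num_seqs * "
            "max_model_len (%d); memory profiling may exceed the model's "
            "position embedding limit.", user_value, limit)
    return user_value

With the defaults discussed above, resolve_max_num_batched_tokens(None, 2, 448) would return 896.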
LGTM. This should in fact not affect other models.
I was worried this could have an impact on future enc-dec support for v1, but that is not today's problem. Thanks!
…ig (vllm-project#17853) Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: 汪志鹏 <wangzhipeng628@gmail.com>
…ig (vllm-project#17853) Signed-off-by: inkcherry <mingzhi.liu@intel.com> Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
…ig (vllm-project#17853) Signed-off-by: inkcherry <mingzhi.liu@intel.com>
fix #17797