Refactor pplx init logic to make it modular (prepare for deepep) #18200
Conversation
@varun-sundar-rabindranath @bnellnm please help review, thx!
@@ -158,7 +158,6 @@ def check_and_update_config(cls, vllm_config: "VllmConfig") -> None:
                "currently not supported with CUDA Graphs.")
            vllm_config.model_config.enforce_eager = True
            compilation_config.use_cudagraph = False
-           compilation_config.use_inductor = False
Removing this now will break things if eager mode is not used.
Although, vllm_config.model_config.enforce_eager = True can be removed. I didn't want to land the PR with that just in case there were other issues.
Does vllm_config.model_config.enforce_eager = True cover us? Actually, do things break if we set --enforce-eager but also make torch.compile happen via the compilation config?
Afaict, if inductor is on, then it'll break no matter what other options are set. But cudagraphs + the eager backend work just fine.
Let's be sure that removing the compilation_config.use_inductor = False doesn't break anything - otherwise lgtm.
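For illustration, a minimal sketch of the compatibility rule this thread converges on (inductor unsupported, cudagraphs with the eager backend fine). The helper name and the stand-in flags class are hypothetical simplifications, not the code in check_and_update_config:

from dataclasses import dataclass

# Hypothetical stand-in for the CompilationConfig fields touched in the diff above.
@dataclass
class CompilationFlags:
    use_cudagraph: bool = True
    use_inductor: bool = True

def limit_compilation_for_pplx(flags: CompilationFlags) -> CompilationFlags:
    # Per the discussion: inductor breaks regardless of other options, so it
    # stays off; cudagraphs + the eager backend are reported to work, so
    # use_cudagraph is left as configured and enforce_eager is not forced.
    flags.use_inductor = False
    return flags

print(limit_compilation_for_pplx(CompilationFlags()))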
The procedure now: during distributed environment initialization, and only for the EP group when expert parallel is enabled, the cuda communicator creates the all2all manager; the remaining per-layer setup happens after the model is created.
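A rough sketch of that two-step flow. Only all2all_manager, dispatch/combine, and init_prepare_finalize/select_gemm_impl appear in this PR; the class names and bodies below are illustrative simplifications, not the actual vLLM classes:

class CudaCommunicator:
    """Step 1: during distributed environment initialization, only the EP
    group's communicator creates an all2all manager."""

    def __init__(self, is_ep_group: bool, use_expert_parallel: bool):
        self.all2all_manager = None
        if is_ep_group and use_expert_parallel:
            self.all2all_manager = "all2all manager"  # placeholder object

class FusedMoELayer:
    """Step 2: after the model is created, each MoE layer finishes its own
    setup against the manager (init_prepare_finalize / select_gemm_impl)."""

    def __init__(self, communicator: CudaCommunicator):
        self.prepare_finalize = None
        if communicator.all2all_manager is not None:
            self.prepare_finalize = "per-layer prepare/finalize"  # placeholder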
     def dispatch(
         self, hidden_states: torch.Tensor,
         router_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
-        assert self.all2all_impl is not None
-        hidden_states, router_logits = self.all2all_impl.dispatch(
+        assert self.all2all_manager is not None
+        hidden_states, router_logits = self.all2all_manager.dispatch(
             hidden_states, router_logits)
         return hidden_states, router_logits

     def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        assert self.all2all_impl is not None
-        hidden_states = self.all2all_impl.combine(hidden_states)
+        assert self.all2all_manager is not None
+        hidden_states = self.all2all_manager.combine(hidden_states)
         return hidden_states
How would these methods work if we weren't using the naive manager? e.g. the pplx all2all object might have a different instance for each layer.
Yes, the dispatch/combine in DeviceCommunicatorBase is not used for the pplx kernel, and I agree with your prepare/finalize call inside every layer now. I will try to remove dispatch/combine from DeviceCommunicatorBase.
Regarding "I will try to remove dispatch/combine in the DeviceCommunicatorBase": this would be in a future PR. We need a prepare_finalize for the naive all2all implementation first, and then we can remove these functions from DeviceCommunicatorBase.
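To make the contrast concrete, a hedged sketch of the two shapes being discussed: one shared naive manager whose dispatch/combine is called from the communicator for every layer, versus a pplx-style per-layer prepare/finalize object. Class names and bodies are illustrative only:

import torch

class NaiveAll2AllManager:
    # One shared instance; the device communicator calls dispatch/combine
    # on behalf of every MoE layer (the naive path kept in this PR).
    def dispatch(self, hidden_states: torch.Tensor,
                 router_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        return hidden_states, router_logits

    def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states

class PerLayerPrepareFinalize:
    # pplx-style: each MoE layer owns its own object, so layer-specific
    # state (buffers, expert maps, ...) can live here instead of in the
    # communicator.
    def __init__(self, num_experts: int):
        self.num_experts = num_experts  # stands in for per-layer state

    def prepare(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        return hidden_states, router_logits

    def finalize(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states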
    def select_gemm_impl(
        self, prepare_finalize: Optional[FusedMoEPrepareAndFinalize]
    ) -> FusedMoEPermuteExpertsUnpermute:
        # based on the all2all implementation, select the appropriate
        # gemm implementation
        raise NotImplementedError
I think there should be some sort of error logging here so it's obvious that the combination of pplx + a particular MoE implementation is not supported. Or maybe in init_prepare_finalize?
Why here? This is the base class, and it just shows the interface. Selection of gemm kernels for pplx is in UnquantizedFusedMoEMethod.select_gemm_impl.
It doesn't necessarily have to be here, but it would be nice to get more than a NotImplementedError exception.
added comments in 5b4095b
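For reference, a sketch of the kind of descriptive failure asked for above, reusing the base-class signature shown earlier; the message text is illustrative and not necessarily what 5b4095b added:

from typing import Optional

class FusedMoEMethodBase:  # simplified stand-in for the base class above
    def select_gemm_impl(self, prepare_finalize: Optional[object]):
        # Subclasses (e.g. UnquantizedFusedMoEMethod) override this to pick a
        # gemm implementation for the active all2all backend; failing with a
        # descriptive message makes unsupported pplx + MoE combinations obvious.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement select_gemm_impl, so it "
            "cannot be used with the pplx all2all backend.")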
This pull request has merge conflicts that must be resolved before it can be merged.
Follow-up after #15956: refactor pplx-related logic to make it modular.