Refactor pplx init logic to make it modular (prepare for deepep) #18200
Conversation
@varun-sundar-rabindranath @bnellnm please help review, thx!
@@ -158,7 +158,6 @@ def check_and_update_config(cls, vllm_config: "VllmConfig") -> None:
                "currently not supported with CUDA Graphs.")
            vllm_config.model_config.enforce_eager = True
            compilation_config.use_cudagraph = False
-           compilation_config.use_inductor = False
Removing this now will break things if eager mode is not used.
Although, vllm_config.model_config.enforce_eager = True can be removed. I didn't want to land the PR with that just in case there were other issues.
Does vllm_config.model_config.enforce_eager = True cover us? Actually, do things break if we set --enforce-eager but also make torch.compile happen via the compilation config?
Afaict, if inductor is on, then it'll break no matter what other options are set. But cudagraphs + the eager backend work just fine.
Let's be sure that removing the compilation_config.use_inductor = False doesn't break anything - otherwise lgtm.
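For illustration, a minimal sketch of the compatibility rule this thread converges on (inductor unsupported, cudagraphs with the eager backend fine). The helper name and the stand-in flags class are hypothetical simplifications, not the code in check_and_update_config:

from dataclasses import dataclass

# Hypothetical stand-in for the CompilationConfig fields touched in the diff above.
@dataclass
class CompilationFlags:
    use_cudagraph: bool = True
    use_inductor: bool = True

def limit_compilation_for_pplx(flags: CompilationFlags) -> CompilationFlags:
    # Per the discussion: inductor breaks regardless of other options, so it
    # stays off; cudagraphs + the eager backend are reported to work, so
    # use_cudagraph is left as configured and enforce_eager is not forced.
    flags.use_inductor = False
    return flags

print(limit_compilation_for_pplx(CompilationFlags()))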
The procedure now: during distributed environment initialization, and only for the EP group when expert parallel is enabled, the cuda communicator creates the all2all manager; the remaining per-layer setup happens after the model is created.
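A rough sketch of that two-step flow. Only all2all_manager, dispatch/combine, and init_prepare_finalize/select_gemm_impl appear in this PR; the class names and bodies below are illustrative simplifications, not the actual vLLM classes:

class CudaCommunicator:
    """Step 1: during distributed environment initialization, only the EP
    group's communicator creates an all2all manager."""

    def __init__(self, is_ep_group: bool, use_expert_parallel: bool):
        self.all2all_manager = None
        if is_ep_group and use_expert_parallel:
            self.all2all_manager = "all2all manager"  # placeholder object

class FusedMoELayer:
    """Step 2: after the model is created, each MoE layer finishes its own
    setup against the manager (init_prepare_finalize / select_gemm_impl)."""

    def __init__(self, communicator: CudaCommunicator):
        self.prepare_finalize = None
        if communicator.all2all_manager is not None:
            self.prepare_finalize = "per-layer prepare/finalize"  # placeholder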
     def dispatch(
         self, hidden_states: torch.Tensor,
         router_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
-        assert self.all2all_impl is not None
-        hidden_states, router_logits = self.all2all_impl.dispatch(
+        assert self.all2all_manager is not None
+        hidden_states, router_logits = self.all2all_manager.dispatch(
             hidden_states, router_logits)
         return hidden_states, router_logits

     def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
-        assert self.all2all_impl is not None
-        hidden_states = self.all2all_impl.combine(hidden_states)
+        assert self.all2all_manager is not None
+        hidden_states = self.all2all_manager.combine(hidden_states)
         return hidden_states
How would these methods work if we weren't using the naive manager? e.g. the pplx all2all object might have a different instance for each layer.
Yes, the dispatch/combine in DeviceCommunicatorBase is not used for the pplx kernel, and I agree with your prepare/finalize call inside every layer now. I will try to remove dispatch/combine from DeviceCommunicatorBase.
Regarding "I will try to remove dispatch/combine in the DeviceCommunicatorBase": this would be in a future PR. We need a prepare_finalize for the naive all2all implementation first, and then we can remove these functions from DeviceCommunicatorBase.
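To make the contrast concrete, a hedged sketch of the two shapes being discussed: one shared naive manager whose dispatch/combine is called from the communicator for every layer, versus a pplx-style per-layer prepare/finalize object. Class names and bodies are illustrative only:

import torch

class NaiveAll2AllManager:
    # One shared instance; the device communicator calls dispatch/combine
    # on behalf of every MoE layer (the naive path kept in this PR).
    def dispatch(self, hidden_states: torch.Tensor,
                 router_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        return hidden_states, router_logits

    def combine(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states

class PerLayerPrepareFinalize:
    # pplx-style: each MoE layer owns its own object, so layer-specific
    # state (buffers, expert maps, ...) can live here instead of in the
    # communicator.
    def __init__(self, num_experts: int):
        self.num_experts = num_experts  # stands in for per-layer state

    def prepare(self, hidden_states: torch.Tensor,
                router_logits: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        return hidden_states, router_logits

    def finalize(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states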
    def select_gemm_impl(
        self, prepare_finalize: Optional[FusedMoEPrepareAndFinalize]
    ) -> FusedMoEPermuteExpertsUnpermute:
        # based on the all2all implementation, select the appropriate
        # gemm implementation
        raise NotImplementedError
I think there should be some sort of error logging here so it's obvious that the combination of pplx + a particular MoE implementation is not supported. Or maybe in init_prepare_finalize?
Why here? This is the base class, and it just shows the interface. Selection of gemm kernels for pplx is in UnquantizedFusedMoEMethod.select_gemm_impl.
It doesn't necessarily have to be here, but it would be nice to get more than a NotImplementedError exception.
added comments in 5b4095b
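For reference, a sketch of the kind of descriptive failure asked for above, reusing the base-class signature shown earlier; the message text is illustrative and not necessarily what 5b4095b added:

from typing import Optional

class FusedMoEMethodBase:  # simplified stand-in for the base class above
    def select_gemm_impl(self, prepare_finalize: Optional[object]):
        # Subclasses (e.g. UnquantizedFusedMoEMethod) override this to pick a
        # gemm implementation for the active all2all backend; failing with a
        # descriptive message makes unsupported pplx + MoE combinations obvious.
        raise NotImplementedError(
            f"{type(self).__name__} does not implement select_gemm_impl, so it "
            "cannot be used with the pplx all2all backend.")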
This pull request has merge conflicts that must be resolved before it can be merged.
Follow-up after #15956: refactor pplx-related logic to make it modular.