
Add NeuronxDistributedInference support, Speculative Decoding, Dynamic on-device sampling #16357


Merged
merged 44 commits into vllm-project:main from upstream-neuron-vllm-04-08 on May 7, 2025

Conversation

aws-satyajith
Contributor

@aws-satyajith aws-satyajith commented Apr 9, 2025

Add the following Neuron features as a part of RFC #15970 :

  1. NeuronX Distributed (NxD) Inference Support

    1. Allow customers to select a framework based on preference or availability. Default to neuronx-distributed-inference (NxD); if it is unavailable, fall back to transformers-neuronx (TNx).
    2. Support inference with NxD by adding worker/neuronx_distributed_model_runner.py.
    3. Add a framework-detection utility that returns the framework currently in use (a sketch of this detection logic appears after this list).
  2. Speculative Decoding

    1. To enable speculative decoding with NxD, we added worker/multi_step_neuronx_distributed_model_runner.py.
    2. To enable speculative decoding with TNx, we added worker/multi_step_neuron_model_runner.py. This model runner is selected in neuron_worker.py when speculation is enabled.
  3. Dynamic On-device Sampling

    1. Extract the sampling params (top_k, top_p, temperature) and pass them to execute_model().
  4. Multi-node Tensor Parallelism Inference

    1. The communication between master and worker nodes happens at two layers,
      the control plane layer for metadata communication (i.e., the input prompts) from master node to work nodes. Specifically, enable do_metadata_broadcast, while supplying conversion methods from ModelInputForNeuron to broadcast-able dictionary and vice versa in neuron_model_runner.py.
      the Neuron backend (i.e., NxD, TNx) is doing all the collectives operations in model forward.
      Examples of usage can be found in [examples/neuron/multi_node](https://github.com/aws-neuron/upstreaming-to-
    2. vllm/tree/neuron-2.22-vllm-v0.7.2/examples/neuron/multi_node)
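
As an illustration of the framework selection described in item 1, here is a minimal sketch of the detection logic. The get_neuron_framework_to_use name comes from this PR's worker utilities, but the NeuronFramework enum, the probing via importlib, and the module names assumed importable (neuronx_distributed_inference, transformers_neuronx) are illustrative assumptions rather than the merged implementation:

import importlib.util
from enum import Enum


class NeuronFramework(Enum):
    NEURONX_DISTRIBUTED_INFERENCE = "neuronx-distributed-inference"
    TRANSFORMERS_NEURONX = "transformers-neuronx"


def get_neuron_framework_to_use() -> NeuronFramework:
    """Prefer NxD when it is installed; otherwise fall back to TNx."""
    if importlib.util.find_spec("neuronx_distributed_inference") is not None:
        return NeuronFramework.NEURONX_DISTRIBUTED_INFERENCE
    if importlib.util.find_spec("transformers_neuronx") is not None:
        return NeuronFramework.TRANSFORMERS_NEURONX
    raise RuntimeError(
        "Neither neuronx-distributed-inference nor transformers-neuronx is "
        "installed; one of them is required for the Neuron backend.")

neuron_worker.py can then branch on the returned value to pick the matching model runner.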

RFC: #15970

Note: This is a fixed version of #16043. The following issues have been fixed in this revision:

  1. Eliminated unnecessary commits.
  2. Signed off each commit.
  3. Resolved merge conflicts that arose in the past few days.

Aaron Dou and others added 30 commits April 9, 2025 16:49

github-actions bot commented Apr 9, 2025

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation and ci/build labels Apr 9, 2025
@aws-satyajith
Contributor Author

@liangfu @zhuohan123 @youkaichao @alexm-redhat @comaniac @njhill

Could you please review this PR? There is an RFC related to this as well for additional context and discussion: #15970

@robertgshaw2-redhat robertgshaw2-redhat self-assigned this Apr 18, 2025
Contributor

@liangfu liangfu left a comment

Thank you @aws-satyajith for contributing. I left a few comments in the PR.

These comments specifically concern:

  • Isolate the num_lookahead_slots-related change into a separate PR.
  • Move the get_neuron_framework_to_use function from vllm/worker/utils.py to vllm/platforms/neuron.py.
  • Isolate the multi-node example into a separate PR, since its behavior does not seem to be consistent with other hardware backends.
  • To avoid environment-variable type errors, it would be better to define environment-variable types in vllm/envs.py with a VLLM_NEURON_ prefix.

In addition, I feel there are a number of features/components bundled in this PR.
I propose breaking the bundled PR down into a few individual components/features:

1/ Introduce neuronx-distributed-inference as a dependency for the Neuron backend and replace the existing transformers-neuronx-based implementation (for simplicity), with a basic test to ensure the integration does not break in the future.
2/ Add on-device sampling support, with a test script.
3/ Add speculative decoding support, with a test script.

If we do not remove the transformers-neuronx-based implementation, there would be:
a/ four model_runner scripts in the worker directory, and
b/ two packages (transformers-neuronx and neuronx-distributed-inference) whose behavior may or may not be consistent across different features and configurations.


# Use mpirun to trigger inference on head/worker nodes

/opt/amazon/openmpi/bin/mpirun \
Contributor

similar to #8692

I would propose staying consistent with the ecosystem and leveraging the interfaces described in https://docs.vllm.ai/en/latest/serving/distributed_serving.html

Contributor Author

Thanks for linking the PR and docs. I checked them out and read the associated comments. In summary:

  1. You would propose that we should be able to perform multi-node inference using
vllm serve /path/to/the/model/in/the/container \
--tensor-parallel-size 128
  2. We should document the -x fields below and explain the reason for the setup.

Both of these require reworking the feature significantly. The previous PR was closed because the above pending items were not addressed. Is this a correct understanding of the situation?

Contributor

yes, that's a good understanding.

Contributor Author

I discussed this with @mrinalks and we decided not to include multi-node support in this PR, as it needs to be heavily reworked. I'll remove all the multi-node-specific code in the next revision.


# Create an LLM.
llm = LLM(
    model=TARGET_MODEL_PATH,
Contributor

Stay consistent with the other offline scripts?

e.g.

def main():
    # ... some details ...

    llm = LLM(
        model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        ...
    )

Contributor Author

Just confirming: do you mean we should change the model to TinyLlama, or that we should remove the TARGET_MODEL_PATH constant and use the string inline?

Contributor

It's the latter, since demonstrating EAGLE isn't going to be feasible with TinyLlama.
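
For reference, a minimal offline-inference sketch in the inlined style suggested above; the model name, tensor-parallel size, and sampling values are placeholders rather than the example shipped with this PR:

from vllm import LLM, SamplingParams


def main():
    # Model string is inlined, matching the other offline examples; the
    # specific model and settings below are placeholders only.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        tensor_parallel_size=32,
        max_num_seqs=4,
    )
    sampling_params = SamplingParams(top_k=10, top_p=0.95, temperature=0.8)
    outputs = llm.generate(["The president of the United States is"],
                           sampling_params)
    for output in outputs:
        print(output.outputs[0].text)


if __name__ == "__main__":
    main()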

from vllm.transformers_utils.configs.eagle import (
    EAGLEConfig)
if isinstance(self.draft_model_config.hf_config,
-             EAGLEConfig):
+             EAGLEConfig) or current_platform.is_neuron():
Contributor

Can we eliminate the device-specific changes in vllm/config.py?

Contributor

+1, we should move this Neuron config to the override.

Contributor Author

@mrinalks We currently use self.draft_model_config.hf_config in multiple places. Moving this to override_neuron_config would mean deviating from the existing flow and would require comprehensive re-testing to ensure we don't miss any parameters.

@liangfu EAGLEConfig was not present when we implemented EAGLE support on Neuron, hence the exception. I'll take a look at whether we can remove this exception for Neuron.

Contributor Author

This is a valid change that we'll implement and test internally first. We will address supporting EAGLEConfig in a follow-up commit. I'm adding a comment on RFC #15970 to keep track of this change.

Comment on lines 393 to 396
if self.device_config.device_type == "neuron":
    num_lookahead_slots = self.scheduler_config.num_lookahead_slots
else:
    num_lookahead_slots = 0
Contributor

Can we eliminate the device-specific changes in vllm/engine/llm_engine.py?

Contributor Author

We can probably move this into override_neuron_config. I'll look into making that change.

Contributor Author

Thank you for bringing this up @liangfu. I looked into this a little deeper, and it has potential implications for non-Neuron workflows. Some notes:

num_lookahead_slots is part of self.scheduler_config, and we're setting it to 0 for all non-Neuron cases irrespective of the actual value.
This is problematic for two reasons:

  1. If someone uses num_lookahead_slots in the StopChecker for a non-Neuron workflow in the future, they will not see the value from scheduler_config.num_lookahead_slots; instead they'll see 0.
  2. We want to avoid hardware-dependent exceptions in StopChecker, as you pointed out in the parent comment.

I'll check further and see if we can remove this dependency altogether or come up with a non-impacting solution.

Contributor Author

Discussed this further with @elaineyz and identified steps forward to address both of the above points. I'll need some additional time, but I'll include these changes in the next revision.

Contributor Author

Agreed with Liangfu to remove num_lookahead_slots from llm_engine and stop_checker. This will impact customers using ignore_eos = True; we will address that issue separately as part of RFC #15970.

Leaving a comment on the RFC as well for tracking.

Comment on lines 52 to 53
self.rank = int(os.getenv("NEURON_RANK_ID",
                          DEFAULT_NEURON_RANK_ID))
Contributor

If we would like to keep these environment variables, I think it's better to be consistent with environment_variables in vllm/envs.py,

similar to

    VLLM_ROCM_USE_AITER: bool = False
    VLLM_ROCM_USE_AITER_PAGED_ATTN: bool = False
    VLLM_ROCM_USE_AITER_LINEAR: bool = True
    VLLM_ROCM_USE_AITER_MOE: bool = True
    VLLM_ROCM_USE_AITER_RMSNORM: bool = True
    VLLM_ROCM_USE_AITER_MLA: bool = True
    VLLM_ROCM_USE_SKINNY_GEMM: bool = True
    VLLM_ROCM_FP8_PADDING: bool = True
    VLLM_ROCM_MOE_PADDING: bool = True
    VLLM_ROCM_CUSTOM_PAGED_ATTN: bool = True

We may introduce the VLLM_NEURON_ prefix for Neuron-specific environment variables.

Contributor

+1

Comment on lines 45 to 47
self.enable_neuron_multi_node = (os.getenv(
    "ENABLE_NEURON_MULTI_NODE",
    DEFAULT_ENABLE_NEURON_MULTI_NODE).lower() == "true")
Contributor

It's error-prone to lowercase the string and compare it with "true".

I propose defining environment-variable types in https://github.com/vllm-project/vllm/blob/main/vllm/envs.py

Contributor

+1
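
For context, a rough sketch of what typed, VLLM_NEURON_-prefixed entries could look like if registered in the lambda-based environment_variables mapping that vllm/envs.py uses; the VLLM_NEURON_RANK_ID and VLLM_NEURON_MULTI_NODE names and their defaults are hypothetical, not part of this PR:

import os
from typing import Any, Callable

# Hypothetical Neuron-specific entries following the vllm/envs.py pattern;
# names and defaults are illustrative only.
neuron_environment_variables: dict[str, Callable[[], Any]] = {
    # Rank of this node in a multi-node Neuron deployment (int, default 0).
    "VLLM_NEURON_RANK_ID":
    lambda: int(os.getenv("VLLM_NEURON_RANK_ID", "0")),

    # Whether multi-node Neuron inference is enabled (bool, default False);
    # accepting "1"/"true" avoids the fragile .lower() == "true" comparison.
    "VLLM_NEURON_MULTI_NODE":
    lambda: os.getenv("VLLM_NEURON_MULTI_NODE", "0").lower() in ("1", "true"),
}

Call sites would then read the typed values from this mapping instead of parsing raw strings inline.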

Contributor

@liangfu liangfu left a comment

I think it's hard to move forward as-is, since there are quite a lot of code changes bundled in this PR.
In addition, the community intends to "fully remove V0" as part of the Q2 plan (#15735).

@robertgshaw2-redhat @simon-mo @comaniac what do you think?

@comaniac
Collaborator

> I think it's hard to move forward as-is, since there are quite a lot of code changes bundled in this PR. In addition, the community intends to "fully remove V0" as part of the Q2 plan (#15735).
>
> @robertgshaw2-redhat @simon-mo @comaniac what do you think?

Sorry for overlooking this, and I second @liangfu. Specifically, we are trying to freeze and deprecate V0 in Q2 if possible, so we are not in favor of taking in large changes other than bug fixes for V0. However, we would be happy to discuss an RFC for Neuron integration in V1. Also cc @WoosukKwon

@mrinalks
Contributor

> > I think it's hard to move forward as-is, since there are quite a lot of code changes bundled in this PR. In addition, the community intends to "fully remove V0" as part of the Q2 plan (#15735).
> >
> > @robertgshaw2-redhat @simon-mo @comaniac what do you think?
>
> Sorry for overlooking this, and I second @liangfu. Specifically, we are trying to freeze and deprecate V0 in Q2 if possible, so we are not in favor of taking in large changes other than bug fixes for V0. However, we would be happy to discuss an RFC for Neuron integration in V1. Also cc @WoosukKwon

We have agreement from @robertgshaw2-redhat (and, I believe, +simon-mo and +woosuk as well) to merge the NxDI V0 changes as a final RFC to vLLM, and to then take a snapshot of the final changes for longer-term V0 support for customers who need more time to migrate to V1 due to performance reasons on Neuron.

Meanwhile, we are also prepping a Q2 roadmap to align with the vLLM maintainers (I believe @simon-mo asked for it a few weeks back), which focuses entirely on V1 efforts. All in all, we are aligned that V0 is getting deprecated; we just need our changes merged, functional, and archived so we can cleanly switch over to V1 as we work towards the V1 architecture + Neuron (performant drops). More on this in the next couple of weeks!

mrinalks and others added 2 commits May 2, 2025 14:12
Contributor

@liangfu liangfu left a comment

Thank you @aws-satyajith for the update. I'm okay with the proposed change. I look forward to working together on the follow-up RFC for NxDI V1 support.

@mrinalks
Contributor

mrinalks commented May 7, 2025

@simon-mo @WoosukKwon @robertgshaw2-redhat Could you please review or approve this pull request so we can merge our final V0 RFC?
Per your earlier request, @liangfu has also reviewed and blessed the pull request.

@simon-mo simon-mo merged commit 043e4c4 into vllm-project:main May 7, 2025
25 checks passed
@mrinalks
Contributor

mrinalks commented May 7, 2025

Thanks @simon-mo 🥳

@aws-satyajith aws-satyajith deleted the upstream-neuron-vllm-04-08 branch May 7, 2025 17:31
RichardoMrMu pushed a commit to RichardoMrMu/vllm that referenced this pull request May 12, 2025
…c on-device sampling (vllm-project#16357)

Signed-off-by: Satyajith Chilappagari <satchill@amazon.com>
Co-authored-by: Aaron Dou <yzdou@amazon.com>
Co-authored-by: Shashwat Srijan <sssrijan@amazon.com>
Co-authored-by: Chongming Ni <chongmni@amazon.com>
Co-authored-by: Amulya Ballakur <amulyaab@amazon.com>
Co-authored-by: Patrick Lange <patlange@amazon.com>
Co-authored-by: Elaine Zhao <elaineyz@amazon.com>
Co-authored-by: Lin Lin Pan <tailinpa@amazon.com>
Co-authored-by: Navyadhara Gogineni <navyadha@amazon.com>
Co-authored-by: Yishan McNabb <yishanm@amazon.com>
Co-authored-by: Mrinal Shukla <181322398+mrinalks@users.noreply.github.com>
Signed-off-by: Mu Huai <tianbowen.tbw@antgroup.com>
mawong-amd pushed a commit to ROCm/vllm that referenced this pull request May 14, 2025
zzzyq pushed a commit to zzzyq/vllm that referenced this pull request May 24, 2025
Labels: ci/build, documentation