
Add decoder custom modeling for inference based on NxD #840


Draft · wants to merge 12 commits into main

Conversation

@dacorvo (Collaborator) commented Apr 30, 2025

What does this PR do?

This adds support for the export and inference of decoder models on top of neuronx-distributed.

For now, only two model architectures are supported: llama and mixtral.

Note that the existing custom modeling for llama on top of TnX is still chosen by default when using NeuronModelForCausalLM.

To export or instantiate a llama model on top of NxD instead, either:

  • use LlamaNxDModelForCausalLM directly,
  • export OPTIMUM_PRIORITIZE_NXD_BACKEND=1, then use NeuronModelForCausalLM as usual (see the sketch below).
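
For illustration, a minimal usage sketch of the second option (the checkpoint name and the compilation arguments are indicative only and may differ from what this PR finally accepts):

```python
import os

# Force the NxD backend before using the generic factory class.
os.environ["OPTIMUM_PRIORITIZE_NXD_BACKEND"] = "1"

from optimum.neuron import NeuronModelForCausalLM

# Export a llama checkpoint with the NxD backend.
model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",  # any supported llama checkpoint
    export=True,
    batch_size=1,
    sequence_length=2048,
)
model.save_pretrained("llama-nxd")
```

The first option would instead import LlamaNxDModelForCausalLM directly; its exact import path is not shown here.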

The basic features are all implemented:

  • export/save/reload,
  • generation with greedy search or multinomial sampling (see the sketch below),
  • transparent local cache,
  • transparent hub cache.
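
A sketch of reload and generation, assuming the artifacts exported above and the standard transformers generation API:

```python
from transformers import AutoTokenizer

from optimum.neuron import NeuronModelForCausalLM

# Reload the exported artifacts from disk (path from the previous sketch).
model = NeuronModelForCausalLM.from_pretrained("llama-nxd")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

inputs = tokenizer("The capital of France is", return_tensors="pt")

# Greedy decoding.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=32)

# Multinomial sampling.
sampled = model.generate(**inputs, do_sample=True, temperature=0.8, max_new_tokens=32)

print(tokenizer.batch_decode(greedy, skip_special_tokens=True))
print(tokenizer.batch_decode(sampled, skip_special_tokens=True))
```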

A cool new feature has also been added: assisted/speculative decoding (see the sketch below).
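
A sketch of what this could look like, assuming the interface mirrors transformers' assisted generation (an assistant_model argument passed to generate); the draft checkpoint path is hypothetical and the actual entry point introduced here may differ:

```python
# A small draft model proposes tokens that the larger target model verifies.
draft = NeuronModelForCausalLM.from_pretrained("llama-draft-nxd")  # hypothetical draft export
target = NeuronModelForCausalLM.from_pretrained("llama-nxd")

outputs = target.generate(
    **inputs,  # tokenized prompt, as in the previous sketch
    assistant_model=draft,
    max_new_tokens=64,
)
```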

What's missing:

  • support for continuous batching,
  • integration in TGI,
  • several modeling optimizations are not working (yet): they will be fixed in individual pull requests or removed.

Performance:

  • inference time seems in line with the numbers obtained using the direct HLO modeling,
  • device memory usage is much higher, quickly leading to saturation as the batch size increases. This might be because some outputs/cached values are stored at a higher precision.

dacorvo added 12 commits April 17, 2025 12:04
It is not strictly necessary to use the GenerationMixin to implement
the generate method, so it is moved to the child class that uses it.
This removes obsolete or redundant attributes and adds explanations
about AutoModel registration for pipelines.
This makes it possible to force the NeuronModelForCausalLM factory methods to
export/load a specific model with the NxD backend even if an
equivalent HLO model exists.
The generate_neff method from torch_neuronx used by the ModelBuilder
class does not support caching.
This patches the method, replacing it with a wrapper that uses the
caching mechanism implemented in the libneuronxla package (see the sketch below).
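
A rough sketch of that wrapper pattern (the cache key and storage here are naive in-memory stand-ins; the real implementation relies on the caching mechanism from libneuronxla):

```python
import functools
import hashlib
import pickle

_neff_cache = {}  # stand-in for the persistent libneuronxla cache


def with_neff_cache(generate_neff):
    """Wrap a generate_neff-like callable so identical compilations are reused."""

    @functools.wraps(generate_neff)
    def wrapper(*args, **kwargs):
        # Hypothetical key: a hash of the (picklable) call arguments.
        key = hashlib.sha256(pickle.dumps((args, sorted(kwargs.items())))).hexdigest()
        if key not in _neff_cache:
            _neff_cache[key] = generate_neff(*args, **kwargs)
        return _neff_cache[key]

    return wrapper

# The patch then boils down to replacing torch_neuronx's generate_neff
# (as used by ModelBuilder) with with_neff_cache(generate_neff).
```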
@dacorvo requested a review from tengomucho on April 30, 2025 13:58
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

NEURON_CONFIG_FILE = "neuron_config.json"


def to_torch_dtype(dtype_str: str) -> torch.dtype:
Collaborator commented:

There are so many more or less identical mappings in Optimum Neuron; couldn't we put them in a utils module that everyone could leverage?
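
For reference, the kind of mapping being discussed, as a sketch (the exact set of accepted dtype strings in the PR may differ):

```python
import torch

_DTYPE_MAP = {
    "float32": torch.float32,
    "fp32": torch.float32,
    "float16": torch.float16,
    "fp16": torch.float16,
    "bfloat16": torch.bfloat16,
    "bf16": torch.bfloat16,
}


def to_torch_dtype(dtype_str: str) -> torch.dtype:
    """Map a dtype string (e.g. read from neuron_config.json) to a torch.dtype."""
    try:
        return _DTYPE_MAP[dtype_str]
    except KeyError:
        raise ValueError(f"Unsupported dtype string: {dtype_str}") from None
```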



@register_neuron_config
class NxDNeuronConfig(NeuronConfig):
Collaborator commented:

Is it the same as the NeuronConfig of NxDI? Do we need all of the following features?

SHARDED_KERNEL = 2


class NeuronAttentionBase(nn.Module):
Collaborator commented:

Maybe add a comment referring to the corresponding part in NxDI, so we won't forget.

_traced_qkv_kernel = nki_jit()(rmsnorm_qkv_isa_kernel)


class GQA(enum.Enum):
Collaborator commented:

same here

weight_cache = {}


def _get_weight_from_state_dict(prefix: str, state_dict: Dict[str, Any]) -> torch.Tensor:
Collaborator commented:

same

import torch


def generate_buckets(min_length: int, max_length: int):
Collaborator commented:

same
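
For context, a common bucketing scheme (powers of two between the bounds); only a sketch, the strategy actually used here may differ:

```python
import math


def generate_buckets(min_length: int, max_length: int):
    """Return sorted power-of-two bucket sizes covering [min_length, max_length]."""
    if min_length >= max_length:
        return [max_length]
    buckets = []
    size = 2 ** math.ceil(math.log2(min_length))
    while size < max_length:
        buckets.append(size)
        size *= 2
    buckets.append(max_length)
    return buckets


# generate_buckets(128, 2048) -> [128, 256, 512, 1024, 2048]
```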

Comment on lines +3 to +12
Flash decoding supports long context inference by reducing KV cache memory. This is done by sharding and distributing
cache storage instead of replicating it on multiple devices (cores).

Flash decoding lives in context of GQA (group query attention). This means it is a feature on top of GQA and not
traditional MHA (multi-head attention). In GQA we replicate the KV cache in the devices within the same KV group.
Now instead of replicating, we shard the KV and distribute them in each device of the group. To accommodate this setup,
we modify the attention computation as below:
1) Gather all query heads in the group,
2) Compute partial softmax on each device,
3) Reduce-scatter in the end to get the complete result.
Collaborator commented:

Suggested change
Flash decoding supports long context inference by reducing KV cache memory. This is done by sharding and distributing
cache storage instead of replicating it on multiple devices (cores).
Flash decoding lives in context of GQA (group query attention). This means it is a feature on top of GQA and not
traditional MHA (multi-head attention). In GQA we replicate the KV cache in the devices within the same KV group.
Now instead of replicating, we shard the KV and distribute them in each device of the group. To accommodate this setup,
we modify the attention computation as below:
1) Gather all query heads in the group,
2) Compute partial softmax on each device,
3) Reduce-scatter in the end to get the complete result.
Flash decoding supports long context inference by reducing KV cache memory. This is done by sharding and distributing
cache storage instead of replicating it on multiple devices (cores).
Flash decoding lives in the context of GQA (group query attention). This means it is a feature on top of GQA and not
traditional MHA (multi-head attention). In GQA, we replicate the KV cache in the devices within the same KV group.
Now, instead of replicating, we shard the KV and distribute them to each device in the group. To accommodate this setup, we modify the attention computation as follows:
1) Gather all query heads in the group,
2) Compute partial softmax on each device,
3) Reduce-scatter in the end to get the complete result.
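
For intuition, the partial softmax plus reduction described in steps 2 and 3 can be checked numerically on a single device (no collectives here; the real kernel shards the KV cache across devices and combines the partial results with a reduce-scatter):

```python
import torch

torch.manual_seed(0)
q = torch.randn(1, 64)    # one query head
k = torch.randn(128, 64)  # full key cache
v = torch.randn(128, 64)  # full value cache

# Reference: standard attention over the full KV cache.
scores = q @ k.T / 64**0.5
ref = torch.softmax(scores, dim=-1) @ v

# Sharded version: per-shard (unnormalized) softmax statistics, then a combine step.
outputs, maxes, sums = [], [], []
for ks, vs in zip(k.chunk(4), v.chunk(4)):
    s = q @ ks.T / 64**0.5
    m = s.max(dim=-1, keepdim=True).values   # per-shard max, for numerical stability
    p = torch.exp(s - m)
    outputs.append(p @ vs)                   # unnormalized partial output
    maxes.append(m)
    sums.append(p.sum(dim=-1, keepdim=True))

m_all = torch.stack(maxes).max(dim=0).values  # global max across shards
scales = [torch.exp(m - m_all) for m in maxes]
num = sum(o * s for o, s in zip(outputs, scales))
den = sum(l * s for l, s in zip(sums, scales))

print(torch.allclose(num / den, ref, atol=1e-5))  # True
```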

Directory to store the cache. If not provided, a default directory will be used.
"""

def generate_neff_with_cache(
Collaborator commented:

That's awesome, did not think of overriding this. Will check how to use it for other traced models :D
