Error when computing device_map for Mistral-Small-3.1-24B-Instruct-2503 #1403

Open
VAmblardPEReN opened this issue Apr 30, 2025 · 0 comments
Labels
bug Something isn't working

Comments


VAmblardPEReN commented Apr 30, 2025

Describe the bug
The method calculate_offload_device_map fails for Mistral-Small-3.1-24B-Instruct-2503 with a ValueError: "Could not find targets ['Mistral3VisionAttention'] in module Mistral3ForConditionalGeneration".
I traced the error back to match_layers_params, which is called with targets='Mistral3VisionAttention'.
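For reference, here is a minimal diagnostic sketch (not from the original report) that builds the model skeleton on the meta device and compares the model's declared no-split module classes with the classes actually instantiated; it assumes the failing target comes from the model's _no_split_modules attribute.

import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForImageTextToText

config = AutoConfig.from_pretrained('mistralai/Mistral-Small-3.1-24B-Instruct-2503')
with init_empty_weights():
    # Instantiate the architecture without downloading or allocating weights
    model = AutoModelForImageTextToText.from_config(config)

present = {m.__class__.__name__ for m in model.modules()}
declared = set(model._no_split_modules or [])
# If the no-split list is indeed the source of the failing target,
# 'Mistral3VisionAttention' should appear in this difference
print('declared but not instantiated:', declared - present)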

Expected behavior
I would expect calculate_offload_device_map to return a device_map.

Environment
Include all relevant environment information:

  1. OS: Debian 12
  2. Python version: 3.11
  3. LLM Compressor version or commit hash: 0.5.1
  4. ML framework version(s): torch 2.7.0+cu118
  5. Other Python package versions: compressed-tensors 0.9.4, transformers 4.51.3
  6. Other relevant environment information: CUDA 11.8

To Reproduce

import torch
from transformers import AutoModelForImageTextToText

from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

model_name = 'mistralai/Mistral-Small-3.1-24B-Instruct-2503'
device_map = calculate_offload_device_map(
    model_name,
    num_gpus=3,
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16,
    model_cls=AutoModelForImageTextToText,
)

Errors

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 5
      2 from transformers import AutoModelForImageTextToText, AutoTokenizer, AutoModelForCausalLM
      3 import torch
----> 5 device_map = calculate_offload_device_map(
      6     model_name, num_gpus=3, reserve_for_hessians=True, torch_dtype=torch.bfloat16, model_cls=AutoModelForImageTextToText
      7 )

File <PATH>/site-packages/llmcompressor/transformers/compression/helpers.py:245, in calculate_offload_device_map(model_stub, reserve_for_hessians, num_gpus, torch_dtype, model_cls, **model_kwargs)
    243 reserved_memory = 0
    244 if reserve_for_hessians:
--> 245     reserved_memory = hessian_memory_requirements(dummy_model)
    246 reserved_memory += quantization_memory_requirement(dummy_model)
    248 memory_limits = {
    249     idx: (max_memory - reserved_memory)
    250     for idx, max_memory in enumerate(max_gpu_memory)
    251 }

File <PATH>/site-packages/llmcompressor/transformers/compression/helpers.py:123, in hessian_memory_requirements(model)
    114 def hessian_memory_requirements(model: torch.nn.Module) -> int:
    115     """
    116     Determines the number of bytes needed to store Hessian data for a single
    117     transformer layer in model. This is used for reserving memory for GPTQ
   (...)    121     :return: number of bytes required to reserve for GPTQ on a single layer
    122     """
--> 123     transformer_layers = get_layers(get_no_split_params(model), model)
    124     total_hessian_elems = {}
    125     max_column_size = {}

File <PATH>/site-packages/llmcompressor/utils/pytorch/module.py:168, in get_layers(targets, module)
    167 def get_layers(targets: Union[str, List[str]], module: Module) -> Dict[str, Module]:
--> 168     return match_layers_params(targets, module)

File <PATH>/site-packages/llmcompressor/utils/pytorch/module.py:162, in match_layers_params(targets, module, params)
    160 missed = [target for found, target in zip(targets_found, targets) if not found]
    161 if len(missed) > 0:
--> 162     raise ValueError(f"Could not find targets {missed} in module {module}")
    164 return resolved

ValueError: Could not find targets ['Mistral3VisionAttention'] in module Mistral3ForConditionalGeneration(
  (vision_tower): PixtralVisionModel(
    (patch_conv): Conv2d(3, 1024, kernel_size=(14, 14), stride=(14, 14), bias=False)
    (ln_pre): PixtralRMSNorm((1024,), eps=1e-05)
    (transformer): PixtralTransformer(
      (layers): ModuleList(
        (0-23): 24 x PixtralAttentionLayer(
          (attention_norm): PixtralRMSNorm((1024,), eps=1e-05)
          (feed_forward): PixtralMLP(
            (gate_proj): Linear(in_features=1024, out_features=4096, bias=False)
            (up_proj): Linear(in_features=1024, out_features=4096, bias=False)
            (down_proj): Linear(in_features=4096, out_features=1024, bias=False)
            (act_fn): GELUActivation()
          )
          (attention): PixtralAttention(
            (k_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (v_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (q_proj): Linear(in_features=1024, out_features=1024, bias=False)
            (o_proj): Linear(in_features=1024, out_features=1024, bias=False)
          )
          (ffn_norm): PixtralRMSNorm((1024,), eps=1e-05)
        )
      )
    )
    (patch_positional_embedding): PixtralRotaryEmbedding()
  )
  (multi_modal_projector): Mistral3MultiModalProjector(
    (norm): Mistral3RMSNorm((1024,), eps=1e-06)
    (patch_merger): Mistral3PatchMerger(
      (merging_layer): Linear(in_features=4096, out_features=1024, bias=False)
    )
    (linear_1): Linear(in_features=1024, out_features=5120, bias=False)
    (act): GELUActivation()
    (linear_2): Linear(in_features=5120, out_features=5120, bias=False)
  )
  (language_model): MistralForCausalLM(
    (model): MistralModel(
      (embed_tokens): Embedding(131072, 5120)
      (layers): ModuleList(
        (0-39): 40 x MistralDecoderLayer(
          (self_attn): MistralAttention(
            (q_proj): Linear(in_features=5120, out_features=4096, bias=False)
            (k_proj): Linear(in_features=5120, out_features=1024, bias=False)
            (v_proj): Linear(in_features=5120, out_features=1024, bias=False)
            (o_proj): Linear(in_features=4096, out_features=5120, bias=False)
          )
          (mlp): MistralMLP(
            (gate_proj): Linear(in_features=5120, out_features=32768, bias=False)
            (up_proj): Linear(in_features=5120, out_features=32768, bias=False)
            (down_proj): Linear(in_features=32768, out_features=5120, bias=False)
            (act_fn): SiLU()
          )
          (input_layernorm): MistralRMSNorm((5120,), eps=1e-05)
          (post_attention_layernorm): MistralRMSNorm((5120,), eps=1e-05)
        )
      )
      (norm): MistralRMSNorm((5120,), eps=1e-05)
      (rotary_emb): MistralRotaryEmbedding()
    )
    (lm_head): Linear(in_features=5120, out_features=131072, bias=False)
  )
)
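As a possible interim workaround (a sketch under assumptions, not a fix for the underlying bug), a device_map can be built directly with accelerate's infer_auto_device_map, bypassing the hessian_memory_requirements step that raises above. The memory budgets below are illustrative placeholders, and the no-split classes are taken from the module dump above rather than from llmcompressor.

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForImageTextToText

config = AutoConfig.from_pretrained('mistralai/Mistral-Small-3.1-24B-Instruct-2503')
with init_empty_weights():
    # Meta-device skeleton, no weights downloaded or allocated
    model = AutoModelForImageTextToText.from_config(config, torch_dtype=torch.bfloat16)

device_map = infer_auto_device_map(
    model,
    # Placeholder budgets; replace with actual per-GPU memory minus whatever
    # needs to be reserved for GPTQ/Hessian data
    max_memory={0: '70GiB', 1: '70GiB', 2: '70GiB', 'cpu': '200GiB'},
    # Module classes that actually exist in this checkpoint (see dump above)
    no_split_module_classes=['MistralDecoderLayer', 'PixtralAttentionLayer'],
    dtype=torch.bfloat16,
)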
VAmblardPEReN added the bug label Apr 30, 2025