[Gemma3] - oneshot doesn't output preprocessor_config.json & processor_config.json #1305

Open
m4r1k opened this issue Apr 1, 2025 · 1 comment
Assignees: brian-dellabetta
Labels: bug

Comments

m4r1k commented Apr 1, 2025

Describe the bug
When running the FP8 gemma2 example with the gemma-3-27b-it model instead, the subsequent vLLM / lm_eval step fails with OSError: /root/gemma-3-27b-it-FP8-Dynamic does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//root/gemma-3-27b-it-FP8-Dynamic/tree/main' for available files

Expected behavior
To improve UX, I'd expect the provided samples to be self-consistent, without requiring further modifications by end users.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]: Ubuntu 24.04
  2. Python version [e.g. 3.7]: 3.12
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: both 0.4.1 and db91486
  4. ML framework version(s) [e.g. torch 2.3.1]: torch 2.6.0
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]: vllm==0.8.2 and lm_eval==0.4.3 (installed via pip install --upgrade llmcompressor==0.4.1 vllm==0.8.2 lm_eval==0.4.3), plus huggingface_hub==0.30.1 and hf_transfer==0.1.9
  6. Other relevant environment information [e.g. hardware, CUDA version]: GCP a3-highgpu-1g (1x H100), NVIDIA driver 570.86.15, CUDA 12.8

To Reproduce

#!/bin/bash

apt update
apt install -y python3-virtualenv
virtualenv llm
source llm/bin/activate

pip install --upgrade llmcompressor==0.4.1 vllm==0.8.2 lm_eval==0.4.3
pip install --upgrade huggingface_hub[hf_transfer]

export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=<HF Token>
export HF_HOME=/models

mkdir -p /models

_MODEL=google/gemma-3-27b-it
_QUANTIZED_MODEL=/root/gemma-3-27b-it-FP8-Dynamic

huggingface-cli download "${_MODEL}"

# Quantize the model with llmcompressor (gemma3.py, adapted from the gemma2 FP8 example)
export CUDA_VISIBLE_DEVICES=0
python3 gemma3.py

# Smoke-test and evaluate the original model with vLLM / lm_eval
python3 - << EOF
from vllm import LLM
model = LLM("${_MODEL}")
model.generate("Hello my name is")
EOF

lm_eval \
  --model vllm \
  --model_args pretrained=${_MODEL},add_bos_token=True,max_model_len=4096 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

# Smoke-test and evaluate the quantized model (this is where the OSError below occurs)
python3 - << EOF
from vllm import LLM
model = LLM("${_QUANTIZED_MODEL}")
model.generate("Hello my name is")
EOF

lm_eval \
  --model vllm \
  --model_args pretrained=${_QUANTIZED_MODEL},add_bos_token=True,max_model_len=4096 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

Edit: gemma3.py comes from this sample; the only change is setting MODEL_ID to google/gemma-3-27b-it.
Edit 2: ...and replacing from llmcompressor import oneshot with from llmcompressor.transformers import oneshot.
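
For reference, here is a minimal sketch of what that modified gemma3.py looks like, assuming it follows the gemma2 FP8 example's structure (the recipe and save logic below are taken from that example and are an approximation, not the exact script):

# gemma3.py -- sketch, adapted from the gemma2 FP8-Dynamic example
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.transformers import oneshot  # Edit 2: was `from llmcompressor import oneshot`
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-3-27b-it"  # the only change vs. the original sample

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization of all Linear layers, skipping lm_head
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

# Save the compressed model; note that only the tokenizer is saved alongside it,
# so no preprocessor_config.json / processor_config.json gets written.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)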

Errors

INFO 04-01 15:18:25 [__init__.py:239] Automatically detected platform cuda.
INFO 04-01 15:18:31 [config.py:585] This model supports multiple tasks: {'generate', 'embed', 'reward', 'score', 'classify'}. Defaulting to 'generate'.
INFO 04-01 15:18:32 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 04-01 15:18:33 [core.py:54] Initializing a V1 LLM engine (v0.8.2) with config: model='/root/gemma-3-27b-it-FP8-Dynamic', speculative_config=None, tokenizer='/root/gemma-3-27b-it-FP8-Dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/root/gemma-3-27b-it-FP8-Dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-01 15:18:33 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7da8d2e6e360>
INFO 04-01 15:18:34 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-01 15:18:34 [cuda.py:220] Using Flash Attention backend on V1 engine.
ERROR 04-01 15:18:36 [core.py:343] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 335, in run_engine_core
ERROR 04-01 15:18:36 [core.py:343]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-01 15:18:36 [core.py:343]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 290, in __init__
ERROR 04-01 15:18:36 [core.py:343]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 60, in __init__
ERROR 04-01 15:18:36 [core.py:343]     self.model_executor = executor_class(vllm_config)
ERROR 04-01 15:18:36 [core.py:343]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-01 15:18:36 [core.py:343]     self._init_executor()
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 04-01 15:18:36 [core.py:343]     self.collective_rpc("init_device")
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-01 15:18:36 [core.py:343]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-01 15:18:36 [core.py:343]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/utils.py", line 2255, in run_method
ERROR 04-01 15:18:36 [core.py:343]     return func(*args, **kwargs)
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 604, in init_device
ERROR 04-01 15:18:36 [core.py:343]     self.worker.init_device()  # type: ignore
ERROR 04-01 15:18:36 [core.py:343]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 120, in init_device
ERROR 04-01 15:18:36 [core.py:343]     self.model_runner: GPUModelRunner = GPUModelRunner(
ERROR 04-01 15:18:36 [core.py:343]                                         ^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 137, in __init__
ERROR 04-01 15:18:36 [core.py:343]     encoder_compute_budget, encoder_cache_size = compute_encoder_budget(
ERROR 04-01 15:18:36 [core.py:343]                                                  ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/core/encoder_cache_manager.py", line 92, in compute_encoder_budget
ERROR 04-01 15:18:36 [core.py:343]     ) = _compute_encoder_budget_multimodal(model_config, scheduler_config)
ERROR 04-01 15:18:36 [core.py:343]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/core/encoder_cache_manager.py", line 115, in _compute_encoder_budget_multimodal
ERROR 04-01 15:18:36 [core.py:343]     max_tokens_by_modality_dict = MULTIMODAL_REGISTRY.get_max_tokens_per_item_by_nonzero_modality(  # noqa: E501
ERROR 04-01 15:18:36 [core.py:343]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 291, in get_max_tokens_per_item_by_nonzero_modality
ERROR 04-01 15:18:36 [core.py:343]     self.get_max_tokens_per_item_by_modality(model_config).items()
ERROR 04-01 15:18:36 [core.py:343]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 265, in get_max_tokens_per_item_by_modality
ERROR 04-01 15:18:36 [core.py:343]     return processor.info.get_mm_max_tokens_per_item(
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 89, in get_mm_max_tokens_per_item
ERROR 04-01 15:18:36 [core.py:343]     return {"image": self.get_max_image_tokens()}
ERROR 04-01 15:18:36 [core.py:343]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 245, in get_max_image_tokens
ERROR 04-01 15:18:36 [core.py:343]     target_width, target_height = self.get_image_size_with_most_features()
ERROR 04-01 15:18:36 [core.py:343]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 235, in get_image_size_with_most_features
ERROR 04-01 15:18:36 [core.py:343]     processor = self.get_hf_processor()
ERROR 04-01 15:18:36 [core.py:343]                 ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 79, in get_hf_processor
ERROR 04-01 15:18:36 [core.py:343]     return self.ctx.get_hf_processor(Gemma3Processor, **kwargs)
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/inputs/registry.py", line 137, in get_hf_processor
ERROR 04-01 15:18:36 [core.py:343]     return super().get_hf_processor(
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/inputs/registry.py", line 101, in get_hf_processor
ERROR 04-01 15:18:36 [core.py:343]     return cached_processor_from_config(
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/transformers_utils/processor.py", line 106, in cached_processor_from_config
ERROR 04-01 15:18:36 [core.py:343]     return cached_get_processor(
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/transformers_utils/processor.py", line 69, in get_processor
ERROR 04-01 15:18:36 [core.py:343]     processor = processor_factory.from_pretrained(
ERROR 04-01 15:18:36 [core.py:343]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/processing_utils.py", line 1070, in from_pretrained
ERROR 04-01 15:18:36 [core.py:343]     args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/processing_utils.py", line 1134, in _get_arguments_from_pretrained
ERROR 04-01 15:18:36 [core.py:343]     args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
ERROR 04-01 15:18:36 [core.py:343]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 465, in from_pretrained
ERROR 04-01 15:18:36 [core.py:343]     raise initial_exception
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 447, in from_pretrained
ERROR 04-01 15:18:36 [core.py:343]     config_dict, _ = ImageProcessingMixin.get_image_processor_dict(
ERROR 04-01 15:18:36 [core.py:343]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/image_processing_base.py", line 341, in get_image_processor_dict
ERROR 04-01 15:18:36 [core.py:343]     resolved_image_processor_file = cached_file(
ERROR 04-01 15:18:36 [core.py:343]                                     ^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/utils/hub.py", line 266, in cached_file
ERROR 04-01 15:18:36 [core.py:343]     file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/utils/hub.py", line 381, in cached_files
ERROR 04-01 15:18:36 [core.py:343]     raise EnvironmentError(
ERROR 04-01 15:18:36 [core.py:343] OSError: /root/gemma-3-27b-it-FP8-Dynamic does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//root/gemma-3-27b-it-FP8-Dynamic/tree/main' for available files.


m4r1k added the bug label Apr 1, 2025
brian-dellabetta self-assigned this Apr 2, 2025

brian-dellabetta (Collaborator) commented Apr 2, 2025

Hi @m4r1k, I was able to run the example script you pointed to with the smaller google/gemma-3-4b-it without issue. I was also able to run

lm_eval \
  --model vllm \
  --model_args pretrained=/path/to/gemma-3-4b-it-FP8-Dynamic,add_bos_token=True,max_model_len=4096 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

for a couple of different versions of lm_eval (0.4.3, 0.4.5, and 0.4.8), with transformers 4.50.0 and torch 2.6.0, on Python 3.10.12.

Your stack trace includes some calls into multi-modal code in vLLM. That is why it is looking for a preprocessor instead of the tokenizer.json / tokenizer_config.json files that do get saved. We are only using gsm8k in lm_eval, which is a pure language task.

I'm not sure why that is happening for you and not for me. Could the HF upload/download be causing issues? If you just run the script in examples directly, without HF hub interaction, does it still fail?

UPDATE: there is some confusing naming/versioning here -- Gemma 2 models are text-only, while Gemma 3 models are multi-modal, which is a different beast entirely from a causal LM. It's odd, though: I don't see any vision encoder when inspecting google/gemma-3-4b-it, even though, according to the whitepaper, the 4B model is also multi-modal.
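
In the meantime, a possible workaround (just a sketch, assuming the same MODEL_ID and output directory as in the reproduction above; not verified against this exact setup) would be to save the Hugging Face processor files next to the quantized weights, so vLLM's multi-modal path can find them:

# Workaround sketch: copy the processor configs into the quantized model directory.
from transformers import AutoProcessor

MODEL_ID = "google/gemma-3-27b-it"
SAVE_DIR = "/root/gemma-3-27b-it-FP8-Dynamic"

# For multi-modal models, AutoProcessor bundles the tokenizer and image processor;
# save_pretrained writes preprocessor_config.json and processor_config.json.
processor = AutoProcessor.from_pretrained(MODEL_ID)
processor.save_pretrained(SAVE_DIR)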
