[Gemma3] - oneshot doesn't output preprocessor_config.json & processor_config.json #1305

Open
m4r1k opened this issue Apr 1, 2025 · 1 comment
Assignees: brian-dellabetta
Labels: bug

Comments

m4r1k commented Apr 1, 2025

Describe the bug
When running the FP8 gemma2 example with the gemma-3-27b-it model instead, the subsequent vLLM / lm_eval step fails with OSError: /root/gemma-3-27b-it-FP8-Dynamic does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//root/gemma-3-27b-it-FP8-Dynamic/tree/main' for available files

Expected behavior
To improve UX, I'd expect the provided samples to be self-consistent, without requiring further modifications by end users.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]: Ubuntu 24.04
  2. Python version [e.g. 3.7]: 3.12
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: both 0.4.1 and db91486
  4. ML framework version(s) [e.g. torch 2.3.1]: torch 2.6.0
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]: vllm==0.8.2 and lm_eval==0.4.3 (installed via pip install --upgrade llmcompressor==0.4.1 vllm==0.8.2 lm_eval==0.4.3), plus huggingface_hub==0.30.1 and hf_transfer==0.1.9
  6. Other relevant environment information [e.g. hardware, CUDA version]: GCP a3-highgpu-1g (1x H100), NVIDIA driver 570.86.15, CUDA 12.8

To Reproduce

#!/bin/bash

apt update
apt install -y python3-virtualenv
virtualenv llm
source llm/bin/activate

pip install --upgrade llmcompressor==0.4.1 vllm==0.8.2 lm_eval==0.4.3
pip install --upgrade huggingface_hub[hf_transfer]

export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=<HF Token>
export HF_HOME=/models

mkdir -p /models

_MODEL=google/gemma-3-27b-it
_QUANTIZED_MODEL=/root/gemma-3-27b-it-FP8-Dynamic

huggingface-cli download "${_MODEL}"

# Quantize the model with llmcompressor (gemma3.py, adapted from the gemma2 FP8 example)
export CUDA_VISIBLE_DEVICES=0
python3 gemma3.py

# Smoke-test and evaluate the original model with vLLM / lm_eval
python3 - << EOF
from vllm import LLM
model = LLM("${_MODEL}")
model.generate("Hello my name is")
EOF

lm_eval \
  --model vllm \
  --model_args pretrained=${_MODEL},add_bos_token=True,max_model_len=4096 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

# Smoke-test and evaluate the quantized model (this is where the OSError below occurs)
python3 - << EOF
from vllm import LLM
model = LLM("${_QUANTIZED_MODEL}")
model.generate("Hello my name is")
EOF

lm_eval \
  --model vllm \
  --model_args pretrained=${_QUANTIZED_MODEL},add_bos_token=True,max_model_len=4096 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

Edit: gemma3.py comes from this sample; the only change is setting MODEL_ID to google/gemma-3-27b-it.
Edit 2: ...and replacing from llmcompressor import oneshot with from llmcompressor.transformers import oneshot.
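
For reference, here is a minimal sketch of what that modified gemma3.py looks like, assuming it follows the gemma2 FP8 example's structure (the recipe and save logic below are taken from that example and are an approximation, not the exact script):

# gemma3.py -- sketch, adapted from the gemma2 FP8-Dynamic example
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.transformers import oneshot  # Edit 2: was `from llmcompressor import oneshot`
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-3-27b-it"  # the only change vs. the original sample

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization of all Linear layers, skipping lm_head
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

# Save the compressed model; note that only the tokenizer is saved alongside it,
# so no preprocessor_config.json / processor_config.json gets written.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)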

Errors

INFO 04-01 15:18:25 [__init__.py:239] Automatically detected platform cuda.
INFO 04-01 15:18:31 [config.py:585] This model supports multiple tasks: {'generate', 'embed', 'reward', 'score', 'classify'}. Defaulting to 'generate'.
INFO 04-01 15:18:32 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 04-01 15:18:33 [core.py:54] Initializing a V1 LLM engine (v0.8.2) with config: model='/root/gemma-3-27b-it-FP8-Dynamic', speculative_config=None, tokenizer='/root/gemma-3-27b-it-FP8-Dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/root/gemma-3-27b-it-FP8-Dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-01 15:18:33 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7da8d2e6e360>
INFO 04-01 15:18:34 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-01 15:18:34 [cuda.py:220] Using Flash Attention backend on V1 engine.
ERROR 04-01 15:18:36 [core.py:343] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 335, in run_engine_core
ERROR 04-01 15:18:36 [core.py:343]     engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-01 15:18:36 [core.py:343]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 290, in __init__
ERROR 04-01 15:18:36 [core.py:343]     super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 60, in __init__
ERROR 04-01 15:18:36 [core.py:343]     self.model_executor = executor_class(vllm_config)
ERROR 04-01 15:18:36 [core.py:343]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-01 15:18:36 [core.py:343]     self._init_executor()
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 04-01 15:18:36 [core.py:343]     self.collective_rpc("init_device")
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-01 15:18:36 [core.py:343]     answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-01 15:18:36 [core.py:343]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/utils.py", line 2255, in run_method
ERROR 04-01 15:18:36 [core.py:343]     return func(*args, **kwargs)
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 604, in init_device
ERROR 04-01 15:18:36 [core.py:343]     self.worker.init_device()  # type: ignore
ERROR 04-01 15:18:36 [core.py:343]     ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 120, in init_device
ERROR 04-01 15:18:36 [core.py:343]     self.model_runner: GPUModelRunner = GPUModelRunner(
ERROR 04-01 15:18:36 [core.py:343]                                         ^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 137, in __init__
ERROR 04-01 15:18:36 [core.py:343]     encoder_compute_budget, encoder_cache_size = compute_encoder_budget(
ERROR 04-01 15:18:36 [core.py:343]                                                  ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/core/encoder_cache_manager.py", line 92, in compute_encoder_budget
ERROR 04-01 15:18:36 [core.py:343]     ) = _compute_encoder_budget_multimodal(model_config, scheduler_config)
ERROR 04-01 15:18:36 [core.py:343]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/v1/core/encoder_cache_manager.py", line 115, in _compute_encoder_budget_multimodal
ERROR 04-01 15:18:36 [core.py:343]     max_tokens_by_modality_dict = MULTIMODAL_REGISTRY.get_max_tokens_per_item_by_nonzero_modality(  # noqa: E501
ERROR 04-01 15:18:36 [core.py:343]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 291, in get_max_tokens_per_item_by_nonzero_modality
ERROR 04-01 15:18:36 [core.py:343]     self.get_max_tokens_per_item_by_modality(model_config).items()
ERROR 04-01 15:18:36 [core.py:343]     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 265, in get_max_tokens_per_item_by_modality
ERROR 04-01 15:18:36 [core.py:343]     return processor.info.get_mm_max_tokens_per_item(
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 89, in get_mm_max_tokens_per_item
ERROR 04-01 15:18:36 [core.py:343]     return {"image": self.get_max_image_tokens()}
ERROR 04-01 15:18:36 [core.py:343]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 245, in get_max_image_tokens
ERROR 04-01 15:18:36 [core.py:343]     target_width, target_height = self.get_image_size_with_most_features()
ERROR 04-01 15:18:36 [core.py:343]                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 235, in get_image_size_with_most_features
ERROR 04-01 15:18:36 [core.py:343]     processor = self.get_hf_processor()
ERROR 04-01 15:18:36 [core.py:343]                 ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 79, in get_hf_processor
ERROR 04-01 15:18:36 [core.py:343]     return self.ctx.get_hf_processor(Gemma3Processor, **kwargs)
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/inputs/registry.py", line 137, in get_hf_processor
ERROR 04-01 15:18:36 [core.py:343]     return super().get_hf_processor(
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/inputs/registry.py", line 101, in get_hf_processor
ERROR 04-01 15:18:36 [core.py:343]     return cached_processor_from_config(
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/transformers_utils/processor.py", line 106, in cached_processor_from_config
ERROR 04-01 15:18:36 [core.py:343]     return cached_get_processor(
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/vllm/transformers_utils/processor.py", line 69, in get_processor
ERROR 04-01 15:18:36 [core.py:343]     processor = processor_factory.from_pretrained(
ERROR 04-01 15:18:36 [core.py:343]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/processing_utils.py", line 1070, in from_pretrained
ERROR 04-01 15:18:36 [core.py:343]     args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/processing_utils.py", line 1134, in _get_arguments_from_pretrained
ERROR 04-01 15:18:36 [core.py:343]     args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
ERROR 04-01 15:18:36 [core.py:343]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 465, in from_pretrained
ERROR 04-01 15:18:36 [core.py:343]     raise initial_exception
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 447, in from_pretrained
ERROR 04-01 15:18:36 [core.py:343]     config_dict, _ = ImageProcessingMixin.get_image_processor_dict(
ERROR 04-01 15:18:36 [core.py:343]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/image_processing_base.py", line 341, in get_image_processor_dict
ERROR 04-01 15:18:36 [core.py:343]     resolved_image_processor_file = cached_file(
ERROR 04-01 15:18:36 [core.py:343]                                     ^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/utils/hub.py", line 266, in cached_file
ERROR 04-01 15:18:36 [core.py:343]     file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
ERROR 04-01 15:18:36 [core.py:343]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343]   File "/root/llm/lib/python3.12/site-packages/transformers/utils/hub.py", line 381, in cached_files
ERROR 04-01 15:18:36 [core.py:343]     raise EnvironmentError(
ERROR 04-01 15:18:36 [core.py:343] OSError: /root/gemma-3-27b-it-FP8-Dynamic does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//root/gemma-3-27b-it-FP8-Dynamic/tree/main' for available files.


m4r1k added the bug label Apr 1, 2025
brian-dellabetta self-assigned this Apr 2, 2025

brian-dellabetta (Collaborator) commented Apr 2, 2025

Hi @m4r1k, I was able to run the example script you pointed to with the smaller google/gemma-3-4b-it without issue. I was also able to run

lm_eval \
  --model vllm \
  --model_args pretrained=/path/to/gemma-3-4b-it-FP8-Dynamic,add_bos_token=True,max_model_len=4096 \
  --tasks gsm8k --num_fewshot 5 --batch_size auto --limit 250

for a couple of different versions of lm_eval (0.4.3, 0.4.5, and 0.4.8), with transformers 4.50.0 and torch 2.6.0, on Python 3.10.12.

Your stack trace includes some calls into multi-modal code in vLLM. That is why it is looking for a preprocessor instead of the tokenizer.json / tokenizer_config.json files that do get saved. We are only using gsm8k in lm_eval, which is a pure language task.

I'm not sure why that is happening for you and not for me. Could the HF upload/download be causing issues? If you just run the script in examples directly, without HF hub interaction, does it still fail?

UPDATE: there is some confusing naming/versioning here -- Gemma 2 models are text-only, while Gemma 3 models are multi-modal, which is a different beast entirely from a causal LM. It's odd, though: I don't see any vision encoder when inspecting google/gemma-3-4b-it, even though, according to the whitepaper, the 4B model is also multi-modal.
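
In the meantime, a possible workaround (just a sketch, assuming the same MODEL_ID and output directory as in the reproduction above; not verified against this exact setup) would be to save the Hugging Face processor files next to the quantized weights, so vLLM's multi-modal path can find them:

# Workaround sketch: copy the processor configs into the quantized model directory.
from transformers import AutoProcessor

MODEL_ID = "google/gemma-3-27b-it"
SAVE_DIR = "/root/gemma-3-27b-it-FP8-Dynamic"

# For multi-modal models, AutoProcessor bundles the tokenizer and image processor;
# save_pretrained writes preprocessor_config.json and processor_config.json.
processor = AutoProcessor.from_pretrained(MODEL_ID)
processor.save_pretrained(SAVE_DIR)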
