Describe the bug
When running the fp8 gemma2 example with the gemma-3-27b-it model instead, the subsequent vLLM / lm_eval step fails with OSError: /root/gemma-3-27b-it-FP8-Dynamic does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//root/gemma-3-27b-it-FP8-Dynamic/tree/main' for available files
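For reference, the failure can be reduced to a plain vLLM load of the saved checkpoint. This is an assumed minimal reconstruction (the actual failure happens inside lm_eval's vLLM engine init, as the traceback below shows), not the exact call I ran:

```python
from vllm import LLM

# Assumed minimal reproduction: Gemma 3 registers as a multi-modal model in
# vLLM, so engine init looks up the HF processor, which requires
# preprocessor_config.json -- a file the example script never saves.
llm = LLM(model="/root/gemma-3-27b-it-FP8-Dynamic")
```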
Expected behavior
To improve UX, I'd expect the provided examples to be self-consistent, without further modification by end users.
Environment
OS: Ubuntu 24.04
Python version: 3.12
LLM Compressor version or commit hash: both 0.4.1 and db91486
ML framework version(s): torch 2.6.0
Other Python package versions: pip install --upgrade llmcompressor==0.4.1 vllm==0.8.2 lm_eval==0.4.3, as well as huggingface_hub==0.30.1 and hf_transfer==0.1.9
Other relevant environment information: GCP a3-highgpu-1g, 1x H100, driver 570.86.15, CUDA 12.8
To Reproduce
Edit: the gemma3.py script comes from this sample; the only change is MODEL_ID set to google/gemma-3-27b-it.
Edit2: and replacing from llmcompressor import oneshot with from llmcompressor.transformers import oneshot. A hedged reconstruction of the resulting script is sketched below.
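The reconstruction follows the structure of the gemma2 FP8 example with the two edits above applied; exact recipe details may differ from the shipped sample:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot  # Edit2: moved import

MODEL_ID = "google/gemma-3-27b-it"  # Edit: the only change vs. the gemma2 example

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 dynamic quantization of all Linear layers except the LM head
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)

# Note: only the tokenizer is saved -- no preprocessor_config.json,
# which is what the subsequent vllm / lm_eval step trips over.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```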
Errors
INFO 04-01 15:18:25 [__init__.py:239] Automatically detected platform cuda.
INFO 04-01 15:18:31 [config.py:585] This model supports multiple tasks: {'generate', 'embed', 'reward', 'score', 'classify'}. Defaulting to 'generate'.
INFO 04-01 15:18:32 [config.py:1697] Chunked prefill is enabled with max_num_batched_tokens=16384.
INFO 04-01 15:18:33 [core.py:54] Initializing a V1 LLM engine (v0.8.2) with config: model='/root/gemma-3-27b-it-FP8-Dynamic', speculative_config=None, tokenizer='/root/gemma-3-27b-it-FP8-Dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=/root/gemma-3-27b-it-FP8-Dynamic, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"level":3,"custom_ops":["none"],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":512}
WARNING 04-01 15:18:33 [utils.py:2321] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7da8d2e6e360>
INFO 04-01 15:18:34 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-01 15:18:34 [cuda.py:220] Using Flash Attention backend on V1 engine.
ERROR 04-01 15:18:36 [core.py:343] EngineCore hit an exception: Traceback (most recent call last):
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 335, in run_engine_core
ERROR 04-01 15:18:36 [core.py:343] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 290, in __init__
ERROR 04-01 15:18:36 [core.py:343] super().__init__(vllm_config, executor_class, log_stats)
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 60, in __init__
ERROR 04-01 15:18:36 [core.py:343] self.model_executor = executor_class(vllm_config)
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 04-01 15:18:36 [core.py:343] self._init_executor()
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 46, in _init_executor
ERROR 04-01 15:18:36 [core.py:343] self.collective_rpc("init_device")
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
ERROR 04-01 15:18:36 [core.py:343] answer = run_method(self.driver_worker, method, args, kwargs)
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/utils.py", line 2255, in run_method
ERROR 04-01 15:18:36 [core.py:343] return func(*args, **kwargs)
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/worker/worker_base.py", line 604, in init_device
ERROR 04-01 15:18:36 [core.py:343] self.worker.init_device() # type: ignore
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 120, in init_device
ERROR 04-01 15:18:36 [core.py:343] self.model_runner: GPUModelRunner = GPUModelRunner(
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 137, in __init__
ERROR 04-01 15:18:36 [core.py:343] encoder_compute_budget, encoder_cache_size = compute_encoder_budget(
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/v1/core/encoder_cache_manager.py", line 92, in compute_encoder_budget
ERROR 04-01 15:18:36 [core.py:343] ) = _compute_encoder_budget_multimodal(model_config, scheduler_config)
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/v1/core/encoder_cache_manager.py", line 115, in _compute_encoder_budget_multimodal
ERROR 04-01 15:18:36 [core.py:343] max_tokens_by_modality_dict = MULTIMODAL_REGISTRY.get_max_tokens_per_item_by_nonzero_modality( # noqa: E501
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 291, in get_max_tokens_per_item_by_nonzero_modality
ERROR 04-01 15:18:36 [core.py:343] self.get_max_tokens_per_item_by_modality(model_config).items()
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/multimodal/registry.py", line 265, in get_max_tokens_per_item_by_modality
ERROR 04-01 15:18:36 [core.py:343] return processor.info.get_mm_max_tokens_per_item(
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 89, in get_mm_max_tokens_per_item
ERROR 04-01 15:18:36 [core.py:343] return {"image": self.get_max_image_tokens()}
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 245, in get_max_image_tokens
ERROR 04-01 15:18:36 [core.py:343] target_width, target_height = self.get_image_size_with_most_features()
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 235, in get_image_size_with_most_features
ERROR 04-01 15:18:36 [core.py:343] processor = self.get_hf_processor()
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/model_executor/models/gemma3_mm.py", line 79, in get_hf_processor
ERROR 04-01 15:18:36 [core.py:343] return self.ctx.get_hf_processor(Gemma3Processor, **kwargs)
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/inputs/registry.py", line 137, in get_hf_processor
ERROR 04-01 15:18:36 [core.py:343] return super().get_hf_processor(
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/inputs/registry.py", line 101, in get_hf_processor
ERROR 04-01 15:18:36 [core.py:343] return cached_processor_from_config(
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/transformers_utils/processor.py", line 106, in cached_processor_from_config
ERROR 04-01 15:18:36 [core.py:343] return cached_get_processor(
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/vllm/transformers_utils/processor.py", line 69, in get_processor
ERROR 04-01 15:18:36 [core.py:343] processor = processor_factory.from_pretrained(
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/transformers/processing_utils.py", line 1070, in from_pretrained
ERROR 04-01 15:18:36 [core.py:343] args = cls._get_arguments_from_pretrained(pretrained_model_name_or_path, **kwargs)
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/transformers/processing_utils.py", line 1134, in _get_arguments_from_pretrained
ERROR 04-01 15:18:36 [core.py:343] args.append(attribute_class.from_pretrained(pretrained_model_name_or_path, **kwargs))
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 465, in from_pretrained
ERROR 04-01 15:18:36 [core.py:343] raise initial_exception
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py", line 447, in from_pretrained
ERROR 04-01 15:18:36 [core.py:343] config_dict, _ = ImageProcessingMixin.get_image_processor_dict(
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/transformers/image_processing_base.py", line 341, in get_image_processor_dict
ERROR 04-01 15:18:36 [core.py:343] resolved_image_processor_file = cached_file(
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/transformers/utils/hub.py", line 266, in cached_file
ERROR 04-01 15:18:36 [core.py:343] file = cached_files(path_or_repo_id=path_or_repo_id, filenames=[filename], **kwargs)
ERROR 04-01 15:18:36 [core.py:343] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 04-01 15:18:36 [core.py:343] File "/root/llm/lib/python3.12/site-packages/transformers/utils/hub.py", line 381, in cached_files
ERROR 04-01 15:18:36 [core.py:343] raise EnvironmentError(
ERROR 04-01 15:18:36 [core.py:343] OSError: /root/gemma-3-27b-it-FP8-Dynamic does not appear to have a file named preprocessor_config.json. Checkout 'https://huggingface.co//root/gemma-3-27b-it-FP8-Dynamic/tree/main' for available files.
I was unable to reproduce this for a couple of different versions of lm_eval (0.4.3, 0.4.5, and 0.4.8), transformers 4.50.0, and torch 2.6.0, on Python 3.10.12.
Your stack trace includes some calls into multi-modal code in vLLM. That is why it is looking for a preprocessor instead of the tokenizer.json / tokenizer_config.json files that do get saved. We are only using gsm8k in lm_eval, a pure language task.
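If that is the cause, one untested workaround would be to save the full processor alongside the compressed weights, so that preprocessor_config.json ends up in the output directory. A sketch, assuming the standard transformers API:

```python
from transformers import AutoProcessor

# Hypothetical fix: AutoProcessor bundles the tokenizer and the image
# processor, so save_pretrained() also writes preprocessor_config.json
# into the quantized model directory.
processor = AutoProcessor.from_pretrained("google/gemma-3-27b-it")
processor.save_pretrained("/root/gemma-3-27b-it-FP8-Dynamic")
```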
I'm not sure why that is happening for you and not for me. Could the HF upload/download be causing issues? If you just run the script in examples directly, without HF hub interaction, does it still fail?
UPDATE: some confusing naming/versioning -- Gemma 2 models are purely text, while Gemma 3 models are multi-modal, which is a different beast entirely from a causal LM. It's weird, though: I don't see any vision encoder when inspecting google/gemma-3-4b-it, yet according to the whitepaper the 4B is also multi-modal.
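A quick way to check whether a checkpoint is registered as multi-modal is to inspect its config; a sketch, assuming the standard HF config layout for Gemma 3:

```python
from transformers import AutoConfig

# Multi-modal Gemma 3 checkpoints expose a vision_config and register
# the conditional-generation architecture rather than a plain causal LM.
cfg = AutoConfig.from_pretrained("google/gemma-3-4b-it")
print(cfg.architectures)              # e.g. ['Gemma3ForConditionalGeneration']
print(hasattr(cfg, "vision_config"))  # True for the multi-modal variants
```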