Running vllm after oneshot causes rerun of oneshot #1358

Open
brian-dellabetta opened this issue Apr 16, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@brian-dellabetta
Collaborator

brian-dellabetta commented Apr 16, 2025

Describe the bug
A Python script that calls oneshot and then runs vLLM (or lm_eval with the "vllm" model backend) causes the oneshot operation to re-run.

Expected behavior
oneshot should run only once.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]: Mac
  2. Python version [e.g. 3.7]: 3.10
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: 0.5.0
  4. ML framework version(s) [e.g. torch 2.3.1]:
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]: vllm 0.8.3
  6. Other relevant environment information [e.g. hardware, CUDA version]:

To Reproduce

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from vllm import LLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-fp8"

# Quantize all Linear layers (except lm_head) to FP8 with dynamic activation scales
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

oneshot(
    model=MODEL_ID,
    recipe=recipe,
    output_dir=OUTPUT_DIR,
)

print("\n Done! model saved to", OUTPUT_DIR, "\n")

# Loading the saved checkpoint with vLLM triggers oneshot to run again
vmodel = LLM(OUTPUT_DIR)
vmodel.generate("The capital of the US is ")

Errors

INFO 04-16 17:36:34 [__init__.py:239] Automatically detected platform cuda.
2025-04-16:17:36:37,873 INFO     [modeling.py:990] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.20s/it]
Repo card metadata block was not found. Setting CardData to empty.
2025-04-16:17:36:43,364 WARNING  [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2025-04-16T17:36:44.931178+0000 | reset | INFO - Compression lifecycle reset
Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.36.44.log
2025-04-16:17:36:44,931 INFO     [logger.py:391] Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.36.44.log
2025-04-16T17:36:44.932209+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-04-16T17:36:45.158579+0000 | _check_calibration_data | INFO - Skipping QuantizationModifier calibration, it is not required for the provided quantization config.
manager stage: Modifiers initialized
2025-04-16T17:36:47.076840+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
manager stage: Modifiers finalized
2025-04-16T17:36:47.078255+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
Checking whether model follows 2:4 sparsity structure: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 225/225 [00:09<00:00, 23.94it/s]
2025-04-16T17:38:14.285532+0000 | get_model_compressor | INFO - Inferring a sparsity configuration requires a global sparsity calculation. This can be costly for large models. To skip the calculation of compression statistics set skip_compression_stats=True
Calculating model sparsity: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 739/739 [00:07<00:00, 96.65it/s]
Quantized Compression: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 739/739 [00:51<00:00, 14.42it/s]

Done! model saved to Meta-Llama-3-8B-Instruct-awq-asym

INFO 04-16 17:39:28 [config.py:602] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
/home/bdellabe/projects/llm-compressor/src/llmcompressor/pytorch/__init__.py:19: UserWarning: torch.compile is not supported by llmcompressor for torch 2.0.x
  warnings.warn(
INFO 04-16 17:39:28 [config.py:1795] Chunked prefill is enabled with max_num_batched_tokens=16384.
WARNING 04-16 17:39:29 [utils.py:2290] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
INFO 04-16 17:39:33 [__init__.py:239] Automatically detected platform cuda.
2025-04-16:17:39:36,386 INFO     [modeling.py:990] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.18s/it]
Repo card metadata block was not found. Setting CardData to empty.
2025-04-16:17:39:41,846 WARNING  [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2025-04-16T17:39:43.573425+0000 | reset | INFO - Compression lifecycle reset
Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.39.43.log
2025-04-16:17:39:43,573 INFO     [logger.py:391] Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.39.43.log
2025-04-16T17:39:43.574796+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-04-16T17:39:43.648922+0000 | _check_calibration_data | INFO - Skipping QuantizationModifier calibration, it is not required for the provided quantization config.
manager stage: Modifiers initialized
2025-04-16T17:39:46.209391+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
manager stage: Modifiers finalized
2025-04-16T17:39:46.210970+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
...(continues running quantization)
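
Note: the vLLM portion of the log includes the warning "We must use the `spawn` multiprocessing start method ... Reason: CUDA is initialized". With the spawn start method, worker processes re-import the parent script as a module, so any module-level code, including the oneshot call, executes again. A minimal, untested sketch of a possible workaround is to move the top-level code behind a __main__ guard so spawned workers skip it (the main() wrapper below is illustrative, not part of the original report):

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from vllm import LLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-fp8"


def main():
    # Runs only in the parent process; spawn-based worker processes re-import
    # this module but do not enter the __main__ guard, so oneshot is not
    # executed a second time.
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    oneshot(model=MODEL_ID, recipe=recipe, output_dir=OUTPUT_DIR)

    vmodel = LLM(OUTPUT_DIR)
    print(vmodel.generate("The capital of the US is "))


if __name__ == "__main__":
    main()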
@brian-dellabetta brian-dellabetta added the bug Something isn't working label Apr 16, 2025
@brian-dellabetta brian-dellabetta changed the title from "Running vllm after oneshot causes rerun" to "Running vllm after oneshot causes rerun of oneshot" Apr 16, 2025