Running vllm after oneshot causes rerun of oneshot #1358

Open
brian-dellabetta opened this issue Apr 16, 2025 · 0 comments
Labels
bug Something isn't working

Comments

@brian-dellabetta
Collaborator

brian-dellabetta commented Apr 16, 2025

Describe the bug
A Python script that calls oneshot and then runs vLLM (or lm_eval with the "vllm" model backend) causes the oneshot operation to re-run.

Expected behavior
oneshot should run only once.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]: Mac
  2. Python version [e.g. 3.7]: 3.10
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]: 0.5.0
  4. ML framework version(s) [e.g. torch 2.3.1]:
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]: vllm 0.8.3
  6. Other relevant environment information [e.g. hardware, CUDA version]:

To Reproduce

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from vllm import LLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-fp8"

# Quantize all Linear layers (except lm_head) to FP8 with dynamic activation scales
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

oneshot(
    model=MODEL_ID,
    recipe=recipe,
    output_dir=OUTPUT_DIR,
)

print("\n Done! model saved to", OUTPUT_DIR, "\n")

# Loading the saved checkpoint with vLLM triggers oneshot to run again
vmodel = LLM(OUTPUT_DIR)
vmodel.generate("The capital of the US is ")

Errors

INFO 04-16 17:36:34 [__init__.py:239] Automatically detected platform cuda.
2025-04-16:17:36:37,873 INFO     [modeling.py:990] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.20s/it]
Repo card metadata block was not found. Setting CardData to empty.
2025-04-16:17:36:43,364 WARNING  [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2025-04-16T17:36:44.931178+0000 | reset | INFO - Compression lifecycle reset
Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.36.44.log
2025-04-16:17:36:44,931 INFO     [logger.py:391] Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.36.44.log
2025-04-16T17:36:44.932209+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-04-16T17:36:45.158579+0000 | _check_calibration_data | INFO - Skipping QuantizationModifier calibration, it is not required for the provided quantization config.
manager stage: Modifiers initialized
2025-04-16T17:36:47.076840+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
manager stage: Modifiers finalized
2025-04-16T17:36:47.078255+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
Checking whether model follows 2:4 sparsity structure: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 225/225 [00:09<00:00, 23.94it/s]
2025-04-16T17:38:14.285532+0000 | get_model_compressor | INFO - Inferring a sparsity configuration requires a global sparsity calculation. This can be costly for large models. To skip the calculation of compression statistics set skip_compression_stats=True
Calculating model sparsity: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 739/739 [00:07<00:00, 96.65it/s]
Quantized Compression: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 739/739 [00:51<00:00, 14.42it/s]

Done! model saved to Meta-Llama-3-8B-Instruct-awq-asym

INFO 04-16 17:39:28 [config.py:602] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
/home/bdellabe/projects/llm-compressor/src/llmcompressor/pytorch/__init__.py:19: UserWarning: torch.compile is not supported by llmcompressor for torch 2.0.x
  warnings.warn(
INFO 04-16 17:39:28 [config.py:1795] Chunked prefill is enabled with max_num_batched_tokens=16384.
WARNING 04-16 17:39:29 [utils.py:2290] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
INFO 04-16 17:39:33 [__init__.py:239] Automatically detected platform cuda.
2025-04-16:17:39:36,386 INFO     [modeling.py:990] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.18s/it]
Repo card metadata block was not found. Setting CardData to empty.
2025-04-16:17:39:41,846 WARNING  [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2025-04-16T17:39:43.573425+0000 | reset | INFO - Compression lifecycle reset
Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.39.43.log
2025-04-16:17:39:43,573 INFO     [logger.py:391] Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.39.43.log
2025-04-16T17:39:43.574796+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-04-16T17:39:43.648922+0000 | _check_calibration_data | INFO - Skipping QuantizationModifier calibration, it is not required for the provided quantization config.
manager stage: Modifiers initialized
2025-04-16T17:39:46.209391+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
manager stage: Modifiers finalized
2025-04-16T17:39:46.210970+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
...(continues running quantization)
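
Note: the vLLM portion of the log includes the warning "We must use the `spawn` multiprocessing start method ... Reason: CUDA is initialized". With the spawn start method, worker processes re-import the parent script as a module, so any module-level code, including the oneshot call, executes again. A minimal, untested sketch of a possible workaround is to move the top-level code behind a __main__ guard so spawned workers skip it (the main() wrapper below is illustrative, not part of the original report):

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from vllm import LLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-fp8"


def main():
    # Runs only in the parent process; spawn-based worker processes re-import
    # this module but do not enter the __main__ guard, so oneshot is not
    # executed a second time.
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    oneshot(model=MODEL_ID, recipe=recipe, output_dir=OUTPUT_DIR)

    vmodel = LLM(OUTPUT_DIR)
    print(vmodel.generate("The capital of the US is "))


if __name__ == "__main__":
    main()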
@brian-dellabetta brian-dellabetta added the bug Something isn't working label Apr 16, 2025
@brian-dellabetta brian-dellabetta changed the title from "Running vllm after oneshot causes rerun" to "Running vllm after oneshot causes rerun of oneshot" Apr 16, 2025