**Describe the bug**
A Python script that calls `oneshot` and then runs `vllm` (or `lm_eval` in `"vllm"` model mode) causes the `oneshot` operation to re-run.
**Expected behavior**
`oneshot` should only be run once.
**Environment**
Include all relevant environment information:
1. LLM Compressor version or commit hash [e.g. 0.1.0, `f7245c8`]: 0.5.0
**To Reproduce**
```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from vllm import LLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-fp8"

recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

oneshot(
    model=MODEL_ID,
    recipe=recipe,
    output_dir=OUTPUT_DIR,
)
print("\n Done! model saved to", OUTPUT_DIR, "\n")

vmodel = LLM(OUTPUT_DIR)
vmodel.generate("The capital of the US is ")
```
**Errors**
```
INFO 04-16 17:36:34 [__init__.py:239] Automatically detected platform cuda.
2025-04-16:17:36:37,873 INFO [modeling.py:990] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|████████████████████████| 4/4 [00:04<00:00, 1.20s/it]
Repo card metadata block was not found. Setting CardData to empty.
2025-04-16:17:36:43,364 WARNING [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2025-04-16T17:36:44.931178+0000 | reset | INFO - Compression lifecycle reset
Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.36.44.log
2025-04-16:17:36:44,931 INFO [logger.py:391] Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.36.44.log
2025-04-16T17:36:44.932209+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-04-16T17:36:45.158579+0000 | _check_calibration_data | INFO - Skipping QuantizationModifier calibration, it is not required for the provided quantization config.
manager stage: Modifiers initialized
2025-04-16T17:36:47.076840+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
manager stage: Modifiers finalized
2025-04-16T17:36:47.078255+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
Checking whether model follows 2:4 sparsity structure: 100%|████████████████████████| 225/225 [00:09<00:00, 23.94it/s]
2025-04-16T17:38:14.285532+0000 | get_model_compressor | INFO - Inferring a sparsity configuration requires a global sparsity calculation. This can be costly for large models. To skip the calculation of compression statistics set skip_compression_stats=True
Calculating model sparsity: 100%|████████████████████████| 739/739 [00:07<00:00, 96.65it/s]
Quantized Compression: 100%|████████████████████████| 739/739 [00:51<00:00, 14.42it/s]

 Done! model saved to Meta-Llama-3-8B-Instruct-awq-asym

INFO 04-16 17:39:28 [config.py:602] This model supports multiple tasks: {'generate', 'reward', 'score', 'embed', 'classify'}. Defaulting to 'generate'.
/home/bdellabe/projects/llm-compressor/src/llmcompressor/pytorch/__init__.py:19: UserWarning: torch.compile is not supported by llmcompressor for torch 2.0.x
  warnings.warn(
INFO 04-16 17:39:28 [config.py:1795] Chunked prefill is enabled with max_num_batched_tokens=16384.
WARNING 04-16 17:39:29 [utils.py:2290] We must use the `spawn` multiprocessing start method. Overriding VLLM_WORKER_MULTIPROC_METHOD to 'spawn'. See https://docs.vllm.ai/en/latest/getting_started/troubleshooting.html#python-multiprocessing for more information. Reason: CUDA is initialized
INFO 04-16 17:39:33 [__init__.py:239] Automatically detected platform cuda.
2025-04-16:17:39:36,386 INFO [modeling.py:990] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Loading checkpoint shards: 100%|████████████████████████| 4/4 [00:04<00:00, 1.18s/it]
Repo card metadata block was not found. Setting CardData to empty.
2025-04-16:17:39:41,846 WARNING [repocard.py:108] Repo card metadata block was not found. Setting CardData to empty.
2025-04-16T17:39:43.573425+0000 | reset | INFO - Compression lifecycle reset
Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.39.43.log
2025-04-16:17:39:43,573 INFO [logger.py:391] Logging all LLM Compressor modifier-level logs to sparse_logs/16-04-2025_17.39.43.log
2025-04-16T17:39:43.574796+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-04-16T17:39:43.648922+0000 | _check_calibration_data | INFO - Skipping QuantizationModifier calibration, it is not required for the provided quantization config.
manager stage: Modifiers initialized
2025-04-16T17:39:46.209391+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
manager stage: Modifiers finalized
2025-04-16T17:39:46.210970+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
...(continues running quantization)
```
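The `spawn` warning in the log above suggests a possible mechanism (an assumption on my part, not a confirmed diagnosis): once vLLM overrides the start method to `spawn`, worker processes re-import the main module, and any top-level code, including the `oneshot` call, runs again in each worker. A minimal sketch of a workaround under that assumption is to move the script body behind a `__main__` guard:

```python
# Sketch of a possible workaround, assuming the re-run is caused by vLLM's
# 'spawn' start method re-importing the main module (unconfirmed).
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from vllm import LLM

MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-fp8"


def main():
    recipe = QuantizationModifier(
        targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
    )
    oneshot(model=MODEL_ID, recipe=recipe, output_dir=OUTPUT_DIR)
    print("\n Done! model saved to", OUTPUT_DIR, "\n")

    vmodel = LLM(OUTPUT_DIR)
    vmodel.generate("The capital of the US is ")


if __name__ == "__main__":
    # Spawned workers re-import this file but do not re-enter main(),
    # so oneshot should run only once.
    main()
```

If the guarded version still re-runs `oneshot`, that would rule this explanation out.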