use cpu memory 1.4TB, but use gpu memory 300MB #1240

Open

mmdbhs opened this issue Mar 11, 2025 · 2 comments

Labels: bug (Something isn't working)

mmdbhs commented Mar 11, 2025

Describe the bug
I am trying to quantize DeepSeek-V3-BF16 with the following code:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# NOTE: transformers 4.49.0 has an attribute error with DeepSeek.
# Please consider downgrading to an earlier transformers version or
# upgrading to a version where this bug is fixed.

# select a Mixture of Experts model for quantization
MODEL_ID = "DeepSeek-V3-bf16"

# adjust based on the number of desired GPUs
# if not enough memory is available, some layers will automatically be offloaded to cpu
device_map = calculate_offload_device_map(
    MODEL_ID,
    reserve_for_hessians=True,
    num_gpus=8,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
# it is recommended to use more calibration samples for MoE models so each expert is hit
DATASET_ID = "/kefu-nas/xyb/quantization/dataset/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 2048
MAX_SEQUENCE_LENGTH = 2048


# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# define a llmcompressor recipe for INT8 W8A8 quantization
# since the MoE gate layers are sensitive to quantization, we add them to the ignore
# list so they remain at full precision
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head", "re:.*mlp.gate$"],
    ),
]

SAVE_DIR = "deepseek_v3-W8A8"

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    save_compressed=True,
    output_dir=SAVE_DIR,
)

Even though I have 8 GPUs, the Python process only uses about 300 MB on GPU 0, while it uses about 1.4 TB of CPU memory.
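A quick sanity check here is to look at where calculate_offload_device_map actually placed the modules and where the loaded parameters ended up. Below is a minimal diagnostic sketch, assuming device_map is a plain dict mapping module names to devices and model is the model loaded by the script above:

from collections import Counter

# How many modules the computed device map assigns to each device
# (values are typically integer GPU indices, "cpu", or "disk").
print(Counter(device_map.values()))

# Which devices the parameter tensors actually live on after loading.
# Offloaded parameters may show up as "cpu" (or "meta" when managed by accelerate hooks).
print(Counter(str(p.device) for p in model.parameters()))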

Expected behavior
A clear and concise description of what you expected to happen.

Environment
Include all relevant environment information:

  1. OS [e.g. Ubuntu 20.04]:
  2. Python version [e.g. 3.7]:
  3. LLM Compressor version or commit hash [e.g. 0.1.0, f7245c8]:
  4. ML framework version(s) [e.g. torch 2.3.1]:
  5. Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
  6. Other relevant environment information [e.g. hardware, CUDA version]:

To Reproduce
Exact steps to reproduce the behavior:

Errors
If applicable, add a full print-out of any errors or exceptions that are raised or include screenshots to help explain your problem.

Additional context
Add any other context about the problem here. Also include any relevant files.

@mmdbhs mmdbhs added the bug Something isn't working label Mar 11, 2025
dsikka (Collaborator) commented Mar 11, 2025

Hi @mmdbhs

Do you see the model being loaded onto the remaining 7 GPUs?

@dsikka dsikka self-assigned this Mar 11, 2025
mmdbhs (Author) commented Mar 11, 2025

> Hi @mmdbhs
>
> Do you see the model being loaded onto the remaining 7 GPUs?

Only about 4 MB is used on each of the remaining 7 GPUs.
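One way to check this from inside the process, assuming PyTorch with CUDA available, is to print the per-device memory counters (a minimal sketch, not specific to llmcompressor):

import torch

# Memory that PyTorch has allocated/reserved on each visible GPU in this process.
# If model weights were actually placed on a GPU, allocated memory for that device
# should be far above a few MB.
for i in range(torch.cuda.device_count()):
    allocated_mib = torch.cuda.memory_allocated(i) / 1024**2
    reserved_mib = torch.cuda.memory_reserved(i) / 1024**2
    print(f"cuda:{i}: allocated={allocated_mib:.1f} MiB, reserved={reserved_mib:.1f} MiB")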
