Error after loading deepseekv3_cpu #707

Open
Tortoise17 opened this issue Feb 7, 2025 · 3 comments

Comments

@Tortoise17 commented Feb 7, 2025

I am trying to load the DeepSeek_V3 model for inference, but after the model loads, I hit the error below.

The environment packages are listed below:


# pip list
Package                          Version
-------------------------------- -----------
absl-py                          2.1.0
accelerate                       1.3.0
aiohappyeyeballs                 2.4.4
aiohttp                          3.11.12
aiosignal                        1.3.2
async-timeout                    5.0.1
attrs                            25.1.0
auto_round                       0.4.5
autoawq                          0.2.8
autoawq_kernels                  0.0.9
certifi                          2025.1.31
chardet                          5.2.0
charset-normalizer               3.4.1
click                            8.1.8
colorama                         0.4.6
contourpy                        1.3.1
cycler                           0.12.1
DataProperty                     1.1.0
datasets                         3.2.0
Deprecated                       1.2.18
dill                             0.3.8
einops                           0.8.0
evaluate                         0.4.3
filelock                         3.17.0
flash-attn                       2.7.3
fonttools                        4.55.8
frozenlist                       1.5.0
fsspec                           2024.9.0
huggingface-hub                  0.28.1
idna                             3.10
intel_extension_for_pytorch      2.5.0
intel-extension-for-transformers 1.4.2
Jinja2                           3.1.5
joblib                           1.4.2
jsonlines                        4.0.0
kiwisolver                       1.4.8
llvmlite                         0.44.0
lm_eval                          0.4.7
lxml                             5.3.0
MarkupSafe                       3.0.2
matplotlib                       3.10.0
mbstrdecoder                     1.1.4
more-itertools                   10.6.0
mpmath                           1.3.0
multidict                        6.1.0
multiprocess                     0.70.16
networkx                         3.4.2
neural_compressor                3.2
nltk                             3.9.1
numba                            0.61.0
numexpr                          2.10.2
numpy                            1.26.4
nvidia-cublas-cu12               12.4.5.8
nvidia-cuda-cupti-cu12           12.4.127
nvidia-cuda-nvrtc-cu12           12.4.127
nvidia-cuda-runtime-cu12         12.4.127
nvidia-cudnn-cu12                9.1.0.70
nvidia-cufft-cu12                11.2.1.3
nvidia-curand-cu12               10.3.5.147
nvidia-cusolver-cu12             11.6.1.9
nvidia-cusparse-cu12             12.3.1.170
nvidia-cusparselt-cu12           0.6.2
nvidia-nccl-cu12                 2.21.5
nvidia-nvjitlink-cu12            12.4.127
nvidia-nvtx-cu12                 12.4.127
opencv-python-headless           4.11.0.86
packaging                        24.2
pandas                           2.2.3
pathvalidate                     3.2.3
peft                             0.14.0
pillow                           11.1.0
pip                              25.0
portalocker                      3.1.1
prettytable                      3.14.0
propcache                        0.2.1
psutil                           6.1.1
py-cpuinfo                       9.0.0
pyarrow                          19.0.0
pybind11                         2.13.6
pycocotools                      2.0.8
pyparsing                        3.2.1
pytablewriter                    1.2.1
python-dateutil                  2.9.0.post0
pytz                             2025.1
PyYAML                           6.0.2
regex                            2024.11.6
requests                         2.32.3
rouge_score                      0.1.2
sacrebleu                        2.5.1
safetensors                      0.5.2
schema                           0.7.7
scikit-learn                     1.6.1
scipy                            1.15.1
sentencepiece                    0.2.0
setuptools                       75.8.0
six                              1.17.0
sqlitedict                       2.1.0
sympy                            1.13.1
tabledata                        1.3.4
tabulate                         0.9.0
tbb                              2022.0.0
tcmlib                           1.2.0
tcolorpy                         0.1.7
threadpoolctl                    3.5.0
tokenizers                       0.21.0
torch                            2.5.1
torchaudio                       2.5.1
torchvision                      0.20.1
tqdm                             4.67.1
tqdm-multiprocess                0.0.11
transformers                     4.47.1
triton                           3.1.0
typepy                           1.3.4
typing_extensions                4.12.2
tzdata                           2025.1
urllib3                          2.3.0
wcwidth                          0.2.13
wheel                            0.45.1
word2number                      1.1
wrapt                            1.17.2
xxhash                           3.5.0
yarl                             1.18.3
zstandard                        0.23.0

The error output is below:

/scratch/local2/alpha/lib/python3.10/site-packages/auto_round/auto_quantizer.py:191: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['target_backend']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
  warnings.warn(warning_msg)
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
2025-02-07 09:08:32,209 INFO config.py L54: PyTorch version 2.5.1 available.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 71/71 [14:35<00:00, 12.34s/it]
Setting `pad_token_id` to `eos_token_id`:1 for open-end generation.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
Traceback (most recent call last):
  File "/media/research/working_space/lab_one/DeepSeek-V3-int4-sym-awq-inc-cpu/deepseekrun.py", line 46, in <module>
    outputs = model.generate(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/transformers/generation/utils.py", line 2252, in generate
    result = self._sample(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/transformers/generation/utils.py", line 3251, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u892406/.cache/huggingface/modules/transformers_modules/DeepSeek-V3-int4-sym-awq-inc-cpu/modeling_deepseek.py", line 1602, in forward
    outputs = self.model(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u892406/.cache/huggingface/modules/transformers_modules/DeepSeek-V3-int4-sym-awq-inc-cpu/modeling_deepseek.py", line 1471, in forward
    layer_outputs = decoder_layer(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u892406/.cache/huggingface/modules/transformers_modules/DeepSeek-V3-int4-sym-awq-inc-cpu/modeling_deepseek.py", line 1203, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u892406/.cache/huggingface/modules/transformers_modules/DeepSeek-V3-int4-sym-awq-inc-cpu/modeling_deepseek.py", line 770, in forward
    q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/awq/modules/linear/gemm.py", line 270, in forward
    out = WQLinearMMFunction.apply(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/scratch/local2/alpha/lib/python3.10/site-packages/awq/modules/linear/gemm.py", line 54, in forward
    out = awq_ext.gemm_forward_cuda(

autoawq is installed (I tried both the kernels package and the CPU variant, as well as a plain install with no preference). I also tried autoawq 0.2.7, but the error persists in every case. I tried changing and downgrading transformers, but that did not help either. Any hint on how to get rid of this would be appreciated.
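
For context, the traceback ends in awq/modules/linear/gemm.py at a call to awq_ext.gemm_forward_cuda, i.e. the compiled CUDA kernels shipped by the autoawq-kernels package. A minimal diagnostic sketch (assuming only what the traceback shows: that the compiled extension is importable as awq_ext) to check whether those kernels are present, and therefore whether the CUDA path is being selected:

# Diagnostic sketch: check whether the compiled CUDA kernels (`awq_ext`, the
# module named in the traceback) are importable. If they are, autoawq routes
# quantized GEMMs through them, which fails on a CPU-only run.
try:
    import awq_ext  # compiled extension shipped by autoawq-kernels
    print("awq_ext found at:", awq_ext.__file__)  # CUDA kernel path available
except ImportError:
    print("awq_ext not installed; autoawq falls back to a non-CUDA path")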

@Egor-Krivov (Contributor)

Looks like your traceback is incomplete. Could you share the full traceback, including the error?

I think the issue is that you have CUDA installed, and AWQ is probably trying to use the CUDA backend.
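
If that is the cause, one general workaround (standard PyTorch/CUDA behavior, not specific to this repo) is to hide the GPUs from the process before anything imports torch. Whether autoawq then avoids the awq_ext path depends on its version, so this is a sketch rather than a guaranteed fix:

# Hide all GPUs from this process; this must run before torch (or anything
# that imports torch) is imported, otherwise CUDA is already initialized.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.is_available())  # expected: False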

@Tortoise17 (Author)

@Egor-Krivov The local machine has CUDA and GPUs, but they are not big enough to hold the DeepSeek_V3 model. The above is the full traceback. The script used to run the model is below:

from auto_round import AutoRoundConfig  ##must import for autoround format
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

quantized_model_dir = "OPEA/DeepSeek-V3-int4-sym-awq-inc-cpu"

quantization_config = AutoRoundConfig(
    backend="cpu"
)

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cpu",
    revision="16eb0b2",  ## auto-round format; the only difference is config.json
    quantization_config=quantization_config,  ## a CPU-only machine does not need to set this
)

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
    "9.11和9.8哪个数字大",          # "Which number is bigger, 9.11 or 9.8?"
    "strawberry中有几个r?",         # "How many r's are in strawberry?"
    "How many r in strawberry.",
    "There is a girl who likes adventure,",
    "Please give a brief introduction of DeepSeek company.",
    "hello",
]

texts=[]
for prompt in prompts:
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_length=512,
    num_return_sequences=1, 
    do_sample=False
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]

decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)
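
As an aside on the log output above: the line "Setting `pad_token_id` to `eos_token_id`" is standard transformers behavior when a batch is padded without an explicit pad token. It is unrelated to the crash, but it can be silenced by setting the pad token explicitly, e.g.:

# Optional: make the padding behavior explicit (standard transformers usage,
# unrelated to the awq_ext crash itself).
tokenizer.pad_token = tokenizer.eos_token
model.generation_config.pad_token_id = tokenizer.eos_token_id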

@Tortoise17 (Author)

@Egor-Krivov I tried inside Docker (Linux) and the error still occurs. It looks for the GPU, which should not be the case.
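
One way to narrow down where the GPU dependency comes from (a generic PyTorch check, assuming the script above has run up to the from_pretrained call) is to verify that the weights themselves are on CPU; if they are, the CUDA call must originate from the AWQ kernel path rather than from model placement:

# Sanity check: all parameters should report device 'cpu' under device_map="cpu".
import torch
devices = {p.device for p in model.parameters()}
print(devices)  # expected: {device(type='cpu')}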
