Error after loading deepseekv3_cpu #707

Open
Tortoise17 opened this issue Feb 7, 2025 · 3 comments

Comments

@Tortoise17 commented Feb 7, 2025

I am trying to load the DeepSeek_V3 model for inference, but after the model loads, I hit the error below.

The environment packages are listed below:


# pip list
Package                          Version
-------------------------------- -----------
absl-py                          2.1.0
accelerate                       1.3.0
aiohappyeyeballs                 2.4.4
aiohttp                          3.11.12
aiosignal                        1.3.2
async-timeout                    5.0.1
attrs                            25.1.0
auto_round                       0.4.5
autoawq                          0.2.8
autoawq_kernels                  0.0.9
certifi                          2025.1.31
chardet                          5.2.0
charset-normalizer               3.4.1
click                            8.1.8
colorama                         0.4.6
contourpy                        1.3.1
cycler                           0.12.1
DataProperty                     1.1.0
datasets                         3.2.0
Deprecated                       1.2.18
dill                             0.3.8
einops                           0.8.0
evaluate                         0.4.3
filelock                         3.17.0
flash-attn                       2.7.3
fonttools                        4.55.8
frozenlist                       1.5.0
fsspec                           2024.9.0
huggingface-hub                  0.28.1
idna                             3.10
intel_extension_for_pytorch      2.5.0
intel-extension-for-transformers 1.4.2
Jinja2                           3.1.5
joblib                           1.4.2
jsonlines                        4.0.0
kiwisolver                       1.4.8
llvmlite                         0.44.0
lm_eval                          0.4.7
lxml                             5.3.0
MarkupSafe                       3.0.2
matplotlib                       3.10.0
mbstrdecoder                     1.1.4
more-itertools                   10.6.0
mpmath                           1.3.0
multidict                        6.1.0
multiprocess                     0.70.16
networkx                         3.4.2
neural_compressor                3.2
nltk                             3.9.1
numba                            0.61.0
numexpr                          2.10.2
numpy                            1.26.4
nvidia-cublas-cu12               12.4.5.8
nvidia-cuda-cupti-cu12           12.4.127
nvidia-cuda-nvrtc-cu12           12.4.127
nvidia-cuda-runtime-cu12         12.4.127
nvidia-cudnn-cu12                9.1.0.70
nvidia-cufft-cu12                11.2.1.3
nvidia-curand-cu12               10.3.5.147
nvidia-cusolver-cu12             11.6.1.9
nvidia-cusparse-cu12             12.3.1.170
nvidia-cusparselt-cu12           0.6.2
nvidia-nccl-cu12                 2.21.5
nvidia-nvjitlink-cu12            12.4.127
nvidia-nvtx-cu12                 12.4.127
opencv-python-headless           4.11.0.86
packaging                        24.2
pandas                           2.2.3
pathvalidate                     3.2.3
peft                             0.14.0
pillow                           11.1.0
pip                              25.0
portalocker                      3.1.1
prettytable                      3.14.0
propcache                        0.2.1
psutil                           6.1.1
py-cpuinfo                       9.0.0
pyarrow                          19.0.0
pybind11                         2.13.6
pycocotools                      2.0.8
pyparsing                        3.2.1
pytablewriter                    1.2.1
python-dateutil                  2.9.0.post0
pytz                             2025.1
PyYAML                           6.0.2
regex                            2024.11.6
requests                         2.32.3
rouge_score                      0.1.2
sacrebleu                        2.5.1
safetensors                      0.5.2
schema                           0.7.7
scikit-learn                     1.6.1
scipy                            1.15.1
sentencepiece                    0.2.0
setuptools                       75.8.0
six                              1.17.0
sqlitedict                       2.1.0
sympy                            1.13.1
tabledata                        1.3.4
tabulate                         0.9.0
tbb                              2022.0.0
tcmlib                           1.2.0
tcolorpy                         0.1.7
threadpoolctl                    3.5.0
tokenizers                       0.21.0
torch                            2.5.1
torchaudio                       2.5.1
torchvision                      0.20.1
tqdm                             4.67.1
tqdm-multiprocess                0.0.11
transformers                     4.47.1
triton                           3.1.0
typepy                           1.3.4
typing_extensions                4.12.2
tzdata                           2025.1
urllib3                          2.3.0
wcwidth                          0.2.13
wheel                            0.45.1
word2number                      1.1
wrapt                            1.17.2
xxhash                           3.5.0
yarl                             1.18.3
zstandard                        0.23.0

The error output is below:

/scratch/local2/alpha/lib/python3.10/site-packages/auto_round/auto_quantizer.py:191: UserWarning: You passed `quantization_config` or equivalent parameters to `from_pretrained` but the model you're loading already has a `quantization_config` attribute. The `quantization_config` from the model will be used.However, loading attributes (e.g. ['target_backend']) will be overwritten with the one you passed to `from_pretrained`. The rest will be ignored.
  warnings.warn(warning_msg)
We suggest you to set `torch_dtype=torch.float16` for better efficiency with AWQ.
2025-02-07 09:08:32,209 INFO config.py L54: PyTorch version 2.5.1 available.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 71/71 [14:35<00:00, 12.34s/it]
Setting `pad_token_id` to `eos_token_id`:1 for open-end generation.
The `seen_tokens` attribute is deprecated and will be removed in v4.41. Use the `cache_position` model input instead.
`get_max_cache()` is deprecated for all Cache classes. Use `get_max_cache_shape()` instead. Calling `get_max_cache()` will raise error from v4.48
Traceback (most recent call last):
  File "/media/research/working_space/lab_one/DeepSeek-V3-int4-sym-awq-inc-cpu/deepseekrun.py", line 46, in <module>
    outputs = model.generate(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/transformers/generation/utils.py", line 2252, in generate
    result = self._sample(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/transformers/generation/utils.py", line 3251, in _sample
    outputs = self(**model_inputs, return_dict=True)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u892406/.cache/huggingface/modules/transformers_modules/DeepSeek-V3-int4-sym-awq-inc-cpu/modeling_deepseek.py", line 1602, in forward
    outputs = self.model(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u892406/.cache/huggingface/modules/transformers_modules/DeepSeek-V3-int4-sym-awq-inc-cpu/modeling_deepseek.py", line 1471, in forward
    layer_outputs = decoder_layer(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u892406/.cache/huggingface/modules/transformers_modules/DeepSeek-V3-int4-sym-awq-inc-cpu/modeling_deepseek.py", line 1203, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/u892406/.cache/huggingface/modules/transformers_modules/DeepSeek-V3-int4-sym-awq-inc-cpu/modeling_deepseek.py", line 770, in forward
    q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/scratch/local2/alpha/lib/python3.10/site-packages/awq/modules/linear/gemm.py", line 270, in forward
    out = WQLinearMMFunction.apply(
  File "/scratch/local2/alpha/lib/python3.10/site-packages/torch/autograd/function.py", line 575, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/scratch/local2/alpha/lib/python3.10/site-packages/awq/modules/linear/gemm.py", line 54, in forward
    out = awq_ext.gemm_forward_cuda(

autoawq is installed (I tried both the kernels package and the CPU variant, as well as a plain install with no preference). I also tried autoawq 0.2.7, but the error persists in every case. I tried changing and downgrading transformers, but that did not help either. Any hint on how to get rid of this would be appreciated.
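
For context, the traceback ends in awq/modules/linear/gemm.py at a call to awq_ext.gemm_forward_cuda, i.e. the compiled CUDA kernels shipped by the autoawq-kernels package. A minimal diagnostic sketch (assuming only what the traceback shows: that the compiled extension is importable as awq_ext) to check whether those kernels are present, and therefore whether the CUDA path is being selected:

# Diagnostic sketch: check whether the compiled CUDA kernels (`awq_ext`, the
# module named in the traceback) are importable. If they are, autoawq routes
# quantized GEMMs through them, which fails on a CPU-only run.
try:
    import awq_ext  # compiled extension shipped by autoawq-kernels
    print("awq_ext found at:", awq_ext.__file__)  # CUDA kernel path available
except ImportError:
    print("awq_ext not installed; autoawq falls back to a non-CUDA path")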

@Egor-Krivov (Contributor)

Looks like your traceback is incomplete. Could you share the full traceback, including the error?

I think the issue is that you have CUDA installed, and AWQ is probably trying to use the CUDA backend.
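
If that is the cause, one general workaround (standard PyTorch/CUDA behavior, not specific to this repo) is to hide the GPUs from the process before anything imports torch. Whether autoawq then avoids the awq_ext path depends on its version, so this is a sketch rather than a guaranteed fix:

# Hide all GPUs from this process; this must run before torch (or anything
# that imports torch) is imported, otherwise CUDA is already initialized.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.is_available())  # expected: False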

@Tortoise17 (Author)

@Egor-Krivov The local machine has CUDA and GPUs, but they are not big enough to hold the DeepSeek_V3 model. The above is the full traceback. The script used to run the model is below:

from auto_round import AutoRoundConfig  ##must import for autoround format
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

quantized_model_dir = "OPEA/DeepSeek-V3-int4-sym-awq-inc-cpu"

quantization_config = AutoRoundConfig(
    backend="cpu"
)

model = AutoModelForCausalLM.from_pretrained(
    quantized_model_dir,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="cpu",
    revision="16eb0b2",  ## auto-round format; the only difference is config.json
    quantization_config=quantization_config,  ## a CPU-only machine does not need to set this
)

tokenizer = AutoTokenizer.from_pretrained(quantized_model_dir, trust_remote_code=True)
prompts = [
    "9.11和9.8哪个数字大",          # "Which number is bigger, 9.11 or 9.8?"
    "strawberry中有几个r?",         # "How many r's are in strawberry?"
    "How many r in strawberry.",
    "There is a girl who likes adventure,",
    "Please give a brief introduction of DeepSeek company.",
    "hello",
]

texts=[]
for prompt in prompts:
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    texts.append(text)
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

outputs = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    max_length=512,
    num_return_sequences=1, 
    do_sample=False
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs["input_ids"], outputs)
]

decoded_outputs = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for i, prompt in enumerate(prompts):
    print(f"Prompt: {prompt}")
    print(f"Generated: {decoded_outputs[i]}")
    print("-" * 50)
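
As an aside on the log output above: the line "Setting `pad_token_id` to `eos_token_id`" is standard transformers behavior when a batch is padded without an explicit pad token. It is unrelated to the crash, but it can be silenced by setting the pad token explicitly, e.g.:

# Optional: make the padding behavior explicit (standard transformers usage,
# unrelated to the awq_ext crash itself).
tokenizer.pad_token = tokenizer.eos_token
model.generation_config.pad_token_id = tokenizer.eos_token_id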

@Tortoise17 (Author)

@Egor-Krivov I tried inside Docker (Linux) and the error still occurs. It looks for the GPU, which should not be the case.
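
One way to narrow down where the GPU dependency comes from (a generic PyTorch check, assuming the script above has run up to the from_pretrained call) is to verify that the weights themselves are on CPU; if they are, the CUDA call must originate from the AWQ kernel path rather than from model placement:

# Sanity check: all parameters should report device 'cpu' under device_map="cpu".
import torch
devices = {p.device for p in model.parameters()}
print(devices)  # expected: {device(type='cpu')}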
