AWQ Qwen3-235B-A22B and Qwen3-30B-A3B #1406
Comments
FYI - I also tried with W8A16 and it works; the problem is specific to AWQ.
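For context, a data-free W8A16 run with llm-compressor's `QuantizationModifier` looks roughly like the sketch below. The model id and output directory are placeholders, and the arguments follow the library's published examples rather than the reporter's exact script, so treat this as an assumption-laden sketch, not the command that was actually run.

```python
# Hedged sketch of a W8A16 (weight-only int8) one-shot run; placeholders
# throughout, not the reporter's actual script.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"  # placeholder; swap in the model being tested
OUTPUT_DIR = MODEL_ID.split("/")[-1] + "-W8A16"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Round-to-nearest int8 weights, 16-bit activations; no calibration data needed.
recipe = QuantizationModifier(targets="Linear", scheme="W8A16", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)
model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```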
Hi @ehartford, thanks for your interest in AWQ and for bringing this to our attention. While the non-MoE Qwen3 models seem to run, these MoE models are hanging while resolving the mappings. We use string matches, and runtime increases dramatically when looping over 48 layers, each with 128 experts in the case of Qwen3-30B-A3B. This isn't an issue in AutoAWQ, which has custom wrappers for each model (Qwen3MoE example here). I will try to address this by the end of next week.
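To make the scaling concrete, here is a toy, self-contained illustration (not llm-compressor's actual mapping resolver) of why string-matched mapping resolution slows down on a MoE model: every mapping pattern is checked against every expert submodule name, so the work grows with layers × experts × projections × patterns, and the real resolver does far more per match than a single `fnmatch` call.

```python
# Toy illustration only -- not llm-compressor's actual mapping resolver.
# It shows how string matching over module names scales when a MoE model
# has 48 layers x 128 experts, each with several projections.
import fnmatch
import time

NUM_LAYERS, NUM_EXPERTS = 48, 128
module_names = [
    f"model.layers.{layer}.mlp.experts.{expert}.{proj}"
    for layer in range(NUM_LAYERS)
    for expert in range(NUM_EXPERTS)
    for proj in ("gate_proj", "up_proj", "down_proj")
]

# Hypothetical AWQ-style smooth/balance mapping patterns.
patterns = [
    "*.mlp.experts.*.gate_proj",
    "*.mlp.experts.*.up_proj",
    "*.mlp.experts.*.down_proj",
    "*.self_attn.q_proj",
    "*.self_attn.k_proj",
    "*.self_attn.v_proj",
]

start = time.perf_counter()
matches = {
    pattern: [name for name in module_names if fnmatch.fnmatch(name, pattern)]
    for pattern in patterns
}
elapsed = time.perf_counter() - start
print(f"checked {len(module_names) * len(patterns)} name/pattern pairs "
      f"in {elapsed:.2f}s")
```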
I'm running your AWQ code on a single RTX A6000 (48 GB VRAM). After allocating ~42 GB for the model, it sits with no GPU utilization and a single Python process spinning one CPU core at 100%. I'll let it sit overnight; maybe it will get through the 48 layers x 128 experts eventually?
When I tried
I saw an open issue on the Hugging Face repo too: https://huggingface.co/Qwen/Qwen3-30B-A3B/discussions/12. Will check in later, thanks!
OK, but I think it will hang there forever; I let mine sit overnight.
lmao, it seems like it got through the loop, but then of course it OOM'd when it went to do the actual quantization, hahah.
So if you have enough VRAM you might wake up to the world's first Qwen3-30B-A3B AWQ, who knows xD! Looking at the timestamps in the logs, it took a little over 30 minutes to work through the loop on an AMD Ryzen Threadripper PRO 7965WX (24 cores), with Python running single-threaded on one core.
Yes, it will likely OOM for larger models. We cache the calibration activations for the entire model rather than layer by layer, so memory requirements do not scale well with model size. AutoAWQ handles this, but we need to integrate our own pipelining abstraction and wanted to do that in a follow-up PR. That feature is needed before our AWQ implementation is fully ready; what we have so far is a basic port of AutoAWQ, not quite ready for primetime. Related issue -- #1369 (comment)
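As a rough illustration of the memory point, a conceptual sketch (not llm-compressor's or AutoAWQ's actual pipeline code): caching calibration activations for every layer at once keeps tensors alive for the whole model, while a sequential, layer-by-layer pipeline only ever holds the current layer's inputs.

```python
# Conceptual sketch of the memory trade-off described above; this is not
# llm-compressor's (or AutoAWQ's) actual pipeline code.
import torch

def calibrate_whole_model(layers, hidden):
    """Cache every layer's calibration inputs up front: peak memory grows
    linearly with the number of layers."""
    cached = []
    for layer in layers:
        cached.append(hidden)      # all kept alive until quantization runs
        hidden = layer(hidden)
    return cached                  # later: compute AWQ scales per layer

def calibrate_layer_by_layer(layers, hidden):
    """Sequential pipelining: only the current layer's inputs stay alive,
    so peak memory is roughly constant in the number of layers."""
    for layer in layers:
        layer_inputs = hidden      # freed once this layer is processed
        # ... compute this layer's AWQ scales from layer_inputs here ...
        hidden = layer(layer_inputs)
    return hidden

# Tiny usage example with stand-in layers.
layers = [torch.nn.Linear(16, 16) for _ in range(4)]
calibrate_layer_by_layer(layers, torch.randn(2, 16))
```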
Thanks! Yeah, and there seems to be no support for a CPU backend when I tried. I'd love to get AWQ going and output GGUFs to test against ik_llama.cpp imatrix quants, e.g. my ubergarm/Qwen3-30B-A3B-GGUF. I'm guessing inference speed with vLLM would be better, and I'm not sure how to test perplexity, KLD, etc. on AWQ quants. Anyway, that's beyond the scope of this issue. Cheers and thanks for all your efforts!
@ehartford I just got this running a moment ago; it takes about 17 GB of VRAM to load, plus roughly as much again for parallel inference slots:
Not sure how they quantized their model, but maybe it was done the way you were trying, given enough time and VRAM.
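For reference, loading an already-quantized AWQ checkpoint in vLLM looks roughly like the sketch below. The repo id is a placeholder for whichever community AWQ upload was being tested, and `gpu_memory_utilization` is what governs how much memory is left for KV cache and parallel requests; this is an assumed setup, not the exact command used above.

```python
# Hedged sketch of running an AWQ checkpoint with vLLM; the model id is a
# placeholder, not the exact checkpoint referenced in this thread.
from vllm import LLM, SamplingParams

llm = LLM(
    model="someuser/Qwen3-30B-A3B-AWQ",   # hypothetical community AWQ repo
    quantization="awq",
    max_model_len=8192,
    gpu_memory_utilization=0.90,          # headroom for KV cache / parallel requests
)

outputs = llm.generate(
    ["Explain activation-aware weight quantization in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```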
Hi @ubergarm, yes, AWQ will require a GPU to run in a reasonable amount of time for most models. We have that somewhat hard-coded for now, and we'll have better support for offloaded models in a future release. Yeah, I noticed Qwen publishes some AWQ models (https://huggingface.co/Qwen/Qwen3-32B-AWQ) but no MoE models. There do seem to be lots in the community, though 💪
Describe the bug
When I try to quantize these models with AWQ, it hangs forever.
Expected behavior
I expect it to quantize the model.
Environment
NVIDIA DGX A100
To Reproduce
I used examples/awq/awq_one_shot.py and modified it:
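The modified script itself wasn't captured above. As a stand-in, a minimal AWQ one-shot call along the lines of that example might look like the sketch below; the dataset, sample counts, and AWQModifier arguments are assumptions that may differ between llm-compressor releases, and this is not the reporter's actual modification.

```python
# Hedged reconstruction of the kind of change made to examples/awq/awq_one_shot.py.
# Not the reporter's actual script; AWQModifier arguments may differ by release.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"  # also attempted with Qwen/Qwen3-235B-A22B

oneshot(
    model=MODEL_ID,
    dataset="open_platypus",                        # assumed calibration dataset
    recipe=[AWQModifier(bits=4, symmetric=False)],  # assumed arguments
    max_seq_length=512,
    num_calibration_samples=128,
    output_dir=MODEL_ID.split("/")[-1] + "-AWQ",
)
```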
The output: