Qwen2.5-Coder-32B + Qwen2.5-Coder-1.5B as draft model
Describe the bug
When loading the Qwen2.5-Coder-32B model through Exui, I get around 20 tokens/s with a 4_25 bpw quant (unrelated, but this is also lower than the 25+ tokens/s I see with llama.cpp). However, when I load the 1.5B version of the model as a draft model, performance drops below 16 tokens/s instead of improving. With llama.cpp, speculative decoding does give me a speedup (a little over 2x).
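To put a number on the regression, here is a minimal sketch that compares the two throughput figures quoted above. The helper name `relative_speed` is only illustrative, and the inputs are the approximate values from this report (about 20 tokens/s without a draft model, below 16 tokens/s with one), so the result is an upper bound on the draft-enabled speed, not a precise measurement.

```python
def relative_speed(baseline_tps: float, draft_tps: float) -> float:
    """Return draft-enabled throughput as a multiple of the baseline throughput."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return draft_tps / baseline_tps


# Approximate figures taken from the report, not fresh measurements.
baseline_tps = 20.0  # 32B model alone
draft_tps = 16.0     # 32B + 1.5B draft model (reported as "below 16")

ratio = relative_speed(baseline_tps, draft_tps)
print(f"{ratio:.2f}x of baseline ({(1 - ratio) * 100:.0f}% slower)")
```

With these inputs the draft-enabled run reaches at most about 0.8x of the baseline, that is, at least a 20% slowdown where a speedup was expected.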
Reproduction steps
Load the 32B model through exui
Ask for a story in a chat
See around 20 tokens/second
Unload the 32B model
Load the 32B model + 1.5B draft model
Ask for another story in a new chat
Expected behavior
A speed boost is obtained.
Actual outcome: The performance regresses.
Logs
No response
Additional context
No response
Acknowledgements
I have looked for similar issues before submitting this one.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will ask my questions politely.
OS
Linux
GPU Library
AMD ROCm
Python version
3.12
Pytorch version
Pulled from https://download.pytorch.org/whl/rocm6.2 yesterday
Model
Qwen2.5-Coder-32B + Qwen2.5-Coder-1.5B as draft model