[BUG] Speculative decoding regresses performance on 7900 xtx under ROCM #685

Mushoz · 2024-11-25T14:33:33Z

OS

Linux

GPU Library

AMD ROCm

Python version

3.12

Pytorch version

Pulled from https://download.pytorch.org/whl/rocm6.2 yesterday

Model

Qwen2.5-Coder-32B + Qwen2.5-Coder-1.5B as draft model

Describe the bug

When loading the Qwen2.5-Coder-32B model through ExUI, I get around 20 tokens/s with a 4.25 bpw quant (unrelated, but this is also slower than the 25+ tokens/s I see with llama.cpp). However, when I also load the 1.5B version of the model as a draft model, performance drops to below 16 tokens/s instead of improving. With llama.cpp, speculative decoding does give me a speedup (a little over 2x).
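For context on why a draft model can slow things down as well as speed them up, here is a minimal sketch of the standard idealized speedup estimate for speculative decoding. It is not taken from ExLlamaV2 or llama.cpp. The function name and all parameters (`accept_rate`, `num_draft`, `draft_cost`, `verify_cost`) are assumptions made for illustration, and the model assumes each drafted token is accepted independently with the same probability.

```python
def expected_speedup(accept_rate: float, num_draft: int,
                     draft_cost: float, verify_cost: float = 1.0) -> float:
    """Idealized speedup of speculative decoding over plain decoding.

    accept_rate: probability that a drafted token is accepted (assumed i.i.d.)
    num_draft:   number of tokens drafted before each verification pass
    draft_cost:  time of one draft-model step, in units of one
                 single-token step of the target model
    verify_cost: time of the batched verification pass, in the same units
                 (1.0 means verifying costs the same as one normal step)
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if num_draft < 1:
        raise ValueError("num_draft must be at least 1")

    # Expected tokens emitted per cycle: the accepted prefix plus one
    # token that the target model always contributes.
    if accept_rate == 1.0:
        tokens_per_cycle = num_draft + 1
    else:
        tokens_per_cycle = (1 - accept_rate ** (num_draft + 1)) / (1 - accept_rate)

    # Cost of one cycle: num_draft draft steps plus one verification pass.
    cycle_cost = num_draft * draft_cost + verify_cost
    return tokens_per_cycle / cycle_cost


if __name__ == "__main__":
    # Hypothetical values, not measurements from this issue.
    for verify_cost in (1.0, 2.0, 4.0):
        s = expected_speedup(accept_rate=0.7, num_draft=4,
                             draft_cost=0.05, verify_cost=verify_cost)
        print(f"verify_cost={verify_cost}: speedup {s:.2f}x")
```

Under this model the result falls below 1.0 whenever the draft steps or the verification pass are expensive relative to the tokens accepted. A drop from 20 to 16 tokens/s would be consistent with that, but the sketch cannot say which term is responsible on this setup.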

Reproduction steps

1. Load the 32B model through exui
2. Ask for a story in a chat
3. See around 20 tokens/second
4. Unload the 32B model
5. Load the 32B model + 1.5B draft model
6. Ask for another story in a new chat
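To make the two runs above directly comparable outside the UI, a small timing helper along these lines could be used. This is only a sketch: `generate` is a hypothetical callable that you would wrap around whichever backend is under test (it is not an ExUI or ExLlamaV2 function), and it is assumed to return the number of tokens it actually produced.

```python
import statistics
import time
from typing import Callable


def tokens_per_second(generate: Callable[[str, int], int],
                      prompt: str,
                      max_new_tokens: int,
                      runs: int = 3) -> float:
    """Return the median generation rate over several runs.

    generate: hypothetical wrapper around the backend; takes a prompt and
              a token budget, returns the number of tokens produced.
    """
    rates = []
    for _ in range(runs):
        start = time.perf_counter()
        produced = generate(prompt, max_new_tokens)
        elapsed = time.perf_counter() - start
        rates.append(produced / elapsed)
    # The median is less sensitive to a slow first (warm-up) run than the mean.
    return statistics.median(rates)
```

Using the same prompt, the same token budget, and the same sampling settings with and without the draft model keeps the comparison fair, since the acceptance rate of drafted tokens depends on the text being generated.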

Expected behavior

A speed boost is obtained.

Actual outcome: The performance regresses.

Logs

No response

Additional context

No response

Acknowledgements

I have looked for similar issues before submitting this one.
I understand that the developers have lives and my issue will be answered when possible.
I understand the developers of this program are human, and I will ask my questions politely.

Originalimoc · 2024-11-27T15:23:16Z

Set a lower max seq len; I recommend 16k.
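The comment does not give a reason, but one possible reading is that a large max seq len inflates the KV cache for both the main and the draft model on a 24 GB card. A rough estimate can be sketched as follows. The layer counts, KV head counts, and head dimension below are assumed values for Qwen2.5-32B and Qwen2.5-1.5B and should be checked against each model's `config.json`; an FP16 cache (2 bytes per element) is also an assumption.

```python
GIB = 1024 ** 3


def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for one K and one V tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed architecture values; verify against the actual model configs.
TARGET = dict(num_layers=64, num_kv_heads=8, head_dim=128)  # Qwen2.5-32B (assumed)
DRAFT = dict(num_layers=28, num_kv_heads=2, head_dim=128)   # Qwen2.5-1.5B (assumed)

if __name__ == "__main__":
    for seq_len in (16384, 32768):
        total = (kv_cache_bytes(seq_len=seq_len, **TARGET)
                 + kv_cache_bytes(seq_len=seq_len, **DRAFT))
        print(f"max seq len {seq_len}: about {total / GIB:.2f} GiB of KV cache")
```

Whether cache size actually explains the slowdown reported here is not established in this thread; the estimate only shows how the cache scales with max seq len.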

Mushoz added the bug ("Something isn't working") label · Nov 25, 2024
