Replies: 1 comment 1 reply
-
llama.cpp has its own implementation of flash attention that runs on the MI50, but there are a few models, such as DeepSeek, that are not supported.
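For illustration, a minimal sketch of invoking that built-in flash-attention path on a ROCm/HIP build of llama.cpp; the model path, context size, and -ngl value are placeholders, not taken from this thread:

```
# Hedged example: request llama.cpp's own flash-attention kernels with -fa.
# -ngl 99 offloads all layers to the GPUs; adjust paths and sizes as needed.
./llama-server \
  -m ./models/model-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  -fa
```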
-
Hi, I am playing around with llama-cpp on an AMD EPYC Zen2 with multiple AMD MI50 cards.
I use models quantized with Q4_K_M or Q5_K_M, which suggests the computation could be done in INT8 and would not necessarily need FP16. I noticed that the KV buffers on the MI50 cards are always initialized as FP16.
When I try to change the K and V buffers to Q8_0, the server fails with a message saying that flash attention is required, and AFAIK these cards currently do not support flash attention. But since the cards natively support INT8 calculations according to the datasheet, would it not be possible to restrict the buffers to Q8_0 without flash attention?
Actually, the message I get when calling with
-ctk q8_0 -ctv q8_0 -fa
is: [error output not preserved in the post]
I guess that is not the real reason, as I set both to the same type.
It seems that I can switch the K buffers to Q8_0, but not the V buffers.
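For comparison, a hedged sketch of the configuration that matches this observation: quantizing only the K cache and leaving the V cache at its default FP16, which, as far as I can tell, llama.cpp accepts without flash attention. Paths and values are placeholders:

```
# Only the K cache is stored as Q8_0; V stays FP16, so -fa is not required here.
# Model path and -ngl value are examples.
./llama-server \
  -m ./models/model-Q4_K_M.gguf \
  -ngl 99 \
  -ctk q8_0
```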
Are the calculations on the cards done in INT8 or in FP16 when I use such a model, and how can I find this out myself? I have not yet understood the source code deeply enough to figure it out on my own. Is there a way to force INT8 calculations with INT8 buffers on the cards using ROCm?
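One empirical way to probe this (a sketch, not an authoritative answer) is to compare llama-bench runs with the default FP16 K cache and with a Q8_0 K cache, watching throughput and VRAM usage; the model path and token counts below are placeholders:

```
# Baseline: default FP16 KV cache.
./llama-bench -m ./models/model-Q4_K_M.gguf -ngl 99 -p 512 -n 128

# Same run with the K cache stored as Q8_0.
./llama-bench -m ./models/model-Q4_K_M.gguf -ngl 99 -p 512 -n 128 -ctk q8_0
```

This only shows how the cache is stored and how that affects speed and memory; it does not by itself prove which precision the attention kernels compute in.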