Replies: 1 comment 1 reply
-
llama.cpp has its own implementation of flash attention that runs on the MI50, but there are a few models, such as DeepSeek, that are not supported.
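For illustration, a minimal sketch of invoking that built-in flash-attention path on a ROCm/HIP build of llama.cpp; the model path, context size, and -ngl value are placeholders, not taken from this thread:

```
# Hedged example: request llama.cpp's own flash-attention kernels with -fa.
# -ngl 99 offloads all layers to the GPUs; adjust paths and sizes as needed.
./llama-server \
  -m ./models/model-Q4_K_M.gguf \
  -ngl 99 \
  -c 8192 \
  -fa
```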
-
Hi, I am playing around with llama-cpp on an AMD EPYC Zen2 with multiple AMD MI50 cards.
I use models quantized with Q4_K_M or Q5_K_M, which suggests the computation could be done in INT8 and would not necessarily need FP16. I noticed that the KV buffers on the MI50 cards are always initialized as FP16.
When I try to change the K and V buffers to Q8_0, the server fails with a message saying that flash attention is required, and AFAIK these cards currently do not support flash attention. But since the cards natively support INT8 calculations according to the datasheet, would it not be possible to restrict the buffers to Q8_0 without flash attention?
Actually, the message I get when calling with
-ctk q8_0 -ctv q8_0 -fa
is: [error output not preserved in the post]
I guess that is not the real reason, as I set both to the same type.
It seems that I can switch the K buffers to Q8_0, but not the V buffers.
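For comparison, a hedged sketch of the configuration that matches this observation: quantizing only the K cache and leaving the V cache at its default FP16, which, as far as I can tell, llama.cpp accepts without flash attention. Paths and values are placeholders:

```
# Only the K cache is stored as Q8_0; V stays FP16, so -fa is not required here.
# Model path and -ngl value are examples.
./llama-server \
  -m ./models/model-Q4_K_M.gguf \
  -ngl 99 \
  -ctk q8_0
```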
Are the calculations on the cards done in INT8 or in FP16 when I use such a model, and how can I find this out myself? I have not yet understood the source code deeply enough to figure it out on my own. Is there a way to force INT8 calculations with INT8 buffers on the cards using ROCm?
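One empirical way to probe this (a sketch, not an authoritative answer) is to compare llama-bench runs with the default FP16 K cache and with a Q8_0 K cache, watching throughput and VRAM usage; the model path and token counts below are placeholders:

```
# Baseline: default FP16 KV cache.
./llama-bench -m ./models/model-Q4_K_M.gguf -ngl 99 -p 512 -n 128

# Same run with the K cache stored as Q8_0.
./llama-bench -m ./models/model-Q4_K_M.gguf -ngl 99 -p 512 -n 128 -ctk q8_0
```

This only shows how the cache is stored and how that affects speed and memory; it does not by itself prove which precision the attention kernels compute in.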