Describe the issue

Thank you again for your work.

MInference/minference/modules/minference_forward.py, line 640 (commit 99d18c9)

I think the kernel for each head is executed sequentially:
a single head can hardly utilize the full performance of the GPU, and
warps from different heads miss the chance to overlap, which is important for GPU SIMT execution.

Have you thought about improving the kernel to exploit parallelism across heads? The per-head loop may be fine when the context is extremely long, but the speedup is limited when the sequence is not that long (e.g., 60k), since baseline FlashAttention computes all heads in parallel.
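To make the point concrete, here is a minimal PyTorch sketch (the shapes and the serial_attention / batched_attention helpers are illustrative assumptions, not MInference's actual code): a Python-level loop issues one attention call per head, while a single batched call exposes all heads to the device at once, which is roughly what baseline FlashAttention's launch grid does.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only, not MInference's actual configuration.
batch, num_heads, seq_len, head_dim = 1, 32, 1024, 128
q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_heads, seq_len, head_dim)
v = torch.randn(batch, num_heads, seq_len, head_dim)

def serial_attention(q, k, v):
    # One attention call (and, on GPU, one kernel launch) per head: each call
    # only exposes a single head's work, so moderate sequence lengths leave
    # much of the device idle.
    out = torch.empty_like(q)
    for h in range(q.shape[1]):
        out[:, h] = F.scaled_dot_product_attention(
            q[:, h:h + 1], k[:, h:h + 1], v[:, h:h + 1], is_causal=True
        )[:, 0]
    return out

def batched_attention(q, k, v):
    # One call over all heads: every head's work is scheduled together, so
    # blocks from different heads can overlap, as in FlashAttention.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

torch.testing.assert_close(
    serial_attention(q, k, v), batched_attention(q, k, v), rtol=1e-4, atol=1e-4
)
```

In the Triton kernel, the analogous change would presumably be to fold the head (and batch) index into the launch grid, e.g. grid = (num_query_blocks, batch * num_heads), instead of looping over heads on the host; I have not checked how the sparse indices would need to be reorganized for that.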
Yes, there’s actually still quite a lot of room for optimization. In terms of priority, the next steps would be:
Migrating from Triton to CUDA,
Adding support for the Hopper architecture, and
Supporting GQA (see the sketch at the end of this comment).
Some of these optimizations have already been completed or are in progress by teams such as Qwen (vllm-project/vllm#11844) and SGLang (sgl-project/sglang#5327).
We also have a few releases planned, and we'll update the kernels after the NeurIPS submission.
If you're interested in contributing, we’d be happy to discuss more with you — you're very welcome to join!
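For reference, here is a minimal sketch of what GQA support means at the indexing level (the shapes and the group_size naming are illustrative assumptions, not the actual kernel interface): each KV head is shared by a group of query heads, so the kernel can map query head h to KV head h // group_size instead of materializing repeated K/V.

```python
import torch
import torch.nn.functional as F

# Illustrative GQA shapes: 32 query heads share 8 KV heads (group_size = 4).
batch, seq_len, head_dim = 1, 1024, 128
num_q_heads, num_kv_heads = 32, 8
group_size = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Baseline: expand K/V so every query head gets its own copy (extra memory traffic).
k_rep = k.repeat_interleave(group_size, dim=1)
v_rep = v.repeat_interleave(group_size, dim=1)
out_expanded = F.scaled_dot_product_attention(q, k_rep, v_rep, is_causal=True)

# Kernel-level GQA: index the shared KV head directly (h // group_size), which is
# what the fused kernel's pointer arithmetic would do instead of duplicating K/V.
out_grouped = torch.empty_like(q)
for h in range(num_q_heads):
    kv_h = h // group_size
    out_grouped[:, h] = F.scaled_dot_product_attention(
        q[:, h:h + 1], k[:, kv_h:kv_h + 1], v[:, kv_h:kv_h + 1], is_causal=True
    )[:, 0]

torch.testing.assert_close(out_expanded, out_grouped, rtol=1e-4, atol=1e-4)
```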