Describe the issue

Thank you again for your work.

MInference/minference/modules/minference_forward.py, line 640 (commit 99d18c9)

I think the kernel for each head is executed sequentially:
a single head can hardly utilize the full performance of the GPU, and
warps from different heads miss the chance to overlap, which is important for GPU SIMT execution.

Have you thought about improving the kernel to exploit parallelism across heads? The per-head loop may be fine when the context is extremely long, but the speedup is limited when the sequence is not that long (e.g., 60k), since baseline FlashAttention computes all heads in parallel.
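To make the point concrete, here is a minimal PyTorch sketch (the shapes and the serial_attention / batched_attention helpers are illustrative assumptions, not MInference's actual code): a Python-level loop issues one attention call per head, while a single batched call exposes all heads to the device at once, which is roughly what baseline FlashAttention's launch grid does.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only, not MInference's actual configuration.
batch, num_heads, seq_len, head_dim = 1, 32, 1024, 128
q = torch.randn(batch, num_heads, seq_len, head_dim)
k = torch.randn(batch, num_heads, seq_len, head_dim)
v = torch.randn(batch, num_heads, seq_len, head_dim)

def serial_attention(q, k, v):
    # One attention call (and, on GPU, one kernel launch) per head: each call
    # only exposes a single head's work, so moderate sequence lengths leave
    # much of the device idle.
    out = torch.empty_like(q)
    for h in range(q.shape[1]):
        out[:, h] = F.scaled_dot_product_attention(
            q[:, h:h + 1], k[:, h:h + 1], v[:, h:h + 1], is_causal=True
        )[:, 0]
    return out

def batched_attention(q, k, v):
    # One call over all heads: every head's work is scheduled together, so
    # blocks from different heads can overlap, as in FlashAttention.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

torch.testing.assert_close(
    serial_attention(q, k, v), batched_attention(q, k, v), rtol=1e-4, atol=1e-4
)
```

In the Triton kernel, the analogous change would presumably be to fold the head (and batch) index into the launch grid, e.g. grid = (num_query_blocks, batch * num_heads), instead of looping over heads on the host; I have not checked how the sparse indices would need to be reorganized for that.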
Yes, there’s actually still quite a lot of room for optimization. In terms of priority, the next steps would be:
Migrating from Triton to CUDA,
Adding support for the Hopper architecture, and
Supporting GQA (see the sketch at the end of this comment).
Some of these optimizations have already been completed or are in progress by teams such as Qwen (vllm-project/vllm#11844) and SGLang (sgl-project/sglang#5327).
We also have a few releases planned, and we'll update the kernels after the NeurIPS submission.
If you're interested in contributing, we’d be happy to discuss more with you — you're very welcome to join!
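For reference, here is a minimal sketch of what GQA support means at the indexing level (the shapes and the group_size naming are illustrative assumptions, not the actual kernel interface): each KV head is shared by a group of query heads, so the kernel can map query head h to KV head h // group_size instead of materializing repeated K/V.

```python
import torch
import torch.nn.functional as F

# Illustrative GQA shapes: 32 query heads share 8 KV heads (group_size = 4).
batch, seq_len, head_dim = 1, 1024, 128
num_q_heads, num_kv_heads = 32, 8
group_size = num_q_heads // num_kv_heads

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Baseline: expand K/V so every query head gets its own copy (extra memory traffic).
k_rep = k.repeat_interleave(group_size, dim=1)
v_rep = v.repeat_interleave(group_size, dim=1)
out_expanded = F.scaled_dot_product_attention(q, k_rep, v_rep, is_causal=True)

# Kernel-level GQA: index the shared KV head directly (h // group_size), which is
# what the fused kernel's pointer arithmetic would do instead of duplicating K/V.
out_grouped = torch.empty_like(q)
for h in range(num_q_heads):
    kv_h = h // group_size
    out_grouped[:, h] = F.scaled_dot_product_attention(
        q[:, h:h + 1], k[:, kv_h:kv_h + 1], v[:, kv_h:kv_h + 1], is_causal=True
    )[:, 0]

torch.testing.assert_close(out_expanded, out_grouped, rtol=1e-4, atol=1e-4)
```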