
[Question]: Potential chance for better attention kernel #140


Open
GuoYiFantastic opened this issue Apr 24, 2025 · 1 comment
Assignees
Labels
question Further information is requested

Comments

@GuoYiFantastic
Contributor

Describe the issue

Thank you again for your work.

`for head in range(query_states.size(1)):`

I think the kernel for each head is executed sequentially:

  1. a single head is unlikely to saturate the GPU on its own
  2. warps from different heads lose the chance to overlap, which matters for GPU SIMT latency hiding

Have you thought about improving the kernel to compute the heads in parallel? The per-head loop is probably fine when the context is long enough (maybe), but the speedup is limited when the sequence is not that long, e.g. 60k, since the FlashAttention baseline computes all heads in parallel.
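
For illustration only, here is a minimal sketch of the contrast, using PyTorch's `scaled_dot_product_attention` as a stand-in for this repo's sparse kernel and made-up tensor shapes; the point is just that the fused call exposes the whole head dimension to one launch instead of serializing it:

```python
import torch
import torch.nn.functional as F

# Toy shapes (batch, heads, seq_len, head_dim); not the project's real config.
B, H, L, D = 1, 8, 1024, 64
query_states = torch.randn(B, H, L, D)
key_states = torch.randn(B, H, L, D)
value_states = torch.randn(B, H, L, D)

# Pattern discussed in the issue: one kernel call per head, so the heads'
# work is serialized and each launch only sees a single head's tiles.
out_loop = torch.empty_like(query_states)
for head in range(query_states.size(1)):
    out_loop[:, head] = F.scaled_dot_product_attention(
        query_states[:, head : head + 1],
        key_states[:, head : head + 1],
        value_states[:, head : head + 1],
        is_causal=True,
    )[:, 0]

# Head-parallel alternative: hand the full (B, H, L, D) tensors to one fused
# call so all heads are scheduled together, as FlashAttention does.
out_fused = F.scaled_dot_product_attention(
    query_states, key_states, value_states, is_causal=True
)

torch.testing.assert_close(out_loop, out_fused, atol=1e-4, rtol=1e-4)
```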

@GuoYiFantastic GuoYiFantastic added the question Further information is requested label Apr 24, 2025
@iofu728 iofu728 self-assigned this Apr 28, 2025
@iofu728
Contributor

iofu728 commented Apr 28, 2025

Hi @GuoYiFantastic, thanks for your support!

Yes, there’s actually still quite a lot of room for optimization. In terms of priority, the next steps would be:

  1. Migrating from Triton to CUDA,
  2. Adding support for Hopper architecture, and
  3. Supporting GQA.

Some of these optimizations have already been completed or are in progress by teams like Qwen (vllm-project/vllm#11844) and SGLang (sgl-project/sglang#5327).
We also have a few releases planned, and we'll update the kernels after the NeurIPS submission.

If you're interested in contributing, we’d be happy to discuss more with you — you're very welcome to join!
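
As background on item 3, not part of the maintainer's reply: in grouped-query attention several query heads share one KV head, so a GQA-aware kernel mainly needs the q-head → kv-head index mapping rather than materializing duplicated KV tensors. A rough sketch with toy shapes (not this repo's kernel):

```python
import torch

# Toy GQA layout: 8 query heads share 2 KV heads (group size 4).
B, H_Q, H_KV, L, D = 1, 8, 2, 1024, 64
group_size = H_Q // H_KV

k = torch.randn(B, H_KV, L, D)
v = torch.randn(B, H_KV, L, D)

# Simplest (memory-heavy) approach: expand KV so every query head gets a copy.
k_expanded = k.repeat_interleave(group_size, dim=1)  # (B, H_Q, L, D)
v_expanded = v.repeat_interleave(group_size, dim=1)

# A GQA-aware kernel would instead compute kv_head = q_head // group_size
# inside the kernel and read the shared KV blocks directly, without copies.
for q_head in range(H_Q):
    kv_head = q_head // group_size
    assert torch.equal(k_expanded[:, q_head], k[:, kv_head])
    assert torch.equal(v_expanded[:, q_head], v[:, kv_head])
```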
