Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support #11844
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀
Force-pushed from 82b5a4c to 4c4a33e
I see that you have
Force-pushed from 4c4a33e to 6b7c49e
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 6b7c49e to 35aac26
Force-pushed from 35aac26 to 91d5476
All conflicts fixed, could you please take another look? Thanks!
st] = decode_metadata.block_tables[i, st:ed]
decode_metadata.block_tables_intra = block_tables_intra

seq_lens_succ = (chunk_num_curr -
When I ran the Needle-in-a-Haystack test with Qwen-7B and Llama-8B (code modified to support Llama), there is a bug that produces a negative number once the context goes beyond roughly 13k–15k tokens.
I modified the code as below and confirmed that it works.
seq_lens_succ = ((chunk_num_curr - (chunk_num_curr - 1).clip(min=0)) * chunk_len)
This pull request has merge conflicts that must be resolved before it can be merged.
I thought it had been fixed, so I tested it, but I still have the same problem as below.
Force-pushed from 91d5476 to c8781cd
The dual chunk attention doesn't support CUDA graph, and I have added an assertion in
It is indeed a bug introduced while preparing this PR; fixed. Thanks!
Force-pushed from c8781cd to 8648b1e
Rebased against main. Hi @youkaichao @simon-mo @WoosukKwon, do you folks think there are still things that need to be improved in this pull request? Thanks!
Spotted a few bits of commented-out code that look like debug cruft or are otherwise mysterious. Could you clean those up, and any other similar spots?
This pull request has merge conflicts that must be resolved before it can be merged.
qc_freqs = torch.einsum("i,j -> ij", qc_t, inv_freq)
k_freqs = torch.einsum("i,j -> ij", k_t, inv_freq)
qc_no_clamp_freqs = torch.einsum("i,j -> ij", qc_no_clamp_t, inv_freq)
q_inter_freqs = torch.einsum("i,j -> ij", q_inter_t, inv_freq)
nit: I think these einsums are still slower on CUDA than (a * b).sum(-1); not on the hot path though, so not critical.
ran bench_einsum.py from that issue on an H100 and got:

python einsum_bench.py
                                  |  mul/sum  |  torch.einsum  |  numpy.einsum
1 threads: ---------------------------------------------------------------------
  Nc,Nc->N cpu  (1048576, 2)      |    5000   |      3100      |      4000
  Nc,Nc->N cuda (1048576, 2)      |      20   |       747      |      3300

Times are in microseconds (us).
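For reference, the einsums in the hunk above are plain outer products, so they can also be written with broadcasting (or torch.outer) instead of einsum. A minimal sketch, assuming qc_t and inv_freq are 1-D float tensors as in the rotary-embedding code (the shapes below are made up for illustration):

import torch

# Hypothetical shapes: qc_t is a 1-D tensor of positions, inv_freq a 1-D tensor
# of inverse frequencies, mirroring the rotary-embedding usage above.
qc_t = torch.arange(16, dtype=torch.float32)
inv_freq = 1.0 / (10000.0 ** (torch.arange(0, 8, 2, dtype=torch.float32) / 8))

# "i,j -> ij" is an outer product, so all three forms give the same result.
freqs_einsum = torch.einsum("i,j -> ij", qc_t, inv_freq)
freqs_broadcast = qc_t[:, None] * inv_freq[None, :]  # avoids einsum dispatch overhead
freqs_outer = torch.outer(qc_t, inv_freq)            # dedicated op for the same thing

assert torch.allclose(freqs_einsum, freqs_broadcast)
assert torch.allclose(freqs_einsum, freqs_outer)

As noted, this code is not on the hot path, so whether the change is worth making is a judgment call.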
vllm/attention/layer.py (outdated)
logits_soft_cap, attn_type, **{
    "dual_chunk_attention_config": dual_chunk_attention_config,
    "prefix": prefix,
} if dual_chunk_attention_config is not None else {})
I feel like this is messy; I think we should maybe do something like:

def __init__(..., **extra_attn_kwargs):
    self.impl = impl_cls(..., **extra_attn_kwargs)

The challenge here is that prefix would not be captured by extra_attn_kwargs but is only (currently) used by DualChunkFlashAttentionImpl. I do think it would be less messy to do this and make prefix a standard arg for attention impls, given that it is pretty generic. Thoughts @WoosukKwon?
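To make the proposal concrete, a rough sketch of the suggested pattern (class names, signatures, and the backend-selection step below are illustrative, not vLLM's actual API):

# Illustrative sketch only; these are hypothetical stand-ins for vLLM's
# Attention layer and AttentionImpl classes.

class DualChunkFlashAttentionImpl:
    def __init__(self, num_heads, head_size, scale, prefix="",
                 dual_chunk_attention_config=None):
        # `prefix` identifies the layer (e.g. "model.layers.0.self_attn.attn")
        # and is treated as a standard impl argument here.
        self.prefix = prefix
        self.dual_chunk_attention_config = dual_chunk_attention_config or {}


class Attention:
    def __init__(self, num_heads, head_size, scale, prefix="", **extra_impl_args):
        # Anything the layer itself does not consume (here, the dual-chunk
        # config) is forwarded unchanged to the backend implementation.
        self.impl = DualChunkFlashAttentionImpl(
            num_heads, head_size, scale, prefix=prefix, **extra_impl_args)


attn = Attention(8, 128, 0.088,
                 prefix="model.layers.0.self_attn.attn",
                 dual_chunk_attention_config={"chunk_size": 8192})

This keeps the call site free of the inline dict-unpacking above: backends that don't need the extra arguments simply never receive them.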
vllm/attention/layer.py (outdated)
if self.dual_chunk_attention_config:
    assert query_succ_and_inter is not None
    dca_kwargs = {
        "query_succ": query_succ_and_inter[0],
        "query_inter": query_succ_and_inter[1],
        "query_succ_critical": query_succ_and_inter[2],
        "query_inter_critical": query_succ_and_inter[3],
    } if query_succ_and_inter else {}
else:
    dca_kwargs = {}
I think we should try hard to see if there is a cleaner way of passing these; maybe they can be bundled into a single q tensor that gets reinterpreted as components via a combination of slicing and .view calls in the attn impl?
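As an illustration of the bundling idea (shapes and variable names below are assumptions, not this PR's actual layout), the four query variants could be stacked into one tensor on the caller side and recovered as views inside the attention impl:

import torch

num_tokens, num_heads, head_size = 4, 8, 64

# The four query variants produced by dual chunk attention (assumed shapes).
query_succ = torch.randn(num_tokens, num_heads, head_size)
query_inter = torch.randn(num_tokens, num_heads, head_size)
query_succ_critical = torch.randn(num_tokens, num_heads, head_size)
query_inter_critical = torch.randn(num_tokens, num_heads, head_size)

# Caller side: bundle everything into a single tensor along a new leading dim.
bundled = torch.stack(
    [query_succ, query_inter, query_succ_critical, query_inter_critical], dim=0)

# Impl side: recover the components without copies; unbind returns views of `bundled`.
q_succ, q_inter, q_succ_crit, q_inter_crit = bundled.unbind(dim=0)
assert torch.equal(q_succ, query_succ)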
I will give it a try and see if it can be simplified.
Hi @LucasWilkinson, thanks for these comments. I have rebased this branch onto current main, removed those example prompts and provided them as URLs instead, and addressed the reviewer comments above in this PR. Now I think it should be ready for landing. Before landing, a bugfix in flash-attention should be merged first: vllm-project/flash-attention#60. After that, I will revise the dependency version of vllm-flash-attention in this PR.
Thank you for rebasing on current main! The code looks pretty clean to me now.
Before this lands, I think we should make sure there's a plan in place to get support for this in vLLM V1. vLLM has switched to V1 by default and we are trying to deprecate V0.
A couple of questions:
- What will happen with this PR when running Qwen2 on systems where the dual-chunk attention backend is not supported? (e.g. AMD GPUs, TPUs, etc)
- Does vLLM automatically fall back to V0 when using dual-chunk attention?
with urlopen("https://qianwen-res.oss-cn-beijing.aliyuncs.com"
             "/Qwen2.5-1M/test-data/600k.txt") as response:
Please add a timeout to this
Suggested change:
- with urlopen("https://qianwen-res.oss-cn-beijing.aliyuncs.com"
-              "/Qwen2.5-1M/test-data/600k.txt") as response:
+ with urlopen("https://qianwen-res.oss-cn-beijing.aliyuncs.com"
+              "/Qwen2.5-1M/test-data/600k.txt", timeout=5) as response:
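For completeness, a minimal standalone sketch of the fetch with the suggested timeout (same URL as the example script; the 5-second value follows the suggestion above):

from urllib.request import urlopen

URL = ("https://qianwen-res.oss-cn-beijing.aliyuncs.com"
       "/Qwen2.5-1M/test-data/600k.txt")

# The timeout applies to the connection and to blocking socket reads, so a
# stalled download raises an error instead of hanging the test indefinitely.
with urlopen(URL, timeout=5) as response:
    prompt = response.read().decode("utf-8")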
Fixed.
We have started migrating the Qwen-related changes in our internal repo to V1, since V1 has become the default option in vLLM. The dual-chunk-attn backend will be adapted to V1 too, and most of the changesets can be reused. I have added an assertion in arg_utils.py to check that the current platform is CUDA (the sparse_attn_func is only available in vllm-project/flash-attention for CUDA) and that the current engine is V0.
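A rough sketch of what such a guard could look like (the function name and checks below are illustrative; the actual assertion lives in arg_utils.py and uses vLLM's own platform and engine abstractions):

import os
import torch

def check_dual_chunk_attention_supported() -> None:
    # Illustrative guard only; not the real arg_utils.py code.
    if not torch.cuda.is_available():
        raise ValueError(
            "Dual chunk attention requires CUDA: sparse_attn_func is only "
            "provided by vllm-flash-attention for CUDA platforms.")
    if os.environ.get("VLLM_USE_V1", "0") == "1":
        raise ValueError(
            "Dual chunk attention is currently only supported on the V0 "
            "engine; set VLLM_USE_V1=0.")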
vllm-project/flash-attention#60 has landed; can you please update this PR?
Force-pushed from 795190d to a3efe49
Done, and rebased to main.
Force-pushed from 5bbbfe2 to e30cd11
@LucasWilkinson I have rebased to current main again. Could you please take another look at this PR? Thanks!
This pull request has merge conflicts that must be resolved before it can be merged.
This can actually be left as 0a721daebe4fa7149f06ecf3d3eabeb6dcd0f1fa
since that includes the PR you need
Apologies, overall this looks good now! Thanks for all the updates; the only things left to see on my end would be:
Force-pushed from e30cd11 to 0dbb63f
Hi @LucasWilkinson, thanks for the feedback. The first three comments have been addressed.
@sighingnow Thanks for the update! Looking into the CI failure, it does not appear to be related (V1 code; this PR does not touch V1), but this is a bit out of my area of expertise, so I'm asking around (cc @russellb).
This pull request has merge conflicts that must be resolved before it can be merged.
…h sparse attention support. Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
Force-pushed from 0dbb63f to 44bf2ab
Rebased against main again. The failed test cases shouldn't be caused by this PR; they failed on a speculative decoding case, and it seems that case is not executed for all PRs.
…h sparse attention support (vllm-project#11844)
This PR implements the dual-chunk flash attention, a training-free method to extend model context length (see also #6139), with sparse attention (https://github.com/microsoft/MInference) support.
This PR requires the sparse attention kernel from vllm-flash-attention. Qwen models with 1M context length support will be open-sourced in the next one or two weeks, and unit tests will be added later.
FIX #12452
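For context, a hedged usage sketch of how the backend might be enabled once this lands (the backend name follows this PR; the model name, environment variables, and values are assumptions based on the Qwen2.5-1M release material and should be checked against the final docs):

import os

# Assumed selection mechanism: force the attention backend added by this PR
# and stay on the V0 engine, which is what the PR currently targets.
os.environ["VLLM_ATTENTION_BACKEND"] = "DUAL_CHUNK_FLASH_ATTN"
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct-1M",  # assumed model id
          max_model_len=1_048_576,              # 1M-token context
          enforce_eager=True)                   # dual chunk attention does not support CUDA graph
out = llm.generate(["Summarize the following document: ..."],
                   SamplingParams(max_tokens=128))
print(out[0].outputs[0].text)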