Skip to content

b2894 #121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
May 15, 2024
Merged

b2894 #121

merged 2 commits into from
May 15, 2024

Conversation

Nexesenex
Copy link
Owner

No description provided.

agray3 and others added 2 commits May 15, 2024 15:44
As discussed in PR #6766, CUDA graphs were being disabled in the presence of long prompts.
This fixes the issue by avoiding the consective update counter from incrementing unnecessarily
for tokens in which cuda graphs are disabled due to batch size > 1.
* Just reordering some structs.

* Adding in the calls to mm_pause

* Passing around the state

* Renaming and moving a bunch of variables around.

* Extracting the logic to it's own function.

* Moving some variable definitions into the chunk function.

* Moving some variables around

* moving src1_cont inside

* Moving row_size

* adding the current_chunk

* Reorg the code.

* Formatting to match the orig patch

* starting to setup the chunking variables

* Starting the buildup of the loop

* The yield shouldn't be necessary.

* adding the looping structure based on the chunk configuration.

* Add in the re-chunking code.

* Making it much more likely to rechunk.

* disable resizing if numa is enabled.

* Updating comments with what we've learned.

* Fix formatting

* Couple more formatting fixes.

* More style fixes.

* Fix Warnings

* Going with unused because there's conditional logic that needs it.

* Update ggml.c

* Update ggml.c

---------
@Nexesenex Nexesenex merged commit b58ce7c into Nexesenex:downstream May 15, 2024
10 of 12 checks passed
Nexesenex pushed a commit that referenced this pull request Dec 22, 2024
* Adding q5_0_r4

We get PP-512(LLaMA-3.1-8B) = 256.7 t/s on a Ryzen-7950X.
We even get TG-128 improvement to 11.7 t/s from 11.1 t/s.

* q5_0_r4: NEON

We get PP-512(LLaMA-3.1-8B) = 99.6 t/s on M2-Max,
up from 71.0 t/s for Q5_0. The difference to mainline llama.cpp
is no longer funny: they get 26.5 t/s for Q5_0.

For TG, we are nor able to fully saturate memory bandwidth
and arrive at 22.1 t/s @ 8 threads. Mainline llama.cpp gets
20.6 t/s for Q5_0.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants