Yi-Yi 2x34b+ merges generate very slowly. #293
Comments
I looked at the Bagel-Hermes EXL2 model and I'm getting speeds roughly equivalent to a 70B model at the same bitrate. This is not unexpected, since the number of experts per token is set to two. That means it has almost as many active parameters as a 70B model, plus some extra operations during inference to make up the difference.

The 3x34B and 4x34B models should at least have almost the same per-token latency as 2x34B, so there's that. As for the speed dropping with longer context, that's just how transformers work. GGUF isn't going to be any better in that respect, and (at least if you have it installed) ExLlama will use flash-attn, which is still SOTA for exact attention on long contexts (i.e. not counting various context compression and sliding window methods).
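To make the experts-per-token point concrete, here is a minimal sketch of Mixtral-style top-k routing. It is illustrative only, not ExLlama's actual implementation; the class name and dimensions are toy values, and the real expert MLPs are gated SwiGLU blocks. With top_k = 2, every token is pushed through two full-size expert MLPs, so MLP compute per token roughly doubles relative to a single dense 34B, and the gather/scatter around the router adds host-side overhead on top:

```python
# Illustrative sketch of Mixtral-style top-k expert routing -- not ExLlama's
# actual code, and the dims here are toy values, not Yi-34B's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoEBlock(nn.Module):
    def __init__(self, hidden=1024, intermediate=2816, n_experts=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden, n_experts, bias=False)   # tiny router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, intermediate, bias=False),
                          nn.SiLU(),
                          nn.Linear(intermediate, hidden, bias=False))
            for _ in range(n_experts))

    def forward(self, x):                                       # x: [tokens, hidden]
        scores = F.softmax(self.gate(x), dim=-1)                # [tokens, n_experts]
        weights, picks = scores.topk(self.top_k, dim=-1)        # [tokens, top_k]
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):               # one pass per expert
            hit = (picks == e)                                   # slots won by expert e
            rows = hit.any(dim=-1)
            if rows.any():
                w = (weights * hit).sum(dim=-1, keepdim=True)[rows]
                out[rows] += w * expert(x[rows])                 # full-size MLP per hit
        return out
```

With a 2x34B merge and top_k = 2, the router selects both experts for every token, so nothing is actually skipped: you pay the routing overhead and effectively run a ~60B dense model.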
I did perplexity tests on this model. It has to be run with all experts active to be of any benefit, I think even on wikitext, since the router isn't trained.

As for speed: on a 5-bit 70B with 8k max_seq_len I get 14-15 t/s without any serious context piled on, and roughly 22 t/s for the prompt. On Bagel-Hermes I only get 7-8 t/s. It is half as fast doing the exact same prompt and outputting 512 tokens: 30 s vs. 60 s total reply time. If I was getting equivalent speeds I wouldn't have brought it up.
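If these merges are standard Mixtral-architecture checkpoints (which is what mergekit-style stacking usually produces), one way to run the 3x/4x models at "full experts" for a perplexity test is to raise num_experts_per_tok before loading. A hedged sketch with Transformers, assuming the checkpoint exposes the usual Mixtral config fields:

```python
# Sketch only: force all experts active for a perplexity run, assuming the
# merge exposes the usual Mixtral config fields (num_local_experts,
# num_experts_per_tok) in its config.json.
from transformers import AutoConfig, AutoModelForCausalLM

repo = "Weyaxi/Cosmosis-3x34B"                    # one of the merges linked in this issue
cfg = AutoConfig.from_pretrained(repo)
cfg.num_experts_per_tok = cfg.num_local_experts   # use all 3 experts instead of top-2
model = AutoModelForCausalLM.from_pretrained(
    repo, config=cfg, torch_dtype="auto", device_map="auto")
```

ExLlama presumably picks the same fields up from config.json, so editing the file before loading may be the equivalent route, but treat that as an assumption.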
There is some overhead from the routing. It does roughly the same amount of processing as a 70B model, but it's split into smaller portions, so some of that overhead may be what you're seeing. What CPU are you running it on?
Dual 3090s, and a Xeon v4 for the CPU.
That could be part of the reason at least. I'll have to do some profiling to see how different the CPU load is between 70B and 2x34B, but even the fastest Xeon v4 has fairly limited single-core performance. |
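One crude way to separate CPU/routing overhead from GPU math is to time steady-state generation at batch 1 on both models and compare tokens per second. The harness below is a generic sketch; the generate callable is hypothetical, not ExLlama's API:

```python
# Hypothetical timing harness -- wrap whatever generate function you use and
# compare a dense 70B against a 2x34B MoE at the same bitrate and context.
import time

def tokens_per_second(generate_fn, prompt, n_tokens=512, warmup=1, runs=3):
    for _ in range(warmup):
        generate_fn(prompt, n_tokens)              # warm up caches / kernels
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate_fn(prompt, n_tokens)
        times.append(time.perf_counter() - t0)
    return n_tokens / min(times)                   # best run ~ steady-state speed
```

If the 2x34B number moves with single-core CPU speed while the 70B number stays put, that would point at routing/launch overhead rather than the GPU math itself.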
I'm also getting about 12-13 t/s on a 103B @ 3.5 bpw. I loaded it with 8192 context. Maybe it's something about this architecture?
Same question; on a 120B I get about 12 t/s.
There have been a few people stacking Yi models and the results are rather good. Unfortunately they are slower than a 70B, especially when using their extended context.
Can anything be done? Is it because of how the sizes of the matrices end up? I don't hear much about this issue from the GGUF users; correct me if I'm wrong. (A rough parameter count is sketched after the example links below.)
Some examples:
https://huggingface.co/cloudyu/Mixtral_34Bx2_MoE_60B
https://huggingface.co/Weyaxi/Bagel-Hermes-2x34b
https://huggingface.co/Weyaxi/Cosmosis-3x34B
https://huggingface.co/Weyaxi/Astralis-4x34B
They do quantize correctly, at least the 2x: https://huggingface.co/LoneStriker/Bagel-Hermes-2x34b-6.0bpw-h6-exl2
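For a rough sense of why a 2x34B stack lands in 70B territory (the question about matrix sizes above): a Mixtral-style merge shares attention and embeddings but duplicates every MLP, and with two experts active per token all of those weights are touched on every step. A back-of-envelope count using Yi-34B's published dimensions, treating the figures as approximate:

```python
# Back-of-envelope active-parameter count for a Yi 2x34B MoE merge.
# Yi-34B (approx.): hidden 7168, intermediate 20480, 60 layers,
# 56 query heads / 8 KV heads at head_dim 128, vocab 64000.
hidden, inter, layers, vocab = 7168, 20480, 60, 64000
kv_dim = 8 * 128

mlp_per_layer  = 3 * hidden * inter                         # gate/up/down projections
attn_per_layer = 2 * hidden * hidden + 2 * hidden * kv_dim  # q,o + k,v (GQA)
embeddings     = 2 * vocab * hidden                         # input embed + lm_head

dense_34b = layers * (mlp_per_layer + attn_per_layer) + embeddings
moe_2x34b = layers * (2 * mlp_per_layer + attn_per_layer) + embeddings

print(f"dense 34B:       ~{dense_34b / 1e9:.1f}B parameters")
print(f"2x34B MoE total: ~{moe_2x34b / 1e9:.1f}B parameters")
# With num_experts_per_tok = 2 both experts fire for every token, so all
# ~60B parameters are active per step -- compute lands close to a dense 70B,
# which matches the generation speeds reported above.
```

The "60B" in the Mixtral_34Bx2_MoE_60B name lines up with the same arithmetic.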