
Yi-Yi 2x34b+ merges generate very slowly. #293

Open

Ph0rk0z opened this issue Jan 19, 2024 · 8 comments
@Ph0rk0z

Ph0rk0z commented Jan 19, 2024

There have been a few people stacking Yi models, and the results are rather good. Unfortunately, they are slower than a 70B, especially when using their extended context.

Can anything be done? Is it because of how the sizes of the matrices end up? I don't hear much about this issue from GGUF users; correct me if I'm wrong.

Some examples:
https://huggingface.co/cloudyu/Mixtral_34Bx2_MoE_60B

https://huggingface.co/Weyaxi/Bagel-Hermes-2x34b

https://huggingface.co/Weyaxi/Cosmosis-3x34B

https://huggingface.co/Weyaxi/Astralis-4x34B

They do quantize correctly, at least the 2x: https://huggingface.co/LoneStriker/Bagel-Hermes-2x34b-6.0bpw-h6-exl2

@turboderp
Member

I looked at the Bagel-Hermes EXL2 model and I'm getting speeds roughly equivalent to a 70B model at the same bitrate.

This is not unexpected since the number of experts per token is set to two. That means it has almost as many parameters as a 70B model and some extra operations during inference to make up the difference. You can run it with -ept 1 or change the num_experts_per_tok value in the config.json to limit it to one expert. Then it runs faster but it's anyone's guess if it works any better than either of the 34B models it was made from.
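
For reference, a minimal sketch of that config.json edit (equivalent to passing -ept 1); the model path below is a placeholder, not a path from this thread:

```python
import json

# Hypothetical location of the downloaded EXL2 model; adjust to your setup.
path = "/models/Bagel-Hermes-2x34b-exl2/config.json"

with open(path) as f:
    cfg = json.load(f)

print("was:", cfg.get("num_experts_per_tok"))  # set to 2 in these merges
cfg["num_experts_per_tok"] = 1                 # route each token through a single expert

with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```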

The 3x34B and 4x34B models at least should have almost the same per-token latency as 2x34B, so there's that.

As for the speed dropping with longer context, that's just how transformers work. GGUF isn't going to be any better in that respect, and (at least if you have it installed) ExLlama will use flash-attn which is still SOTA for exact attention on long contexts (i.e. not counting various context compression and sliding window methods.)

@Ph0rk0z
Author

Ph0rk0z commented Jan 20, 2024

I did perplexity tests on this model. It has to be run with all experts active to be of any benefit, I think even on wikitext, since the router isn't trained.
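
A minimal sketch of the kind of chunked perplexity check this refers to, using Hugging Face transformers (the model id is one of the merges linked above; the eval file name is a placeholder, and this is not the exact script used):

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cloudyu/Mixtral_34Bx2_MoE_60B"        # one of the merges from this thread
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

# Placeholder eval text, e.g. a slice of wikitext.
ids = tok(open("wikitext_sample.txt").read(), return_tensors="pt").input_ids
ids = ids.to(model.device)

chunk, nll_sum, n_tokens = 2048, 0.0, 0
with torch.no_grad():
    for i in range(0, ids.size(1), chunk):
        piece = ids[:, i : i + chunk]
        if piece.size(1) < 2:
            break
        loss = model(piece, labels=piece).loss     # mean NLL over this chunk
        nll_sum += loss.item() * (piece.size(1) - 1)
        n_tokens += piece.size(1) - 1

print("perplexity:", math.exp(nll_sum / n_tokens))
```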

As for speeds: on a 5-bit 70B with 8k max_seq_len I get 14-15 t/s without any serious context piled on, and roughly 22 t/s for the prompt. On Bagel-Hermes I only get 7-8. It is half as fast doing the exact same prompt and outputting 512 tokens: 30 s vs. 60 s total reply time. If I were getting equivalent speeds I wouldn't have brought it up.

@turboderp
Member

There is some overhead from the routing. It does roughly the same amount of processing as a 70B model, but the work is split into smaller portions, so that may be the overhead you're seeing. What CPU are you running it on?

@Ph0rk0z
Author

Ph0rk0z commented Jan 20, 2024

Dual 3090s, and a Xeon v4 for the CPU.

@turboderp
Member

That could be part of the reason at least. I'll have to do some profiling to see how different the CPU load is between 70B and 2x34B, but even the fastest Xeon v4 has fairly limited single-core performance.
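
A rough, self-contained illustration of the suspicion here (not exllamav2 code): the same total amount of matmul work, split into more and smaller CUDA kernel launches, spends proportionally more time on the host, which hurts most on a CPU with weak single-core performance:

```python
import time
import torch

dev = "cuda"
x = torch.randn(1, 8192, device=dev, dtype=torch.float16)
dense = torch.randn(8192, 8192, device=dev, dtype=torch.float16)        # one big projection
experts = [torch.randn(1024, 8192, device=dev, dtype=torch.float16)     # same total work, 8 pieces
           for _ in range(8)]

def bench(fn, iters=200):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e6   # microseconds per iteration

print("1 x (8192x8192) matmul :", bench(lambda: x @ dense.T), "us")
print("8 x (1024x8192) matmuls:", bench(lambda: [x @ w.T for w in experts]), "us")
```

The gap between the two numbers is mostly host-side launch and Python overhead, so it grows as single-core CPU speed drops.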

@yamosin

yamosin commented Jan 20, 2024

I have almost the same configuration as you: a Xeon E5-2676 v3 + 3x3090 (only using two). I only get 3.5 t/s on the 2x34B, and the speed is the same whether I use 4.65bpw or 6bpw, but I get 10~12 t/s on Goliath 3bpw.
Although I can see the CPU core usage, a simple test shows that limiting it to one core or several does not affect t/s.
This is running Goliath with usage limited to a single core, and the t/s doesn't change:
[screenshot: CPU core usage while running Goliath limited to a single core]

This is the CPU footprint when running the 2x34B limited to 4 cores / 1 core:
[screenshots: CPU usage for the 2x34B run, limited to 4 cores and to 1 core]

INFO: Metrics: 50 tokens generated in 14.53 seconds (3.44 T/s, context 2467 tokens)
INFO:     127.0.0.1:53707 - "POST /v1/completions HTTP/1.1" 200 OK
INFO: Metrics: 50 tokens generated in 14.25 seconds (3.51 T/s, context 2467 tokens)

I hope this provides some relevant information.

@Ph0rk0z
Author

Ph0rk0z commented Jan 20, 2024

I'm also getting about 12-13 t/s on a 103B at 3.5bpw. I loaded it with 8192 context. Maybe it's something about this architecture?

@sat0r1r1

Same question: on a 120B I can get about 12 t/s. However, Yi-34Bx2-MOE-60B only gets 3 t/s.
I've tried exl2 at 3, 4, and 5 bit, and the result is the same.
