feat: splitting multihead attention into all nodes. #46
Conversation
To merge this PR I need to fix the Mixtral & Grok architectures.

I changed the implementation a bit; now there is no synchronization between […].

The final state of the attention synchronization looks like this for a single block:

[diagram]

The previous implementation:

[diagram]
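To make the head-splitting idea concrete, here is a minimal sketch, not the PR's actual code: it assumes the attention heads are divided into contiguous, equal slices across the nodes, so each node can run the attention step for its own heads without synchronizing with the others. The `HeadSlice` struct and `sliceForNode` helper are hypothetical names invented for this illustration.

```cpp
#include <cstdio>

// Hypothetical: a half-open range of attention heads owned by one node.
struct HeadSlice { int start; int end; };

// Assumed helper: split nHeads evenly across nNodes (assumes
// nHeads % nNodes == 0); each node gets a contiguous slice, so the
// attention step for those heads needs no cross-node synchronization.
HeadSlice sliceForNode(int nHeads, int nNodes, int nodeIndex) {
    int perNode = nHeads / nNodes;
    return { nodeIndex * perNode, (nodeIndex + 1) * perNode };
}

int main() {
    const int nHeads = 32, nNodes = 4; // e.g. Llama 3 8B on 4 nodes
    for (int node = 0; node < nNodes; node++) {
        HeadSlice s = sliceForNode(nHeads, nNodes, node);
        printf("node %d computes heads [%d, %d)\n", node, s.start, s.end);
    }
    return 0;
}
```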
Not sure why, but I pulled the latest code and now it won't generate any tokens; it gets stuck here:

[screenshot]

I thought it might be the changes I was working on, as I was cleaning up server.cpp, but then I tried it on main and I get the same behavior.

```
sudo nice -n -20 ./main inference --steps 10 --prompt "Hello World!" --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
```

Then nothing happens; CPU usage goes up to around 70%, but no tokens are generated. Any idea what might be happening?
@DifferentialityDevelopment have you pulled this commit? I accidentally disabled memory allocation.
No, I think it might have been my bad; I just realized I forgot to rebuild the worker with the latest code.
Test
Model: Llama 3 8B Q40
Buffer: Q80
Setup: 4 x Raspberry Pi 5 8GB + TP-Link LS1008G Switch
| Transfer size / token | Avg tokens / second * |
| --- | --- |
* I think the switch used is completely non-deterministic; it achieves a random speed at different times, so I recommend comparing only the avg inference time.
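As a side note, here is a tiny sketch (my own illustration, not code from the test harness) of why averaging over a whole run smooths out the per-token jitter described above; the timing values are made up:

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical per-token generation times in seconds; individual
    // values jitter (e.g. from the switch), but the average is stable.
    std::vector<double> perTokenSeconds = {0.95, 1.30, 0.88, 1.10, 0.92};
    double total = 0.0;
    for (double s : perTokenSeconds) total += s;
    double avgTokensPerSecond = perTokenSeconds.size() / total;
    printf("avg tokens/s over %zu tokens: %.3f\n",
           perTokenSeconds.size(), avgTokensPerSecond);
    return 0;
}
```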