feat: splitting multihead attention into all nodes. #46

Merged: 6 commits merged into main on May 13, 2024

Conversation

b4rtaz (Owner) commented May 11, 2024

Test

Model: Llama 3 8B Q40
Buffer: Q80
Setup: 4 x Raspberry Pi 5 8GB + TP-Link LS1008G Switch

Transfer size / token

| Devices | 0.3.0 | This PR | Percentage change |
|---|---|---|---|
| 2 x Raspberry Pi 5 | S 646 kB + R 476 kB = 1122 kB | S 578 kB + R 442 kB = 1020 kB | -9.09% |
| 4 x Raspberry Pi 5 | S 2295 kB + R 714 kB = 3009 kB | S 2193 kB + R 663 kB = 2856 kB | -5.08% |
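
(The percentage change appears to be computed from the combined send + receive totals, e.g. for 2 devices: (1020 - 1122) / 1122 ≈ -9.09%.)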

Avg time / token

| Devices | Metric | 0.3.0 | This PR | Percentage change |
|---|---|---|---|---|
| 2 x Raspberry Pi 5 | Avg generation time | 444.27 ms | 381.81 ms | |
| 2 x Raspberry Pi 5 | Avg inference time | 362.73 ms | 349.94 ms | -3.53% |
| 2 x Raspberry Pi 5 | Avg transfer time | 80.11 ms | 30.31 ms* | |
| 4 x Raspberry Pi 5 | Avg generation time | 331.47 ms | 359.44 ms | |
| 4 x Raspberry Pi 5 | Avg inference time | 267.62 ms | 258.00 ms | -3.59% |
| 4 x Raspberry Pi 5 | Avg transfer time | 62.34 ms | 99.69 ms | |

* I think the switch used here is completely non-deterministic; it achieves a different, random speed at different times. So I recommend comparing only the avg inference time.
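
(Avg generation time appears to be roughly avg inference time plus avg transfer time, e.g. 349.94 ms + 30.31 ms ≈ 380 ms for 2 devices in this PR, so a noisy transfer time from the switch dominates any generation-time comparison.)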

b4rtaz (Owner, Author) commented May 11, 2024

To merge this PR I need to fix the Mixtral & Grok architectures.

b4rtaz (Owner, Author) commented May 11, 2024

I changed the implementation a bit; now there is no synchronization between llamaQuantizeMultiheadAtt and llamaAtt.

Transfer size / token

| Devices | 0.3.0 | This PR v2 | Percentage change |
|---|---|---|---|
| 2 devices | S 646 kB + R 476 kB = 1122 kB | S 510 kB + R 442 kB = 952 kB | -15.15% |
| 4 devices | S 2295 kB + R 714 kB = 3009 kB | S 1887 kB + R 867 kB = 2754 kB | -8.47% |
| 8 devices | S 5771 kB + R 833 kB = 6604 kB | S 4819 kB + R 1487 kB = 6306 kB | -4.51% |

The final state of the attention synchronization looks like this for a single block:

root --- xb  ---> node
root <-- xbv ---- node
merge att

The previous implementation:

root --- xb  --> node
root <-- q  ---- node
root <-- k  ---- node
root <-- v  ---- node
root --- xb ---> node
root <-- xb2 --- node
merge att
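
For illustration only, here is a minimal, self-contained C++ sketch of the root-side flow implied by the new scheme. The names (nodeAttentionSlice, rootAttentionBlock) are hypothetical placeholders, not the project's actual task or socket API, and the worker step is simulated with a local function so the example compiles and runs:

```cpp
// attention_sync_sketch.cpp -- a sketch of the per-block sync pattern above,
// assuming a hypothetical root/worker split. Not the project's real code.
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical worker-side step: the node receives xb, runs its slice of the
// attention heads locally, and returns only its xbv slice. In the real system
// each node holds a different slice of the attention weights.
static std::vector<float> nodeAttentionSlice(const std::vector<float>& xb,
                                             std::size_t sliceDim) {
    std::vector<float> xbv(sliceDim, 0.0f);
    for (std::size_t i = 0; i < sliceDim && i < xb.size(); i++)
        xbv[i] = xb[i] * 0.5f; // placeholder for q/k/v projections, rope, softmax, V
    return xbv;
}

// Root-side flow for one block in the new scheme:
//   root --- xb  ---> node
//   root <-- xbv ---- node
//   merge att
static std::vector<float> rootAttentionBlock(const std::vector<float>& xb,
                                             std::size_t nNodes) {
    const std::size_t sliceDim = xb.size() / nNodes;
    std::vector<float> merged;
    merged.reserve(xb.size());
    for (std::size_t n = 0; n < nNodes; n++) {
        // One send (xb) and one receive (xbv) per node per block.
        std::vector<float> xbv = nodeAttentionSlice(xb, sliceDim);
        // "merge att": here the per-node slices are simply concatenated.
        merged.insert(merged.end(), xbv.begin(), xbv.end());
    }
    return merged;
}

int main() {
    std::vector<float> xb(4096, 1.0f);               // dim = 4096, as in the test model
    std::vector<float> out = rootAttentionBlock(xb, 4); // 4 devices
    std::printf("merged %zu values\n", out.size());
    return 0;
}
```

The difference is visible in the two diagrams: previously q, k and v had to travel back to the root and xb had to be sent a second time for every block, while the new scheme needs only one xb send and one xbv receive per node per block, which is where the reduced transfer size per token comes from.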

b4rtaz marked this pull request as ready for review on May 13, 2024, 21:26
b4rtaz merged commit af8b317 into main on May 13, 2024 (2 checks passed)
DifferentialityDevelopment (Contributor) commented
Not sure why, but I pulled the latest code and now it won't generate any tokens; it gets stuck here:
float* logits = inference->infer(token, pos);

I thought it might be the changes I was working on, since I was cleaning up server.cpp, but then I tried it on main and got the same behavior.

sudo nice -n -20 ./main inference --steps 10 --prompt "Hello World!" --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 2
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 16384 kB
⏩ Loaded 6175568 kB

Then nothing happens: CPU usage goes up to around 70%, but no tokens are generated. Any idea what might be happening?

b4rtaz (Owner, Author) commented May 14, 2024

@DifferentialityDevelopment have you pulled this commit? I accidentally disabled memory allocation.

DifferentialityDevelopment (Contributor) commented

No, I think it might have been my fault; I just realized I forgot to rebuild the worker with the latest code.

b4rtaz deleted the feat/qkv branch on May 18, 2024, 11:50