feat: splitting multihead attention into all nodes. #46

Merged: 6 commits merged into main on May 13, 2024

Conversation

b4rtaz (Owner) commented May 11, 2024

Test

Model: Llama 3 8B Q40
Buffer: Q80
Setup: 4 x Raspberry Pi 5 8GB + TP-Link LS1008G Switch

Transfer size / token

| Devices | 0.3.0 | This PR | Percentage change |
|---|---|---|---|
| 2 x Raspberry Pi 5 | S 646 kB + R 476 kB = 1122 kB | S 578 kB + R 442 kB = 1020 kB | -9.09% |
| 4 x Raspberry Pi 5 | S 2295 kB + R 714 kB = 3009 kB | S 2193 kB + R 663 kB = 2856 kB | -5.08% |
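
(The percentage change appears to be computed from the combined send + receive totals, e.g. for 2 devices: (1020 - 1122) / 1122 ≈ -9.09%.)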

Avg time / token

| Devices | Metric | 0.3.0 | This PR | Percentage change |
|---|---|---|---|---|
| 2 x Raspberry Pi 5 | Avg generation time | 444.27 ms | 381.81 ms | |
| 2 x Raspberry Pi 5 | Avg inference time | 362.73 ms | 349.94 ms | -3.53% |
| 2 x Raspberry Pi 5 | Avg transfer time | 80.11 ms | 30.31 ms* | |
| 4 x Raspberry Pi 5 | Avg generation time | 331.47 ms | 359.44 ms | |
| 4 x Raspberry Pi 5 | Avg inference time | 267.62 ms | 258.00 ms | -3.59% |
| 4 x Raspberry Pi 5 | Avg transfer time | 62.34 ms | 99.69 ms | |

* I think the switch used here is completely non-deterministic; it achieves a different, random speed at different times. So I recommend comparing only the avg inference time.
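
(Avg generation time appears to be roughly avg inference time plus avg transfer time, e.g. 349.94 ms + 30.31 ms ≈ 380 ms for 2 devices in this PR, so a noisy transfer time from the switch dominates any generation-time comparison.)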

b4rtaz (Owner, Author) commented May 11, 2024

To merge this PR I need to fix the Mixtral & Grok architectures.

b4rtaz (Owner, Author) commented May 11, 2024

I changed the implementation a bit; now there is no synchronization between llamaQuantizeMultiheadAtt and llamaAtt.

Transfer size / token

| Devices | 0.3.0 | This PR v2 | Percentage change |
|---|---|---|---|
| 2 devices | S 646 kB + R 476 kB = 1122 kB | S 510 kB + R 442 kB = 952 kB | -15.15% |
| 4 devices | S 2295 kB + R 714 kB = 3009 kB | S 1887 kB + R 867 kB = 2754 kB | -8.47% |
| 8 devices | S 5771 kB + R 833 kB = 6604 kB | S 4819 kB + R 1487 kB = 6306 kB | -4.51% |

The final state of the attention synchronization looks like this for a single block:

root --- xb  ---> node
root <-- xbv ---- node
merge att

The previous implementation:

root --- xb  --> node
root <-- q  ---- node
root <-- k  ---- node
root <-- v  ---- node
root --- xb ---> node
root <-- xb2 --- node
merge att
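
For illustration only, here is a minimal, self-contained C++ sketch of the root-side flow implied by the new scheme. The names (nodeAttentionSlice, rootAttentionBlock) are hypothetical placeholders, not the project's actual task or socket API, and the worker step is simulated with a local function so the example compiles and runs:

```cpp
// attention_sync_sketch.cpp -- a sketch of the per-block sync pattern above,
// assuming a hypothetical root/worker split. Not the project's real code.
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical worker-side step: the node receives xb, runs its slice of the
// attention heads locally, and returns only its xbv slice. In the real system
// each node holds a different slice of the attention weights.
static std::vector<float> nodeAttentionSlice(const std::vector<float>& xb,
                                             std::size_t sliceDim) {
    std::vector<float> xbv(sliceDim, 0.0f);
    for (std::size_t i = 0; i < sliceDim && i < xb.size(); i++)
        xbv[i] = xb[i] * 0.5f; // placeholder for q/k/v projections, rope, softmax, V
    return xbv;
}

// Root-side flow for one block in the new scheme:
//   root --- xb  ---> node
//   root <-- xbv ---- node
//   merge att
static std::vector<float> rootAttentionBlock(const std::vector<float>& xb,
                                             std::size_t nNodes) {
    const std::size_t sliceDim = xb.size() / nNodes;
    std::vector<float> merged;
    merged.reserve(xb.size());
    for (std::size_t n = 0; n < nNodes; n++) {
        // One send (xb) and one receive (xbv) per node per block.
        std::vector<float> xbv = nodeAttentionSlice(xb, sliceDim);
        // "merge att": here the per-node slices are simply concatenated.
        merged.insert(merged.end(), xbv.begin(), xbv.end());
    }
    return merged;
}

int main() {
    std::vector<float> xb(4096, 1.0f);               // dim = 4096, as in the test model
    std::vector<float> out = rootAttentionBlock(xb, 4); // 4 devices
    std::printf("merged %zu values\n", out.size());
    return 0;
}
```

The difference is visible in the two diagrams: previously q, k and v had to travel back to the root and xb had to be sent a second time for every block, while the new scheme needs only one xb send and one xbv receive per node per block, which is where the reduced transfer size per token comes from.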

b4rtaz marked this pull request as ready for review on May 13, 2024, 21:26
b4rtaz merged commit af8b317 into main on May 13, 2024 (2 checks passed)
DifferentialityDevelopment (Contributor) commented
Not sure why, but I pulled the latest code and now it won't generate any tokens; it gets stuck here:
float* logits = inference->infer(token, pos);

I thought it might be the changes I was working on, since I was cleaning up server.cpp, but then I tried it on main and got the same behavior.

sudo nice -n -20 ./main inference --steps 10 --prompt "Hello World!" --model ~/Meta-Llama-3-8B-Instruct-Distributed/dllama_original_q40.bin --tokenizer ~/Meta-Llama-3-8B-Instruct-Distributed/dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 8 --workers 192.168.1.3:9990
💡 arch: llama2
💡 dim: 4096
💡 hiddenDim: 14336
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 vocabSize: 128256
💡 seqLen: 2048
💡 nSlices: 2
💡 ropeTheta: 500000.0
📄 bosId: 128000
📄 eosId: 128001
🕒 ropeCache: 16384 kB
⏩ Loaded 6175568 kB

Then nothing happens: CPU usage goes up to around 70%, but no tokens are generated. Any idea what might be happening?

b4rtaz (Owner, Author) commented May 14, 2024

@DifferentialityDevelopment have you pulled this commit? I accidentally disabled memory allocation.

DifferentialityDevelopment (Contributor) commented

No, I think it might have been my fault; I just realized I forgot to rebuild the worker with the latest code.

b4rtaz deleted the feat/qkv branch on May 18, 2024, 11:50