
Releases: b4rtaz/distributed-llama

0.13.6

26 Apr 12:24
a16d2f0

This version adds two new arguments:

  • --net-turbo 0 - disables non-blocking sockets,
  • --gpu-segments <from>:<to> - specifies which segments of the neural network are loaded onto the GPU. Currently, this option is intended only for skipping the first layer (embedding); other settings may not work.

These options made it possible to run Llama 3.3 70B Q40 on 4 x NVIDIA RTX 3060 12 GB. Check this test.
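
For illustration only, a root invocation combining both new flags might look like the sketch below. The model/tokenizer paths and the --gpu-segments range are placeholders, not values taken from the linked test:

# hypothetical sketch - disable non-blocking sockets and control which
# segments are loaded onto the GPU (paths and segment range are placeholders)
./dllama inference --prompt "Tensor parallelism is all you need" --steps 64 \
   --model models/llama3_3_70b_instruct_q40/dllama_model_llama3_3_70b_instruct_q40.m \
   --tokenizer models/llama3_3_70b_instruct_q40/dllama_tokenizer_llama3_3_70b_instruct_q40.t \
   --buffer-float-type q80 --nthreads 1 --gpu-index 0 \
   --net-turbo 0 --gpu-segments 1:80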

0.13.5

19 Apr 20:25
8909be9

The Vulkan matmul shader (matmul-forward-q80-q40-f32.comp) was optimized.

Tested on NVIDIA Tesla T4 16 GB using the llama3_1_8b_instruct_q40 model with --buffer-float-type q80.

Before:

🔶 Pred  151 ms Sync    0 ms | Sent     0 kB Recv     0 kB | )
🔶 Pred  151 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  to
🔶 Pred  153 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  be

This version:

🔶 Pred   97 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  obtained
🔶 Pred   96 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  by
🔶 Pred   96 ms Sync    0 ms | Sent     0 kB Recv     0 kB |  pert
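
These numbers were collected in the standard inference mode; a command along the lines of the sketch below (prompt and paths are assumptions, mirroring the 0.13.2 Colab example further down) produces this kind of per-token timing:

# assumed invocation matching the tested configuration (q80 buffers, GPU 0)
./dllama inference --prompt "Tensor parallelism is all you need" --steps 128 \
   --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
   --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
   --buffer-float-type q80 --max-seq-len 4096 --nthreads 1 --gpu-index 0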

0.13.4

13 Apr 21:37
50dfb13

This version includes a few optimizations in the Vulkan shader, especially for the OP_MULTIHEAD_ATT operation.

0.13.3

09 Apr 20:14
afa6297

This version fixes the selection of memory type in Vulkan, significantly improving inference speed on NVIDIA GPUs.

With this, it's now possible to run Distributed Llama on two GPUs within the same machine; check this test.
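
As a rough sketch only (the worker/root flags below follow the project's usual multi-node flow and are assumptions, not the exact commands from the linked test), two processes on the same machine can each be pinned to a different GPU:

# terminal 1 - worker pinned to the second GPU (assumed flags)
./dllama worker --port 9998 --nthreads 1 --gpu-index 1

# terminal 2 - root node pinned to the first GPU, connecting to the local worker
./dllama inference --prompt "Tensor parallelism is all you need" --steps 64 \
   --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
   --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
   --buffer-float-type q80 --nthreads 1 --gpu-index 0 --workers 127.0.0.1:9998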

0.13.2

01 Apr 14:36

This version fixes Vulkan support on NVIDIA GPUs. It was successfully executed on Google Colab with an NVIDIA T4 GPU.


How to run Distributed Llama on Google Colab?

# install drivers
!wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo tee /etc/apt/trusted.gpg.d/lunarg.asc
!sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-1.4.309-jammy.list https://packages.lunarg.com/vulkan/1.4.309/lunarg-vulkan-1.4.309-jammy.list
!sudo apt update
!sudo apt install -y vulkan-sdk libnvidia-gl-525

# check
!nvidia-smi
!vulkaninfo | grep "GPU id"

# install
!git clone https://github.com/b4rtaz/distributed-llama.git
!cd distributed-llama && rm -rf *.o && rm -rf src/nn/vulkan/*.spv && DLLAMA_VULKAN=1 make dllama

# model
!cd distributed-llama && python3 launch.py llama3_1_8b_instruct_q40

# run
!cd distributed-llama && ./dllama inference --prompt "Tensor parallelism is all you need" --steps 128 \
   --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
   --buffer-float-type q80 --max-seq-len 4096 --nthreads 1 --gpu-index 0

0.13.1

26 Mar 15:04
e208f50

This version introduces an optimized matmul-forward-q80-q40-f32 shader for Vulkan. This change should speed up inference. More details here.

0.13.0 🌋

23 Mar 22:11
31ff8f4

This version introduces experimental GPU support using Vulkan. While Vulkan integration is still a work in progress - especially regarding shader performance - this marks the first step toward full GPU support.

How to build Distributed Llama with Vulkan support?

DLLAMA_VULKAN=1 make dllama

To run Distributed Llama with Vulkan, add the --gpu-index 0 argument. For example:

./dllama inference ... --gpu-index 0

Please ensure that the Vulkan SDK is installed on your machine. You can run the following command to check:

vulkaninfo

0.12.8

04 Mar 10:15
a91745d

This version extends the metrics reported in inference mode.

...
💿 Weights loaded
Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing
🔷️ Eval  534 ms Sync  100 ms | Sent  6912 kB Recv 12540 kB | (24 tokens)
🔶 Pred   68 ms Sync   25 ms | Sent   288 kB Recv   522 kB |  them
🔶 Pred   58 ms Sync   15 ms | Sent   288 kB Recv   522 kB |  with
🔶 Pred   57 ms Sync   11 ms | Sent   288 kB Recv   522 kB |  TP
🔶 Pred   43 ms Sync   18 ms | Sent   288 kB Recv   522 kB | .
...
🔶 Pred   47 ms Sync   15 ms | Sent   288 kB Recv   522 kB |  used
🔶 Pred   52 ms Sync   32 ms | Sent   288 kB Recv   522 kB |  in
🔶 Pred   42 ms Sync   11 ms | Sent   288 kB Recv   522 kB |  deep
🔶 Pred   44 ms Sync   10 ms | Sent   288 kB Recv   522 kB |  learning

Evaluation
   nBatches: 32
    nTokens: 24
   tokens/s: 37.83 (26.43 ms/tok)
Prediction
    nTokens: 40
   tokens/s: 16.10 (62.10 ms/tok)

0.12.7

18 Feb 21:22
f8113c1

This version fixes an issue with loading large models, such as Llama 3.1 405B (#169).

🚨 This version includes updates to the communication protocol. Please update all instances of Distributed Llama.

0.12.6

17 Feb 19:55
415946f

  • Fixed a bug in launch.py that caused the model to be redownloaded even when the user opted not to download it again.
  • The model name is now returned by the API (see the sketch after this list).
  • The quantizeF32toQ80 function now includes an implementation that uses AVX2 instructions.
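
As a hedged sketch of the API point above (the endpoint, port, and response shape are assumptions based on the OpenAI-compatible dllama-api server, not verified against this release), a chat completion response should now carry the model name in its model field:

# assumed port and endpoint; adjust to your dllama-api setup
curl -s http://localhost:9999/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello"}]}'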