Releases: b4rtaz/distributed-llama
0.13.6
This version adds two new arguments:
- --net-turbo 0 allows disabling non-blocking sockets,
- --gpu-segments <from>:<to> allows specifying which segments of the neural network are loaded onto the GPU. Currently, this option is intended only for skipping the first layer (embedding); other settings may not work.
These options made it possible to run Llama 3.3 70B Q40 on 4 x NVIDIA RTX 3060 12 GB. Check this test.
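A minimal sketch combining the two flags on one command line; the flags themselves come from this release, but the segment range below is an assumption (chosen to skip the embedding layer), and the remaining arguments are elided:
# Assumption: 1:32 skips segment 0 (embedding) and loads the remaining segments onto the GPU
./dllama inference ... --net-turbo 0 --gpu-segments 1:32 --gpu-index 0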
0.13.5
The Vulkan matmul shader (matmul-forward-q80-q40-f32.comp) was optimized.
Tested on an NVIDIA Tesla T4 16 GB using the llama3_1_8b_instruct_q40 model with --buffer-float-type q80.
Before:
🔶 Pred 151 ms Sync 0 ms | Sent 0 kB Recv 0 kB | )
🔶 Pred 151 ms Sync 0 ms | Sent 0 kB Recv 0 kB | to
🔶 Pred 153 ms Sync 0 ms | Sent 0 kB Recv 0 kB | be
This version:
🔶 Pred 97 ms Sync 0 ms | Sent 0 kB Recv 0 kB | obtained
🔶 Pred 96 ms Sync 0 ms | Sent 0 kB Recv 0 kB | by
🔶 Pred 96 ms Sync 0 ms | Sent 0 kB Recv 0 kB | pert
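A sketch of a command that could reproduce this comparison, assuming the model was fetched with launch.py as in the Colab recipe below (prompt and step count here are illustrative):
./dllama inference --prompt "Tensor parallelism is all you need" --steps 64 \
  --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
  --buffer-float-type q80 --gpu-index 0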
0.13.4
0.13.3
0.13.2
This version fixes Vulkan support on NVIDIA GPUs. It was successfully executed on Google Colab with an NVIDIA T4 GPU.
How to run Distributed Llama on Google Colab?
# install drivers
!wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc | sudo tee /etc/apt/trusted.gpg.d/lunarg.asc
!sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-1.4.309-jammy.list https://packages.lunarg.com/vulkan/1.4.309/lunarg-vulkan-1.4.309-jammy.list
!sudo apt update
!sudo apt install -y vulkan-sdk libnvidia-gl-525
# check
!nvidia-smi
!vulkaninfo | grep "GPU id"
# install
!git clone https://github.com/b4rtaz/distributed-llama.git
!cd distributed-llama && rm -rf *.o && rm -rf src/nn/vulkan/*.spv && DLLAMA_VULKAN=1 make dllama
# model
!cd distributed-llama && python3 launch.py llama3_1_8b_instruct_q40
# run
!cd distributed-llama && ./dllama inference --prompt "Tensor parallelism is all you need" --steps 128 \
--model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t \
--buffer-float-type q80 --max-seq-len 4096 --nthreads 1 --gpu-index 0
0.13.1
0.13.0 🌋
This version introduces experimental GPU support using Vulkan. While Vulkan integration is still a work in progress - especially regarding shader performance - this marks the first step toward full GPU support.
How to build Distributed Llama with Vulkan support?
DLLAMA_VULKAN=1 make dllama
To run Distributed Llama with Vulkan, please add the --gpu-index 0 argument. For example:
./dllama inference ... --gpu-index 0
Please ensure that the Vulkan SDK is installed on your machine. You can run the following command to check:
vulkaninfo
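If vulkaninfo lists more than one device, --gpu-index presumably selects among them; the non-zero index below is illustrative, not taken from these notes:
./dllama inference ... --gpu-index 1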
0.12.8
This version extends the metrics in inference mode.
...
💿 Weights loaded
Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing
🔷️ Eval 534 ms Sync 100 ms | Sent 6912 kB Recv 12540 kB | (24 tokens)
🔶 Pred 68 ms Sync 25 ms | Sent 288 kB Recv 522 kB | them
🔶 Pred 58 ms Sync 15 ms | Sent 288 kB Recv 522 kB | with
🔶 Pred 57 ms Sync 11 ms | Sent 288 kB Recv 522 kB | TP
🔶 Pred 43 ms Sync 18 ms | Sent 288 kB Recv 522 kB | .
...
🔶 Pred 47 ms Sync 15 ms | Sent 288 kB Recv 522 kB | used
🔶 Pred 52 ms Sync 32 ms | Sent 288 kB Recv 522 kB | in
🔶 Pred 42 ms Sync 11 ms | Sent 288 kB Recv 522 kB | deep
🔶 Pred 44 ms Sync 10 ms | Sent 288 kB Recv 522 kB | learning
Evaluation
nBatches: 32
nTokens: 24
tokens/s: 37.83 (26.43 ms/tok)
Prediction
nTokens: 40
tokens/s: 16.10 (62.10 ms/tok)
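The summary rates appear to include the sync time, not just the raw eval/pred time; a quick sanity check of that reading against the numbers above (this interpretation is an inference from the printed values, not documented behavior):
# (534 ms eval + 100 ms sync) / 24 tokens ≈ 26.4 ms/tok ≈ 37.9 tok/s, close to the reported 37.83
python3 -c "ms = (534 + 100) / 24; print(ms, 1000 / ms)"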