In this test, I ran Llama 3.1 8B and Llama 3.3 70B Instruct Q40 on the same machine with 2 x NVIDIA GeForce RTX 3090 GPUs. I performed two tests: one using a single GPU, and another using both GPUs with full tensor parallelism on a single machine.
Because Distributed Llama still has some issues with Vulkan warmup, I took five measurements from within the inference process to calculate the inference speed.
| Model | 1 x GeForce RTX 3090 24 GB | 2 x GeForce RTX 3090 24 GB |
| --- | --- | --- |
| Llama 3.1 8B Q40 - prediction | 17.8 tok/s | 25.5 tok/s |
| Llama 3.3 70B Instruct Q40 - prediction | not enough memory | 2.6 tok/s |
Distributed Llama was built with Vulkan support. How did I run it on 2 GPUs?
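For context, a single-machine run on 2 GPUs pairs a root node with one local worker, each pinned to its own Vulkan device. A minimal sketch follows; the build switch `DLLAMA_VULKAN=1`, the `--gpu-index` flag, and the model/tokenizer file names are assumptions and may differ between Distributed Llama versions, so treat this as an illustration rather than the exact commands from this test.

```sh
# Build with Vulkan support (assumed build switch; check the README of your version).
DLLAMA_VULKAN=1 make dllama

# Terminal 1: a worker pinned to the second GPU.
# --gpu-index is an assumed flag for selecting the Vulkan device.
./dllama worker --port 9999 --nthreads 4 --gpu-index 1

# Terminal 2: the root node on the first GPU, connected to the local worker,
# giving 2-way tensor parallelism on one machine.
./dllama inference \
  --model dllama_model_llama3_1_8b_instruct_q40.m \
  --tokenizer dllama_tokenizer_llama3_1.t \
  --buffer-float-type q80 \
  --prompt "Explain tensor parallelism" \
  --steps 64 \
  --nthreads 4 \
  --gpu-index 0 \
  --workers 127.0.0.1:9999
```

With this layout the two processes communicate over localhost, which is why the 2-GPU logs below show non-zero Sent/Recv traffic.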
1 GPU

Llama 3.1 8B Instruct Q40

```
🔶 Pred 56 ms Sync 0 ms | Sent 0 kB Recv 0 kB | The
🔶 Pred 56 ms Sync 0 ms | Sent 0 kB Recv 0 kB | tensor
🔶 Pred 56 ms Sync 0 ms | Sent 0 kB Recv 0 kB | parallel
🔶 Pred 56 ms Sync 0 ms | Sent 0 kB Recv 0 kB | ism
🔶 Pred 56 ms Sync 0 ms | Sent 0 kB Recv 0 kB | is
```
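In these logs, Pred is the per-token prediction time, Sync the inter-node synchronization time, and Sent/Recv the traffic exchanged with workers for each token; on a single GPU there are no workers, so the traffic is 0 kB.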
2 GPUs
Llama 3.1 8B Instruct Q40
```
🔶 Pred 38 ms Sync 4 ms | Sent 288 kB Recv 522 kB | However
🔶 Pred 40 ms Sync 3 ms | Sent 288 kB Recv 522 kB | ,
🔶 Pred 39 ms Sync 3 ms | Sent 288 kB Recv 522 kB | if
🔶 Pred 39 ms Sync 4 ms | Sent 288 kB Recv 522 kB | the
🔶 Pred 40 ms Sync 4 ms | Sent 288 kB Recv 522 kB | processor
```
Llama 3.3 70B Instruct Q40
```
🔶 Pred 385 ms Sync 29 ms | Sent 1392 kB Recv 1610 kB | Model
🔶 Pred 385 ms Sync 16 ms | Sent 1392 kB Recv 1610 kB | parallel
🔶 Pred 383 ms Sync 11 ms | Sent 1392 kB Recv 1610 kB | ism
🔶 Pred 389 ms Sync 8 ms | Sent 1392 kB Recv 1610 kB | refers
🔶 Pred 388 ms Sync 9 ms | Sent 1392 kB Recv 1610 kB | to
```
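As a sanity check, the Pred times line up with the table above: 1000 ms / 56 ms ≈ 17.9 tok/s on one GPU, 1000 ms / 39 ms ≈ 25.6 tok/s on two GPUs for the 8B model, and 1000 ms / 386 ms ≈ 2.6 tok/s for the 70B model.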
Note: this result was significantly improved in a later version of Distributed Llama.