
Not able to see scaling performance with NUC (12th Gen) with deepseek_r1_distill_llama_8b_q40 #179

Open
deepaks2 opened this issue Mar 3, 2025 · 7 comments

deepaks2 commented Mar 3, 2025

I am trying to reproduce the results on NUCs, but the number of tokens/sec drops when I add more nodes. Any help?

System:
4x NUC (12th Gen) with AVX2 support.

1x NUC (12th Gen) with AVX2 support. -->
./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77
Evaluation
nBatches: 32
nTokens: 7
tokens/s: 14.96 (66.86 ms/tok)
Prediction
nTokens: 70
tokens/s: 5.51 (181.43 ms/tok)

2x NUC (12th Gen) with AVX2 support. -->
./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77 --workers 10.10.10.2:9998

Evaluation
nBatches: 32
nTokens: 7
tokens/s: 9.25 (108.14 ms/tok)
Prediction
nTokens: 70
tokens/s: 5.96 (167.91 ms/tok)

4x NUC (12th Gen) with AVX2 support. -->
./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77 --workers 10.10.10.2:9998 10.10.10.4:9998 10.10.10.5:9998

Evaluation
nBatches: 32
nTokens: 7
tokens/s: 6.74 (148.29 ms/tok)
Prediction
nTokens: 70
tokens/s: 5.02 (199.27 ms/tok)
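To put the prediction numbers above in perspective, a quick sketch (my own calculation, using only the tokens/s figures reported above) of speedup and per-node efficiency relative to the single-node run:

```shell
# Speedup and per-node efficiency from the Prediction tokens/s above.
base=5.51  # 1-node tokens/s
for pair in "1 5.51" "2 5.96" "4 5.02"; do
  set -- $pair
  awk -v n="$1" -v tps="$2" -v base="$base" 'BEGIN {
    printf "%d node(s): %.2f tok/s, speedup %.2fx, efficiency %.0f%%\n",
           n, tps, tps/base, 100 * tps / (base * n)
  }'
done
```

This shows 2 nodes give only a 1.08x speedup (54% efficiency) and 4 nodes are actually slower than 1 (0.91x, 23% efficiency).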

Any help here? Is this expected?

deepaks2 changed the title from "Not able see Scaling performance with Intel NuC (12th Gen) with deepseek_r1_distill_llama_8b_q40" to "Not able see Scaling performance with NuC (12th Gen) with deepseek_r1_distill_llama_8b_q40" on Mar 3, 2025
D-i-t-gh commented Mar 3, 2025

How did you start the workers?

deepaks2 commented Mar 4, 2025

On each worker, I ran "./dllama worker --port 9998 --nthreads 8".
On the root node: ./dllama inference --model models/deepseek_r1_distill_llama_8b_q40/dllama_model_deepseek_r1_distill_llama_8b_q40.m --tokenizer models/deepseek_r1_distill_llama_8b_q40/dllama_tokenizer_deepseek_r1_distill_llama_8b_q40.t --buffer-float-type q80 --nthreads 8 --max-seq-len 4096 --prompt "What is 5+9?" --steps 77 --workers 10.10.10.2:9998 10.10.10.4:9998 10.10.10.5:9998

b4rtaz (owner) commented Mar 4, 2025

Hello @deepaks2,

Please upgrade Distributed Llama to 0.12.8 and post the logs from inference mode here. This version shows the time needed for inference and for synchronization.

deepaks2 commented Mar 5, 2025

@b4rtaz Thanks, I will share the details.

deepaks2 commented Mar 5, 2025

@b4rtaz Please find the logs

2x NUC (12th Gen) with AVX2 support. -->

[screenshot: inference logs for 2 nodes]

4x NUC (12th Gen) with AVX2 support. -->

[screenshot: inference logs for 4 nodes]

All four NUCs are connected via a switch.

b4rtaz (owner) commented Mar 5, 2025

It seems that synchronization over Ethernet is very slow. Maybe you should try connecting the two devices directly without a router and compare the results. If I see correctly, the NUC 12th Gen should have 2.5G Ethernet. Thunderbolt 4 can also be used for networking, but it is not easy to configure (I haven't tried it myself).
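One way to verify whether the link itself is the bottleneck (my suggestion, not from the thread; assumes iperf3 is installed on both machines, and uses the worker address 10.10.10.2 from the commands above):

```shell
# Hypothetical check of raw link throughput between root node and a worker.
# On the worker (10.10.10.2), start a server first:
#   iperf3 -s
# Then, on the root node, run a 10-second throughput test:
iperf3 -c 10.10.10.2 -t 10
# A healthy 2.5GbE link should report roughly 2.3-2.4 Gbits/sec;
# around 940 Mbits/sec suggests the link only negotiated 1GbE
# (e.g. a 1GbE switch port or cable in the path).
```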

deepaks2 commented Mar 6, 2025

Thanks @b4rtaz. I tried connecting two devices directly without a router, and the results are slightly better: an improvement of about 1 token/sec.

[screenshot: inference logs for direct connection]

The results are only slightly better: 5.98 tokens/sec (via router) vs 6.27 tokens/sec (direct), and only about a 10 ms difference in sync time.
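That works out to under a 5% gain (a quick check with awk, using the two figures above):

```shell
# Percentage improvement from 5.98 tok/s (via router) to 6.27 tok/s (direct).
awk 'BEGIN { printf "%.1f%% faster\n", 100 * (6.27 - 5.98) / 5.98 }'
# → 4.8% faster
```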
