-
It is using the same strategy as with multiple GPUs -- the ggml scheduler splits model layers across devices (local GPUs or RPC servers).
The assumption that adding more RPC nodes will give you more tok/sec is wrong. Using RPC is beneficial when the model doesn't fit in the memory of the main host. As far as I understand, this is not the case with your setup -- your model is ~140GB and you have 390GB on a single host.
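For context, here is a minimal sketch of how that layer splitting is typically wired up over RPC (hostnames, port, and model path are placeholders; exact flags can vary between llama.cpp builds):

```bash
# On each worker node: expose local compute and memory over TCP
# (requires llama.cpp built with GGML_RPC=ON).
rpc-server --host 0.0.0.0 --port 50052

# On the main host: list the workers with --rpc; the ggml scheduler then
# distributes model layers across the listed RPC devices the same way it
# splits layers across local GPUs.
llama-cli -m ./model.gguf \
  --rpc "node01:50052,node02:50052" \
  -ngl 99 -p "Hello"
```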
-
I'm currently implementing multi-node CPU inference for DeepSeek-R1 using llama.cpp, but I'm encountering unexpectedly poor performance of only 0.06 tokens/second.
Configuration Details:
Hardware: 20 nodes, each with 196 CPUs and 390GB memory.
During my experiment, the CPU and memory usage of the RPC servers were very low.
Test Command:
llama-bench -m /scratch/feic/pjs/DeepSeek-CPU-Inference/models/DeepSeek-R1-UD-IQ1_S.gguf -t 196,392
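For reference, an RPC-enabled benchmark run would look roughly like the sketch below (node names are placeholders, and --rpc is only available when llama.cpp is built with the RPC backend):

```bash
# Benchmark with remote RPC workers attached; layers are offloaded to the
# listed servers instead of being kept entirely on the local host.
llama-bench -m /scratch/feic/pjs/DeepSeek-CPU-Inference/models/DeepSeek-R1-UD-IQ1_S.gguf \
  --rpc "node01:50052,node02:50052" -t 196
```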
Key Questions:
Where can I find the RPC-related implementation in the codebase?
What parallelism strategy does the RPC feature employ (model parallelism vs. computation offload)?
Given the significant hardware resources available, the current performance seems abnormally low. I'd appreciate insights into potential bottlenecks or configuration issues that might explain these results.