-
It is using the same strategy as with multiple GPUs -- the ggml scheduler splits model layers across devices (local GPUs or RPC servers).
The assumption that adding more RPC nodes will give you more tok/sec is wrong. Using RPC is beneficial when the model doesn't fit in the memory of the main host. As far as I understand, this is not the case with your setup -- your model is ~140GB and you have 390GB on a single host.
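For context, here is a minimal sketch of how that layer splitting is typically wired up over RPC (hostnames, port, and model path are placeholders; exact flags can vary between llama.cpp builds):

```bash
# On each worker node: expose local compute and memory over TCP
# (requires llama.cpp built with GGML_RPC=ON).
rpc-server --host 0.0.0.0 --port 50052

# On the main host: list the workers with --rpc; the ggml scheduler then
# distributes model layers across the listed RPC devices the same way it
# splits layers across local GPUs.
llama-cli -m ./model.gguf \
  --rpc "node01:50052,node02:50052" \
  -ngl 99 -p "Hello"
```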
-
I'm currently implementing multi-node CPU inference for DeepSeek-R1 using llama.cpp, but I'm encountering unexpectedly poor performance of only 0.06 tokens/second.
Configuration Details:
Hardware: 20 nodes, each with 196 CPUs and 390GB memory.
During my experiment, the CPU and memory usage of the RPC servers were very low.
Test Command:
llama-bench -m /scratch/feic/pjs/DeepSeek-CPU-Inference/models/DeepSeek-R1-UD-IQ1_S.gguf -t 196,392
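For reference, an RPC-enabled benchmark run would look roughly like the sketch below (node names are placeholders, and --rpc is only available when llama.cpp is built with the RPC backend):

```bash
# Benchmark with remote RPC workers attached; layers are offloaded to the
# listed servers instead of being kept entirely on the local host.
llama-bench -m /scratch/feic/pjs/DeepSeek-CPU-Inference/models/DeepSeek-R1-UD-IQ1_S.gguf \
  --rpc "node01:50052,node02:50052" -t 196
```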
Key Questions:
Where can I find the RPC-related implementation in the codebase?
What parallelism strategy does the RPC feature employ (model parallelism vs. computation offload)?
Given the significant hardware resources available, the current performance seems abnormally low. I'd appreciate insights into potential bottlenecks or configuration issues that might explain these results.