TLDR: Replicate the model on each NUMA node. On my platform, pure CPU inference of QwQ-32B FP16 improved from ~6.6 token/s to ~10.7 token/s, and DeepSeek R1 671B Q8 from ~7.2 token/s to ~9.7 token/s. You can find the modified llama.cpp version here.
On a dual-socket system, cross-NUMA memory access is extremely slow.
During LLM inference, the model weights themselves are the biggest consumer of memory bandwidth.
To make full use of the multi-CPU platform's total memory bandwidth:
We replicate one copy of the model weights on each available NUMA node
Then always compute against the local copy (i.e., the replica that belongs to the current thread's NUMA node)
With two nodes, this theoretically doubles the usable bandwidth compared to a single NUMA node.
Tensor data addresses are stored in tensor->data. To let each NUMA node access its own copy, we replace the single pointer with an array, tensor->__data[2] (assuming two NUMA nodes).
When setting tensor->data:
Assign each NUMA node the address of its own memory region (see the struct sketch below).
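A minimal sketch of what the per-node pointer layout could look like; tensor_mirrored and NUMA_MIRROR_NODES are illustrative names, not the identifiers used in the actual patch:

```cpp
// Sketch only: a ggml_tensor-like layout with one data pointer per NUMA node
// instead of a single pointer.
#define NUMA_MIRROR_NODES 2

struct tensor_mirrored {
    // __data[0] points into the node-0 replica of the weights,
    // __data[1] points into the node-1 replica.
    void * __data[NUMA_MIRROR_NODES];
    // ... the remaining ggml_tensor fields (type, ne[], nb[], op, ...) stay unchanged
};
```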
How do we know the per-NUMA memory locations?
We use Linux mmap's ability to request a specific virtual address.
During the mapping phase, we reserve a dedicated virtual address region for each node's model copy (a sketch follows the example below).
Example:
NUMA node 0: base address 0x200000000000
NUMA node 1: base address 0x400000000000
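A simplified sketch of the per-node mapping, assuming Linux with libnuma (link with -lnuma). It copies the weights into node-bound anonymous memory at a fixed virtual base; the real patch works with the mapped model file, and map_model_replica plus the base constants are illustrative:

```cpp
#include <sys/mman.h>
#include <numa.h>        // numa_tonode_memory
#include <cstring>
#include <cstdint>

static constexpr uintptr_t NODE_BASE[2] = { 0x200000000000ULL, 0x400000000000ULL };

void * map_model_replica(const void * weights, size_t size, int node) {
    void * want = reinterpret_cast<void *>(NODE_BASE[node]);
    // MAP_FIXED_NOREPLACE fails instead of silently clobbering an existing mapping.
    void * p = mmap(want, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED) {
        return nullptr;
    }
    numa_tonode_memory(p, size, node);   // bind the pages to this node
    std::memcpy(p, weights, size);       // fault in + fill the node-local copy
    return p;
}
```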
The assignment logic is then:
If a given data pointer falls within the [0x200000000000, 0x400000000000) range,
__data[0] keeps the original value, while __data[1] = original pointer + (the offset between the two nodes' base addresses), as in the sketch below.
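A sketch of that translation; mirror_ptr and the base constants are illustrative values taken from the example above:

```cpp
#include <cstdint>

static constexpr uintptr_t NODE0_BASE = 0x200000000000ULL;
static constexpr uintptr_t NODE1_BASE = 0x400000000000ULL;

static inline void * mirror_ptr(void * p, int node) {
    uintptr_t addr = reinterpret_cast<uintptr_t>(p);
    if (node == 1 && addr >= NODE0_BASE && addr < NODE1_BASE) {
        // inside the mirrored node-0 region: shift by the distance between the two bases
        return reinterpret_cast<void *>(addr + (NODE1_BASE - NODE0_BASE));
    }
    // node 0, or memory outside the mirrored region (activations, KV cache, ...): unchanged
    return p;
}
```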
When tensor->data is accessed at runtime,
the thread's current NUMA node ID is read from its thread-local storage (TLS).
We also bind each thread to a specific core/NUMA node (sketched below).
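A sketch of the binding plus TLS caching, again assuming Linux with libnuma; bind_worker_thread and g_numa_node are illustrative names:

```cpp
#include <numa.h>        // numa_run_on_node, numa_node_of_cpu
#include <sched.h>       // sched_getcpu

// Each worker thread caches its NUMA node id so tensor data reads can pick
// the local replica without a syscall on every access.
static thread_local int g_numa_node = 0;

void bind_worker_thread(int node) {
    numa_run_on_node(node);   // restrict this thread to CPUs of `node`
    // cache the node id for fast lookup; it could also be derived at runtime
    // via numa_node_of_cpu(sched_getcpu())
    g_numa_node = node;
}
```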
To implement this:
Modify every tensor->data access in the codebase
Create helper functions (sketched after this list):
tensor_data() returns the address of the replica local to the calling thread's NUMA node
tensor_set_data() populates all copies when setting a value
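A sketch of the two helpers, built on the mirror_ptr() and g_numa_node pieces sketched above; the actual signatures in the patch may differ:

```cpp
static inline void * tensor_data(const struct tensor_mirrored * t) {
    // read through the replica that is local to the calling thread's NUMA node
    return t->__data[g_numa_node];
}

static inline void tensor_set_data(struct tensor_mirrored * t, void * data) {
    // populate every per-node slot so later reads resolve locally
    for (int node = 0; node < NUMA_MIRROR_NODES; ++node) {
        t->__data[node] = mirror_ptr(data, node);
    }
}
```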
This requires changing about 700 lines of code.
Testing platform: 9275F × 2 + DDR5-6000 MT/s × (2 × 12 channels)
Models used: QwQ-32B FP16, DeepSeek R1 671B Q8
Codebase commit: 1e2f78a00450593e2dfa458796fcdd9987300dfc
Test scenarios:
Scenario 1 - Single NUMA mode: the BIOS presents all memory as a single unified node, with data spread across both physical nodes
QwQ-32B FP16: generate = 1085, speed = 6.66 token/s, power = 798 W
DeepSeek R1 671B Q8: generate = 1022, speed = 7.19 token/s, power = 753 W
Scenario 2 - Two NUMA nodes with numactl --interleave=0,1
QwQ-32B FP16: generate = 1399, speed = 6.82 token/s, power = 806 W
DeepSeek R1 671B Q8: generate = 1056, speed = 7.23 token/s, power = 728 W
Scenario 3 - The new GGML_NUMA_MIRROR scheme proposed above
QwQ-32B FP16: generate = 1344, speed = 10.80 token/s, power = 884 W
DeepSeek R1 671B Q8: generate = 1084, speed = 9.67 token/s, power = 762 W
Here's the code for you to try out: vproxy-tools/llama.cpp
The tensor->data modifications are stored here.