TLDR: Replicate the model on each NUMA node. On my platform, pure CPU inference of QwQ-32B FP16 improved from ~6.6 token/s to ~10.7 token/s, and DeepSeek R1 671B Q8 from ~7.2 token/s to ~9.7 token/s. You can find the modified llama.cpp version here.
On a dual-socket system, cross-NUMA memory access is extremely slow.
During LLM inference, the model weights themselves are the biggest consumer of memory bandwidth.
To make full use of the multi-CPU platform's total memory bandwidth:
We replicate one copy of the model weights on each available NUMA node
Then always compute against the local copy (i.e., the replica that belongs to the current thread's NUMA node)
With two nodes, this theoretically doubles the usable bandwidth compared to a single NUMA node.
Tensor data addresses are stored in tensor->data. To let each NUMA node access its own copy, we replace the single pointer with an array, tensor->__data[2] (assuming two NUMA nodes).
When setting tensor->data:
Assign each NUMA node the address of its own memory region (see the struct sketch below).
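A minimal sketch of what the per-node pointer layout could look like; tensor_mirrored and NUMA_MIRROR_NODES are illustrative names, not the identifiers used in the actual patch:

```cpp
// Sketch only: a ggml_tensor-like layout with one data pointer per NUMA node
// instead of a single pointer.
#define NUMA_MIRROR_NODES 2

struct tensor_mirrored {
    // __data[0] points into the node-0 replica of the weights,
    // __data[1] points into the node-1 replica.
    void * __data[NUMA_MIRROR_NODES];
    // ... the remaining ggml_tensor fields (type, ne[], nb[], op, ...) stay unchanged
};
```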
How do we know the per-NUMA memory locations?
We use Linux mmap's ability to request a specific virtual address.
During the mapping phase, we reserve a dedicated virtual address region for each node's model copy (a sketch follows the example below).
Example:
NUMA node 0: base address 0x200000000000
NUMA node 1: base address 0x400000000000
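A simplified sketch of the per-node mapping, assuming Linux with libnuma (link with -lnuma). It copies the weights into node-bound anonymous memory at a fixed virtual base; the real patch works with the mapped model file, and map_model_replica plus the base constants are illustrative:

```cpp
#include <sys/mman.h>
#include <numa.h>        // numa_tonode_memory
#include <cstring>
#include <cstdint>

static constexpr uintptr_t NODE_BASE[2] = { 0x200000000000ULL, 0x400000000000ULL };

void * map_model_replica(const void * weights, size_t size, int node) {
    void * want = reinterpret_cast<void *>(NODE_BASE[node]);
    // MAP_FIXED_NOREPLACE fails instead of silently clobbering an existing mapping.
    void * p = mmap(want, size, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    if (p == MAP_FAILED) {
        return nullptr;
    }
    numa_tonode_memory(p, size, node);   // bind the pages to this node
    std::memcpy(p, weights, size);       // fault in + fill the node-local copy
    return p;
}
```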
The assignment logic is then:
If a given data pointer falls within the [0x200000000000, 0x400000000000) range,
__data[0] keeps the original value, while __data[1] = original pointer + (the offset between the two nodes' base addresses), as in the sketch below.
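A sketch of that translation; mirror_ptr and the base constants are illustrative values taken from the example above:

```cpp
#include <cstdint>

static constexpr uintptr_t NODE0_BASE = 0x200000000000ULL;
static constexpr uintptr_t NODE1_BASE = 0x400000000000ULL;

static inline void * mirror_ptr(void * p, int node) {
    uintptr_t addr = reinterpret_cast<uintptr_t>(p);
    if (node == 1 && addr >= NODE0_BASE && addr < NODE1_BASE) {
        // inside the mirrored node-0 region: shift by the distance between the two bases
        return reinterpret_cast<void *>(addr + (NODE1_BASE - NODE0_BASE));
    }
    // node 0, or memory outside the mirrored region (activations, KV cache, ...): unchanged
    return p;
}
```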
When tensor->data is accessed at runtime,
the thread's current NUMA node ID is read from its thread-local storage (TLS).
We also bind each thread to a specific core/NUMA node (sketched below).
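A sketch of the binding plus TLS caching, again assuming Linux with libnuma; bind_worker_thread and g_numa_node are illustrative names:

```cpp
#include <numa.h>        // numa_run_on_node, numa_node_of_cpu
#include <sched.h>       // sched_getcpu

// Each worker thread caches its NUMA node id so tensor data reads can pick
// the local replica without a syscall on every access.
static thread_local int g_numa_node = 0;

void bind_worker_thread(int node) {
    numa_run_on_node(node);   // restrict this thread to CPUs of `node`
    // cache the node id for fast lookup; it could also be derived at runtime
    // via numa_node_of_cpu(sched_getcpu())
    g_numa_node = node;
}
```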
To implement this:
Modify every tensor->data access in the codebase
Create helper functions (sketched after this list):
tensor_data() returns the address of the replica local to the calling thread's NUMA node
tensor_set_data() populates all copies when setting a value
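A sketch of the two helpers, built on the mirror_ptr() and g_numa_node pieces sketched above; the actual signatures in the patch may differ:

```cpp
static inline void * tensor_data(const struct tensor_mirrored * t) {
    // read through the replica that is local to the calling thread's NUMA node
    return t->__data[g_numa_node];
}

static inline void tensor_set_data(struct tensor_mirrored * t, void * data) {
    // populate every per-node slot so later reads resolve locally
    for (int node = 0; node < NUMA_MIRROR_NODES; ++node) {
        t->__data[node] = mirror_ptr(data, node);
    }
}
```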
This requires changing about 700 lines of code.
Testing platform: 9275F × 2 + DDR5-6000 MT/s × (2 × 12 channels)
Models used: QwQ-32B FP16, DeepSeek R1 671B Q8
Codebase commit: 1e2f78a00450593e2dfa458796fcdd9987300dfc
Test scenarios:
Scenario 1 - Single NUMA mode: the BIOS presents all memory as a single unified node, with data spread across both physical nodes
QwQ-32B FP16: generate = 1085, speed = 6.66 token/s, power = 798 W
DeepSeek R1 671B Q8: generate = 1022, speed = 7.19 token/s, power = 753 W
Scenario 2 - Two NUMA nodes with numactl --interleave=0,1
QwQ-32B FP16: generate = 1399, speed = 6.82 token/s, power = 806 W
DeepSeek R1 671B Q8: generate = 1056, speed = 7.23 token/s, power = 728 W
Scenario 3 - The new GGML_NUMA_MIRROR scheme proposed above
QwQ-32B FP16: generate = 1344, speed = 10.80 token/s, power = 884 W
DeepSeek R1 671B Q8: generate = 1084, speed = 9.67 token/s, power = 762 W
Here's the code for you to try out: vproxy-tools/llama.cpp
The tensor->data modifications are stored here.