Split model between NUMA nodes for CPU inference #12303
- You may check this one: #12289 The modification described there is already implemented and open-sourced.
- Also follow #11733
- And #12088
- Using the Windows performance counter "NUMA Node Memory\Available MBytes" I can monitor how the memory is allocated, and even with the NUMA strategy set to "Distribute" I can clearly see it filling up ONE node and only spilling over to the others once that node is full. However, if I use `_putenv("OMP_PROC_BIND=close");` to spread the threads evenly across all NUMA nodes, the context / KV cache gets allocated evenly as well; it is only the model weights that don't get distributed.
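  For reference, a minimal sketch (Windows + an OpenMP-enabled build; this is illustrative code, not anything from llama.cpp itself) of setting `OMP_PROC_BIND` from inside the process before the OpenMP runtime initializes, and then checking which processor and NUMA node each thread actually lands on:
  ```cpp
  // Illustrative sketch only, not llama.cpp code.
  // Set OMP_PROC_BIND before the first parallel region (the OpenMP runtime
  // reads it once, at initialization), then report each thread's placement.
  #include <cstdio>
  #include <cstdlib>
  #include <omp.h>
  #include <windows.h>

  int main() {
      // Must happen before the OpenMP runtime spins up its thread pool.
      _putenv("OMP_PROC_BIND=close");

      #pragma omp parallel
      {
          PROCESSOR_NUMBER pn{};
          USHORT node = 0;
          GetCurrentProcessorNumberEx(&pn);        // which logical CPU this thread is on
          GetNumaProcessorNodeEx(&pn, &node);      // which NUMA node that CPU belongs to
          std::printf("thread %2d -> group %u, cpu %u, NUMA node %u\n",
                      omp_get_thread_num(),
                      (unsigned)pn.Group, (unsigned)pn.Number, (unsigned)node);
      }
      return 0;
  }
  ```
  Comparing that output against the per-node "Available MBytes" counter makes it easy to see whether the pages a thread touches are actually local to its node.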
- If llama.cpp can split a model between GPUs, can it split a model between NUMA nodes for CPU inference to eliminate inter-node memory access and take full advantage of the available memory bandwidth?
I ask because I'm trying to optimize a llama.cpp build for a Xeon Max 9480, and regardless of what I enable in the build (OpenMP, AMX, NUMA, etc.) there are diminishing returns beyond a single tile's worth of threads (28). In theory, if inter-node memory access could be minimised or eliminated by splitting the model evenly and isolating the threads to the correct nodes, it could scale far better.
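  To make the question concrete, here is a rough sketch of what "splitting the model between NUMA nodes" would mean at the allocation level on Windows. This is not llama.cpp's actual allocator; `shard_bytes` and the per-node sharding scheme are invented purely for illustration. (On Linux the equivalent would be `numa_alloc_onnode` from libnuma, or first-touch allocation by threads already pinned to their node.)
  ```cpp
  // Conceptual sketch only: place each shard of the weights on the NUMA node
  // whose threads will read it, so matmuls never cross the inter-node link.
  #include <cstdio>
  #include <vector>
  #include <windows.h>

  int main() {
      ULONG highest = 0;
      GetNumaHighestNodeNumber(&highest);          // e.g. 3 on a 4-tile Xeon Max
      const ULONG nodes = highest + 1;

      const SIZE_T shard_bytes = 256ull << 20;     // hypothetical per-node shard size
      std::vector<void*> shards(nodes, nullptr);

      for (ULONG node = 0; node < nodes; ++node) {
          // Commit the shard with a preferred NUMA node, so its pages are backed
          // by that node's local memory rather than whichever node touches it first.
          shards[node] = VirtualAllocExNuma(GetCurrentProcess(), nullptr, shard_bytes,
                                            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                            node);
          std::printf("shard %lu -> %p (preferred node %lu)\n", node, shards[node], node);
      }

      // Worker threads would then be pinned to the matching node (e.g. via
      // GetNumaNodeProcessorMaskEx + SetThreadGroupAffinity) before touching
      // their shard, so every weight read stays node-local.
      for (void* p : shards)
          if (p) VirtualFree(p, 0, MEM_RELEASE);
      return 0;
  }
  ```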