Split model between NUMA nodes for CPU inference #12303
- You may check this one: #12289 The modification described there is already implemented and open-sourced.
- Also follow #11733
- And #12088
- Using the Windows performance counter "NUMA Node Memory\Available MBytes" I can monitor how the memory is allocated, and even with the NUMA strategy set to "Distribute" I can clearly see it filling up ONE node and only spilling over to the others once that node is full. However, if I use `_putenv("OMP_PROC_BIND=close");` to spread the threads evenly across all NUMA nodes, the context / KV cache gets allocated evenly as well; it is only the model weights that don't get distributed.
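  For reference, a minimal sketch (Windows + an OpenMP-enabled build; this is illustrative code, not anything from llama.cpp itself) of setting `OMP_PROC_BIND` from inside the process before the OpenMP runtime initializes, and then checking which processor and NUMA node each thread actually lands on:
  ```cpp
  // Illustrative sketch only, not llama.cpp code.
  // Set OMP_PROC_BIND before the first parallel region (the OpenMP runtime
  // reads it once, at initialization), then report each thread's placement.
  #include <cstdio>
  #include <cstdlib>
  #include <omp.h>
  #include <windows.h>

  int main() {
      // Must happen before the OpenMP runtime spins up its thread pool.
      _putenv("OMP_PROC_BIND=close");

      #pragma omp parallel
      {
          PROCESSOR_NUMBER pn{};
          USHORT node = 0;
          GetCurrentProcessorNumberEx(&pn);        // which logical CPU this thread is on
          GetNumaProcessorNodeEx(&pn, &node);      // which NUMA node that CPU belongs to
          std::printf("thread %2d -> group %u, cpu %u, NUMA node %u\n",
                      omp_get_thread_num(),
                      (unsigned)pn.Group, (unsigned)pn.Number, (unsigned)node);
      }
      return 0;
  }
  ```
  Comparing that output against the per-node "Available MBytes" counter makes it easy to see whether the pages a thread touches are actually local to its node.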
- If llama.cpp can split a model between GPUs, can it split a model between NUMA nodes for CPU inference to eliminate inter-node memory access and take full advantage of the available memory bandwidth?
I ask because I'm trying to optimize a llama.cpp build for a Xeon Max 9480, and regardless of what I enable in the build (OpenMP, AMX, NUMA, etc.) there are diminishing returns beyond a single tile's worth of threads (28). In theory, if inter-node memory access could be minimised or eliminated by splitting the model evenly and isolating the threads to the correct nodes, it could scale far better.
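  To make the question concrete, here is a rough sketch of what "splitting the model between NUMA nodes" would mean at the allocation level on Windows. This is not llama.cpp's actual allocator; `shard_bytes` and the per-node sharding scheme are invented purely for illustration. (On Linux the equivalent would be `numa_alloc_onnode` from libnuma, or first-touch allocation by threads already pinned to their node.)
  ```cpp
  // Conceptual sketch only: place each shard of the weights on the NUMA node
  // whose threads will read it, so matmuls never cross the inter-node link.
  #include <cstdio>
  #include <vector>
  #include <windows.h>

  int main() {
      ULONG highest = 0;
      GetNumaHighestNodeNumber(&highest);          // e.g. 3 on a 4-tile Xeon Max
      const ULONG nodes = highest + 1;

      const SIZE_T shard_bytes = 256ull << 20;     // hypothetical per-node shard size
      std::vector<void*> shards(nodes, nullptr);

      for (ULONG node = 0; node < nodes; ++node) {
          // Commit the shard with a preferred NUMA node, so its pages are backed
          // by that node's local memory rather than whichever node touches it first.
          shards[node] = VirtualAllocExNuma(GetCurrentProcess(), nullptr, shard_bytes,
                                            MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE,
                                            node);
          std::printf("shard %lu -> %p (preferred node %lu)\n", node, shards[node], node);
      }

      // Worker threads would then be pinned to the matching node (e.g. via
      // GetNumaNodeProcessorMaskEx + SetThreadGroupAffinity) before touching
      // their shard, so every weight read stays node-local.
      for (void* p : shards)
          if (p) VirtualFree(p, 0, MEM_RELEASE);
      return 0;
  }
  ```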