Replies: 9 comments 25 replies
-
@ubergarm you may just want to note in the math that bpw is converted from bits to bytes by your constant term (/8). Initially confused me. Otherwise great write up!
-
@ubergarm, great work! It seems the Intel AMX team is implementing fp8 support; if so, the hardware is really promising. I am curious about the concurrent performance.
-
Nice writeup, I too have access to a similar system for a bit. I did not have much luck getting DeepSeek R1 to convert to GGUF or I would just hand you my numbers. However, have you tried building llama.cpp with the oneAPI Intel compiler? It will use oneMKL as the BLAS backend. Steps are:
source /opt/intel/oneapi/setvars.sh
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON -DCMAKE_INSTALL_PREFIX=~/llama_build/ -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON -DGGML_CCACHE=OFF
cmake --build build --config Release
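Not part of the original comment, but as a minimal sketch of how one might check the resulting oneMKL build with llama-bench (model path and thread count are placeholders):

```bash
# run a quick pp/tg benchmark against the oneAPI build; compare with a stock build of the same commit
source /opt/intel/oneapi/setvars.sh
./build/bin/llama-bench -m /models/DeepSeek-R1-UD-Q2_K_XL.gguf -t 64 -p 512 -n 128
```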
-
Some time ago I started a similar discussion with numbers for AMD Epyc: #11733. I identified several areas where performance is reduced because of NUMA and got discouraged by the effort needed to straighten this up (I guess I'm not the first one).
-
I've experienced similar results running DeepSeek R1 (unsloth 2.22-bit). Regarding the comment about SMT being faster than having it disabled: this was a property of Intel CPUs, and the opposite happened with AMD. For details on this check the paper "Placement of Virtual Containers on NUMA systems: A Practical and Comprehensive Model".
Now... I'd love to be able to have both NUMA nodes working, since using them both produces fewer tokens/s than using a single node with half the processes. Right now I found the best performance by disabling ALL CUDA devices, using numactl to use a single node, turning off the CPU security mitigations, setting the K cache type to fp16, and disabling autobalancing (still didn't try adding interleave). On my hardware I'm getting around 2.3 tokens/s. Today I've also had to disable mmap, otherwise the IO halts the entire process (this wasn't the case yesterday; I need to investigate). My hardware:
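The hardware list above did not survive this export. As a hedged sketch of the kind of single-node launch described in that comment (binary, model path, node number, and thread count are placeholders, not the commenter's exact command):

```bash
# pin both CPU cores and memory allocations to a single NUMA node,
# keep the K cache in f16, and disable mmap
numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server \
  -m /models/DeepSeek-R1-UD-Q2_K_XL.gguf \
  -t 48 --cache-type-k f16 --no-mmap
```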
-
hello:
-
There are some OS kernel tweaks that might be worth trying:
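The commenter's actual list did not survive this export. As an assumption-labeled example, these are the kinds of NUMA-related knobs that come up elsewhere in this thread (e.g. "disabling autobalancing"), not necessarily the tweaks meant here:

```bash
# turn off automatic NUMA balancing (kernel page migration between nodes)
echo 0 | sudo tee /proc/sys/kernel/numa_balancing

# inspect transparent hugepage policy before experimenting with it
cat /sys/kernel/mm/transparent_hugepage/enabled
```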
-
The AMX tile config is here in llama.cpp. If the tensor op type is GGML_OP_MUL_MAT, it will be invoked on Intel AMX supported platforms.
-
hello:
What's more, llama.cpp seems more sophisticated in its DRAM handling; it seems to prefer using the DRAM cache instead of direct DRAM injection.
-
Intel Xeon performance on R1 671B quants?
Last Updated On: Tue Mar 18 12:11:53 AM EDT 2025
tl;dr:
UPDATE: Fastest CPU only benchmarks to date are with FlashMLA-2 and other optimizations on ik_llama.cpp fork.
UPDATE: Interesting post regarding AMX optimizations and DeepSeek-R1.
UPDATE: Definitely check out @fairydreaming's deep dive on similar issues here.
llama.cpp seems to run best with all memory in a single NUMA node as of Q1 2025. So configure in BIOS a single NUMA node per CPU socket and only use a single CPU socket, e.g. `SNC=Disable` on 6th generation Intel Xeon. If you have an AMD Epyc that supports `NPS0`, use that for best performance.
Eventually, if you have enough RAM to hold the entire model twice, you can use data parallel to load the weights duplicated into each CPU socket's single NUMA node (see ktransformers for that, and possibly a llama.cpp experimental branch coming).
Otherwise, using multiple CPU sockets potentially degrades performance on single-inference workloads due to cross-NUMA access latency and bandwidth bottlenecks.
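A quick way to confirm the resulting topology after changing SNC/NPS in the BIOS (a sketch; the exact node count reported depends on your platform):

```bash
# with SNC=Disable / NPS1 this should report one NUMA node per populated socket
numactl --hardware
lscpu | grep -i numa
```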
Overview
I have limited access to some fairly new high-end Intel Xeon servers, including a dual-socket 6980P (Level1Techs YouTube 6980P Review) and hopefully soon the recently available 6787P (Level1Techs YouTube 6787P Review).
As there is no GPU installed on this specific dual 6980P rig, I am skipping testing ktransformers for now and testing llama.cpp CPU-only inference of various R1 671B GGUF quants.
I'm curious if others have any tips on how to improve performance for similar configurations specifically newer Intel Xeon CPUs with AMX extensions in dual or single socket configurations.
Especially how is the best way to take advantage of both CPU sockets simultaneously?
6980P Benchmarks
Here are the high level results of my initial llama-bench testing for token generation. Methodology details and discussion provided below.
Default BIOS `SNC=Auto/Enable` for 3x NUMA nodes per CPU socket, then after setting BIOS `SNC=Disable` (basically the same as AMD Epyc's `NPS1`), 1x NUMA node per CPU socket. Compared builds: llama.cpp@ba765438 and ik_llama.cpp@f2fb15de.
[benchmark charts not reproduced in this export]
Related Issues
Potentially related issues include:
ktransformers
The ktransformers project is doing some interesting things specific to Intel Xeon optimizations:
The `USE_NUMA=1` flag seems to copy the entire model weights into memory twice (once for each CPU socket?), presumably to alleviate cross-socket UPI link bottlenecks?
Theory
Assuming memory bandwidth is the limiting factor and not a CPU bottleneck, the theoretical maximum token generation speed can be calculated with:
Formula
Definitions
Example Calculation
For 225 GB/s aggregate RAM bandwidth running Q2@2.51 bits-per-weight quantization:
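The formula block itself is not preserved in this export; below is a hedged reconstruction from the surrounding discussion, assuming DeepSeek-R1's roughly 37B active parameters per token (the /8 converts bits-per-weight to bytes, as noted in the first reply above):

$$
\text{tok/s}_{\max} \approx \frac{\text{bandwidth}}{N_{\text{active}} \times \text{bpw} / 8}
\approx \frac{225\ \text{GB/s}}{37\text{B} \times 2.51 / 8\ \text{bytes per weight}} \approx 19\ \text{tok/s}
$$

This is a theoretical upper bound; the measured 5 to 8 tok/s results below sit within a factor of a few of it.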
Discussion
Some thoughts, musings, and wild speculations:
- `SNC=Disable` on newer Intel Xeon 6th Generation CPUs for 1x NUMA node per CPU socket.
- Maybe some kind of `-ot exps=CPU` stuff for NUMA nodes? lol...
- `numactl --interleave=0,1,2` and `--numa numactl` gives only about 1.3x better performance over a single node.
- The RPC backend's `send()` calls make it very, very slow (see the RPC experiments below).
- `load_tensors: tensor 'token_embd.weight' (q8_0) (and 54 others) cannot be used with preferred buffer type AMX, using CPU instead`
- `amx_tile` / `amx_int8` may benefit from an `int8` block-wise quant.
Conclusion
A few things learned along the way:
- `--numa distribute` does not seem to honor `numactl` nor `taskset` processor affinity, and you may end up with CPU cores on nodes other than the memory nodes. So you might want to use `--numa numactl` and double check with `numastat` and `btop` to make sure cores and memory are allocating how you expect (see the sketch below).
- With `-mmap 0`, pay attention to which NUMA nodes get used, as it might not distribute evenly or optimally.
- `tg 5.43 ± 0.02 @ 108 threads` with the default CPU build vs `tg 5.63 ± 0.02 @ 108 threads` with AMX explicitly enabled as shown above.
Cheers and thanks for your time! Good luck to everyone in the quest for more tok/sec!
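As referenced in the first point above, a minimal sketch of that kind of pinned llama-bench run plus the affinity check (model path, node number, and thread count are placeholders):

```bash
# pin llama-bench to one NUMA node and tell llama.cpp to respect the numactl policy
numactl --cpunodebind=0 --membind=0 \
  ./build/bin/llama-bench -m /models/DeepSeek-R1-UD-Q2_K_XL.gguf \
  -t 43 --numa numactl -mmap 0

# in another shell: confirm memory actually stayed on the expected node
numastat -p "$(pidof llama-bench)"
```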
Methodology and Notes
Click the arrow to open the fold filled with benchmarking logs and notes.
System Information
Memory Benchmarks
Compile AMX Extensions
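The contents of this fold are not preserved in the export. As a rough sketch only: one plausible way to explicitly enable the AMX paths is via ggml's CMake options (this flag set is an assumption, not the exact invocation from the original notes):

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release \
  -DGGML_AVX512=ON -DGGML_AVX512_VNNI=ON -DGGML_AVX512_VBMI=ON -DGGML_AVX512_BF16=ON \
  -DGGML_AMX_TILE=ON -DGGML_AMX_INT8=ON -DGGML_AMX_BF16=ON
cmake --build build --config Release -j
```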
Single NUMA Node `UD-Q2_K_XL` Benchmarking, Stock Compiler
Results
build: a800ae46 (4783)
Single NUMA Node `UD-Q2_K_XL` Benchmarking, Intel Base Kit Compiler
No improvements here, and pp512 regressions over compiling the AMX way or even the defaults. Likely indicates a memory bandwidth bottleneck and not CPU performance...? Need flame graphs lol...
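Since the note above asks for flame graphs, here is a hedged sketch of one way to capture them with perf and Brendan Gregg's FlameGraph scripts (assumes the FlameGraph repo is cloned alongside; not part of the original methodology):

```bash
# sample the running benchmark for 30 seconds with call stacks
perf record -F 99 -g -p "$(pidof llama-bench)" -- sleep 30

# fold the stacks and render an SVG flame graph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > llama-bench-flame.svg
```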
Results
build: a800ae46 (4783)
Single Socket `UD-Q2_K_XL` Benchmarking
Results
NOTE: I deleted the first pp/tg warm-up runs which used 128 threads to distribute memory across nodes.
build: a800ae46 (4783)
(more threads was worse in other similar benchmarks I've run, so I didn't search that space)
NOTE: This screenshot was before I switched from `--numa distribute` to `--numa numactl`, and you can see CPU cores active on the opposite socket from where the memory was allocated, causing even worse performance.
Single Socket `Q4_K_M` Benchmarking
Results
build: a800ae46 (4783)
Dual Socket `Q4_K_M` Benchmarking, Take 1
Take 2 below turned out slightly faster.
Results
build: a800ae46 (4783)
Dual Socket `Q4_K_M` Benchmarking, Take 2
This turned out slightly faster than Take 1 above.
Results
Single Socket `Q8_0` Benchmarking
Results
build: a800ae46 (4783)
Dual Socket `Q8_0` Benchmarking
Results
build: a800ae46 (4783)
Intel oneAPI Base Toolkit
RPC Experiment Take 1
I managed to use the experimental llama.cpp RPC backend feature. Essentially I started up a process on one CPU socket with 200GB RAM available. Then I started llama-server on the other CPU socket, specifying the RPC endpoint as a `--device` so it acts like a GPU. Then you "offload" layers to the RPC device. This was nice in that I could keep all processing local to a single NUMA node on each CPU socket. However, the performance was much worse, likely due to synchronous stages stalling and barely utilizing CPU cores, as described in this github issue comment. It got maybe 2 tok/sec after fully warmed up... :oof:
UPDATE: I realized the rpc-server uses `GGML_DEFAULT_N_THREADS`, which was set to 4. Wrote and tested a patch to force the CPU backend and specify threads.
UPDATE 2: I made the patches and tried more stuff over in this github issue comment, but still no better...
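For context, a rough sketch of the kind of two-process RPC setup described above (host, port, layer split, and paths are placeholders, not the exact commands used):

```bash
# socket 1: expose its CPU and memory as an RPC backend
numactl --cpunodebind=1 --membind=1 ./build/bin/rpc-server -H 127.0.0.1 -p 50052

# socket 0: main llama-server, "offloading" some layers to the RPC device
numactl --cpunodebind=0 --membind=0 ./build/bin/llama-server \
  -m /models/DeepSeek-R1-Q4_K_M.gguf \
  --rpc 127.0.0.1:50052 -ngl 30
```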
RPC Experiment Take 2
Trying to keep MoE block processing and memory in a single NUMA node, but this experiment got only 2 tok/sec.
RPC Experiment Take 3
`SNC=Disable` Mode
This will give us 1x NUMA node per CPU socket. All following tests done on build: e0331326 (4696), which is essentially `sl/custom-tensor-offload` and `ug/rpc-numa-cpu-backend` compiled with RPC enabled, but not actually using any of those features.
Memory Benchmarks
1x CPU `UD-Q2_K_XL`
Results
ik_llama.cpp
Results
2x CPU `UD-Q2_K_XL`
Results
1x CPU `Q4_K_M`
Results
| model                          |       size |     params | backend    | ngl | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 64 | pp512 | 24.40 ± 4.96 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 64 | tg128 | 6.95 ± 0.33 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 43 | pp512 | 25.41 ± 0.70 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 43 | tg128 | 5.88 ± 0.04 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 64 | pp512 | 33.83 ± 0.50 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 64 | tg128 | 7.30 ± 0.02 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 86 | pp512 | 39.24 ± 0.94 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 86 | tg128 | 7.82 ± 0.02 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 128 | pp512 | 46.84 ± 1.46 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 128 | tg128 | 7.16 ± 0.02 |
ik_llama.cpp
Results
2x CPU `Q4_K_M`
Results
TODO
1x CPU `Q8_0`
Results
ik_llama.cpp
Results
2x CPU `Q8_0`
Results
TODO