Intel Xeon performance on R1 671B quants? #12088

ubergarm · 2025-02-26T19:33:20Z

ubergarm
Feb 26, 2025

Intel Xeon performance on R1 671B quants?

Last Updated On: Tue Mar 18 12:11:53 AM EDT 2025

tl;dr;

UPDATE: Fastest CPU only benchmarks to date are with FlashMLA-2 and other optimizations on ik_llama.cpp fork.
UPDATE: Interesting post regarding AMX optimizations and DeepSeek-R1.
UPDATE: Definitely check out @fairydreaming's deep dive on similar issues here

llama.cpp seems to run best with all memory in a single NUMA Node as of Q1 2025.

So configure in BIOS a single NUMA node per CPU socket and only use a single CPU socket e.g. SNC=Disable on 6th generation Intel Xeon.

If you have AMD Epyc that supports NPS0, use that for best performance.

Eventually if you have enough RAM to hold the entire model twice, you can use data parallel to load weights duplicated for each CPU socket single NUMA node (see ktransformers for that, and possibly llama.cpp experimental branch coming).

Otherwise, using multiple CPU sockets potentially degrades performance on single inference workloads due to cross NUMA access latency and bandwidth bottleneck.

Overview

I have limited access to some fairly new high end Intel Xeon servers including a dual socket 6980P Level1Techs YouTube 6980P Review and hopefully soon the recently available 6787P Level1Techs YouTube 6787P Review.

As there is no GPU installed on this specific dual 6980P rig, I am skipping testing ktransformers for now, and testing llama.cpp CPU only inferencing of various R1 671B GGUF quants.

I'm curious if others have any tips on how to improve performance for similar configurations specifically newer Intel Xeon CPUs with AMX extensions in dual or single socket configurations.

Especially how is the best way to take advantage of both CPU sockets simultaneously?

6980P Benchmarks

Here are the high level results of my initial llama-bench testing for token generation. Methodology details and discussion provided below.

Default BIOS SNC=Auto/Enable for 3x NUMA Nodes per CPU Socket

Quantization	Tokens/Second	NUMA Configuration
Q2_K_XL	5.9	Single NUMA Node
Q2_K_XL	7.7	3 NUMA Nodes on 1x CPU
Q2_K_XL	~2	5+1 rpc-server nodes over loopback
Q4_K_M	6.9	3 NUMA Nodes on 1x CPU
Q4_K_M	<5.3 ?!?	6 NUMA Nodes across 2x CPU sockets
Q8_0	5.6	3 NUMA Nodes on 1x CPU
Q8_0	4.1 ?!?	6 NUMA Nodes across 2x CPU sockets

After setting BIOS SNC=Disable, basically same as AMD Epyc's NPS1, 1x NUMA Node per CPU Socket

Quantization	Tokens/Second	NUMA Configuration	Inference Engine
Q2_K_XL	8.9	1x NUMA Node on 1x CPU	`llama.cpp@?ba765438?`
Q2_K_XL	9.9	1x NUMA Node on 1x CPU	`ik_llama.cpp@f2fb15de`
Q2_K_XL	5.8 ?!?	2x NUMA Nodes on 2x CPUs	`llama.cpp@?ba765438?`
Q4_K_M	7.8	1x NUMA Node on 1x CPU	`llama.cpp@?ba765438?`
Q4_K_M	10.0	1x NUMA Node on 1x CPU	`ik_llama.cpp@f2fb15de`
Q4_K_M	meh...	2x NUMA Nodes on 2x CPUs
Q8_0	6.2	1x NUMA Node on 1x CPU	`llama.cpp@?ba765438?`
Q8_0	7.5	1x NUMA Node on 1x CPU	`ik_llama.cpp@f2fb15de`
Q8_0	meh...	2x NUMA Nodes on 2x CPUs

Related Issues

Potentially related issues include:

ktransformers

The ktransformers project is doing some interesting things specific to Intel Xeon optimizations:

The mysterious v0.3 binary whl supposedly has AMX compiler optimizations and possibly requires the bf16 model which seems to be online quantized into fp8 for CPU and fp4 for GPU inferencing at runtime?
The USE_NUMA=1 flag seems to copy the entire model weights into memory twice (once for each CPU socket?) presumably to alleviate cross socket UPI link bottlenecks?
Interesting new Hybrid Quantized Model with GGML quantized MoE Blocks for CPU inference and Attention/Shared-Expert FP8 quantization supported on NVIDIA 4090 or newer (not on 3090 😭) for closer to original fp8 quality performance.

Theory

Assuming memory bandwidth is the limiting factor and not CPU bottleneck, the theoretical maximum token generation speed can be calculated with:

Formula

$$TG_{max} = \frac{R_b}{P_a \times \left(\frac{bpw}{8}\right)}$$

Definitions

( $TG_{max}$ ): Theoretical Maximum Token Generation Speed (tokens/second)
( $R_b$ ): RAM Bandwidth (GB/s)
( $P_a$ ): MoE Parameters Activated (Billions (B)) – e.g., 37B for R1
( $bpw$ ): Quantization, bits per weight – e.g., Q4=4, Q2=2.51bpw. (converted to bytes using constant 8 in equation)

Example Calculation
For 225 GB/s aggregate RAM bandwidth running Q2@2.51 bits-per-weight quantization:

$$\frac{225}{\left(\frac{2.51}{8} \times 37\right)} \approx 19 \text{ tokens/sec}$$

Discussion

Some thoughts, musings, and wild speculations:

Is there a way to improve probability of a CPU thread executing on the same NUMA node as memory placement?
- The best way to run llama.cpp currently is as much RAM bandwidth crammed into a single NUMA node. Avoid multiple nodes and stay way far away from multiple CPU sockets for now.
- UPDATE Set BIOS to SNC=Disable on newer Intel Xeon 6th Generation CPUs for 1x NUMA Node per CPU socket.
Is the kv-cache malloc'd in a single node on the heap or any tricks to try there e.g. like -ot exps=CPU stuff for NUMA nodes? lol...
- Yes, however keeping kv-cache on same node with attention and shared experts is still slow for now with rcp-server.
Is cross numa latency within a single CPU causing issues given single node speed looks most efficient?
- Yes, using numactl --interleave=0,1,2 and --numa numactl gives only 1.3x better performance over single node.
Are there BIOS configurations for dual Intel Xeon rigs similar to AMD NPS0 mode? (I don't have access to BIOS)
- I don't think so, so if your AMD Epyc supports NPS0 it is probably the best solution for now for llama.cpp specifically.
- UPDATE Set BIOS to SNC=Disable on newer Intel Xeon 6th Generation CPUs for 1x NUMA Node per CPU socket.
Does Intel benefit from special OpenVINO, special BLAS libaries, or custom compilers for CPU only inference?
- Kind of maybe but now right now, but only for certain quantizations and likely requires hybrid GGUFs which are beyond me at this point.
Does this new SearchSavior/OpenArc project have potential for R1 671B?
- Cool project, will see where it goes.
Would it even make sense to start 2x copies of llama.cpp (one for each CPU socket) and link them with RPC (shmem?)
- KT and experimental llama.cpp versions offer data parallel = 2 configurations.
- I tried tensor parallel = 2 given llama rpc-server is not a true async tensor parallel implementation but simple synchronous send() calls it is very very slow.
Might need custom hybrid quant to take advantage of AMX load_tensors: tensor 'token_embd.weight' (q8_0) (and 54 others) cannot be used with preferred buffer type AMX, using CPU instead
- TODO test this, there are special int8 quants and the amx_tile amd_int8 may benefit from int8 block-wise quant.

Conclusion

A few things learned along the way:

Be careful as --numa distribute does not seem to honor numactl nor taskset processor affinity and you may end up with CPU cores on nodes other than memory nodes. So you might want to use --numa numactl and double check with numastat and btop to make sure cores and memory allocating how you expect.
Likewise be careful if you disable mmap e.g. -mmap 0 to pay attention which NUMA nodes get used as it might not distribute evenly or optimally.
The sweet spot for number of threads to run can vary quite a bit on different systems for pp and tg so search it out using llama-bench.
If you have double the RAM and dual socket use data parallel, and hopefully soon we will see tensor parallel and MoE parallel for CPU backends.
It seems like building with explicit AMX flags shown above does not effect much though possibly showed about 5% improvement for Q8: e.g. tg 5.43 ± 0.02 @ 108 threads with default CPU build vs tg 5.63 ± 0.02 @ 108 threads with AMX explicitly enabled as shown above.

Use the right (hybrid) quantized version optimized for your exact configuration and cpuflags and GPU combinations.

load_tensors: tensor 'token_embd.weight' (q4_K) (and 850 others) cannot be used with preferred buffer type AMX, using CPU instead
load_tensors: tensor 'token_embd.weight' (q4_K) (and 850 others) cannot be used with preferred buffer type AMX, using CPU instead
load_tensors: tensor 'token_embd.weight' (q8_0) (and 54 others) cannot be used with preferred buffer type AMX, using CPU instead

Cheers and thanks for your time! Good luck to everyone in the quest for more tok/sec!

Methodology and Notes

Click the arrow to open the fold filled with benchmarking logs and notes.

Methodology and Notes

System Information

## CPU
$ lscpu | grep Xeon
Model name:                           Intel(R) Xeon(R) 6980P

## 128 P cores per socket with SMT enabled
$ echo $(nproc)
512

## Flags amx_bf16 amx_int8 amx_tile
$ lscpu | grep Flags
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities

## 1.5 TB RAM
$ free -h
               total        used        free      shared  buff/cache   available
Mem:           1.5Ti        25Gi       1.1Ti        13Mi       365Gi       1.5Ti
Swap:          8.0Gi        31Mi       8.0Gi

## 24x 64GB DDR5-6400 RDIMMs
$ sudo lshw | grep MTC40F2046S1RC64BDY | wc
24 ...

#load_tensors: tensor 'token_embd.weight' (q4_K) (and 850 others) cannot be used with preferred buffer type AMX, using CPU instead# NUMA configuration (funny how 128 cores divides into 3 NUMA nodes lol)
$ numactl --hardware --cpu-compress
available: 6 nodes (0-5)
node 0 cpus: 0-42, 256-298 (86)
node 0 size: 257688 MB
node 0 free: 187617 MB
node 1 cpus: 43-85, 299-341 (86)
node 1 size: 258018 MB
node 1 free: 193635 MB
node 2 cpus: 86-127, 342-383 (84)
node 2 size: 258019 MB
node 2 free: 194227 MB
node 3 cpus: 128-170, 384-426 (86)
node 3 size: 258018 MB
node 3 free: 193767 MB
node 4 cpus: 171-213, 427-469 (86)
node 4 size: 258018 MB
node 4 free: 193839 MB
node 5 cpus: 214-255, 470-511 (84)
node 5 size: 257949 MB
node 5 free: 191309 MB
node distances:
node   0   1   2   3   4   5
  0:  10  12  12  21  21  21
  1:  12  10  12  21  21  21
  2:  12  12  10  21  21  21
  3:  21  21  21  10  12  12
  4:  21  21  21  12  10  12
  5:  21  21  21  12  12  10

## Ubuntu LTS with stock kernel (kind of old I know)
$ grep PRETTY /etc/os-release
PRETTY_NAME="Ubuntu 24.04.2 LTS"

$ uname -a
Linux testbox 6.8.0-53-generic #55-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 17 15:37:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

## Enabled THP (transparent huge pages)
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

$ grep -i hugepages /proc/meminfo
AnonHugePages:     34816 kB
ShmemHugePages:        0 kB
FileHugePages:         0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

## Set performance mode
$ powerprofilesctl get
performance

## Confirm Linux Kernel NUMA Balancing Enabled
$ cat /proc/sys/kernel/numa_balancing
1

Memory Benchmarks

## Download and run Intel mlc
#  https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html

## Temporarily enable 4000 hugepages for mlc testing
$ echo 4000 | sudo tee /proc/sys/vm/nr_hugepages

## Test Entire System (results just below)
$ sudo ./Linux/mlc | tee -a whole-system.log

## Monitor Memory Placement
$ watch sudo numastat -p $(pidof mlc)
Per-node process memory usage (in MBs) for PID 3010820 (mlc)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                         0.33            0.22            0.22            0.22            0.22            0.34            1.55
Stack                        0.02            0.00            0.00            0.00            0.01            0.00            0.03
Private                  14358.25        15400.01        13800.00        13000.00        10100.12         5209.45        71867.84
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                    14358.60        15400.23        13800.22        13000.22        10100.35         5209.79        71869.41

## Results
Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for sequential access (in ns)...
                Numa node
Numa node            0       1       2       3       4       5
       0         138.7   168.0   208.5   394.1   475.2   445.1
       1         160.3   134.4   170.4   415.2   448.2   479.7
       2         156.2   123.6   106.5   507.8   513.2   452.5
       3         396.0   476.0   445.6   102.0   129.4   157.5
       4         419.7   452.6   421.2   122.1   102.4   130.2
       5         445.4   449.5   392.4   148.3   122.3   103.8

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :      1126026.6
3:1 Reads-Writes :      972377.5
2:1 Reads-Writes :      933247.3
1:1 Reads-Writes :      927164.2
Stream-triad like:      939630.2

Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
                Numa node
Numa node            0       1       2       3       4       5
       0        187911.4        188622.8        188716.9        94137.8 93596.5 93730.5
       1        188260.8        188176.4        188653.1        94495.4 90659.3 93774.2
       2        188624.6        188626.7        188129.6        94509.6 27886.4 93792.7
       3        94161.1 93415.7 94558.3 187851.4        188418.6        188691.9
       4        94201.1 91712.7 94546.8 188169.2        188067.6        188544.2
       5        94183.2 44861.0 94241.8 188416.4        188380.0        187933.8

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject  Latency Bandwidth
Delay   (ns)    MB/sec
==========================
 00000  378.26  1125007.8
 00002  381.36  1125706.3
 00008  382.90  1125594.5
 00015  381.40  1128101.6
 00050  377.79  1129501.1
 00100  296.51  1117783.2
 00200  301.72  1122699.0
 00300  207.87  1017250.0
 00400  170.76   782113.4
 00500  157.40   665276.4
 00700  138.25   488635.4
 01000  128.65   349546.6
 01300  125.55   271876.5
 01700  123.93   209644.5
 02500  116.19   143990.9
 03500  120.17   103477.5
 05000  119.53    72875.8
 09000  113.89    40898.3
 20000  115.14    18113.6

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency        80.5
Local Socket L2->L2 HITM latency        80.9
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
                        Reader Numa Node
Writer Numa Node     0       1       2       3       4       5
            0        -    99.3   124.9   376.2   401.7   429.5
            1    108.8       -   100.9   452.1   425.7   422.2
            2    131.0   103.8       -   435.5   407.4   378.1
            3    372.3   393.3   423.4       -   101.2   125.6
            4    444.2   414.2   413.5   106.3       -   100.9
            5    429.5   399.3   374.0   130.3   106.1       -
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
                        Reader Numa Node
Writer Numa Node     0       1       2       3       4       5
            0        -   109.6   140.2   381.2   444.0   440.0
            1    106.9       -   110.8   405.8   414.7   411.6
            2    137.1   103.8       -   436.3   442.6   381.2
            3    380.8   441.6   439.1       -   110.6   139.5
            4    406.3   412.7   411.6   105.8       -   110.7
            5    436.7   440.5   381.2   136.3   105.9       -

Compile AMX Extensions

## recent llama.cpp version
$ git rev-parse --short HEAD
a800ae46

## Compile with explicit AMX extensions
$ cmake -B ./build_amx \
    -DGGML_NATIVE=OFF \
    -DGGML_AVX512=ON \
    -DGGML_AVX512_BF16=ON \
    -DGGML_AVX512_VBMI=ON \
    -DGGML_AVX512_VNNI=ON \
    -DGGML_AMX_TILE=ON \
    -DGGML_AMX_INT8=ON \
    -DGGML_AMX_BF16=ON
$ cmake --build ./build_amx --config Release -j $(nproc)

## drop all caches before each run
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Single NUMA Node `UD-Q2_K_XL` Benchmarking Stock Compiler

$ numactl -N 0 -b -m 0 -C 0-42,256-298 \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 43,16,32,43,54,86

$ watch sudo numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 3113697 (llama-bench)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                        38.61            0.00            0.00            0.00            0.00            0.00           38.61
Stack                        0.03            0.00            0.00            0.00            0.00            0.00            0.03
Private                 223061.28            0.00            0.00            1.72            0.14            0.00       223063.14
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                   223099.92            0.00            0.00            1.72            0.14            0.00       223101.78

Results

model	size	params	backend	threads	test	t/s
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	16	pp512	13.93 ± 0.16
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	16	tg128	4.84 ± 0.03
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	32	pp512	24.97 ± 0.24
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	32	tg128	5.34 ± 0.01
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	43	pp512	30.93 ± 0.36
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	43	tg128	5.90 ± 0.01
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	54	pp512	24.03 ± 0.04
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	54	tg128	5.21 ± 0.01
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	86	pp512	32.36 ± 0.10
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	86	tg128	5.64 ± 0.03

build: a800ae46 (4783)

Single NUMA Node `UD-Q2_K_XL` Benchmarking Intel Base Kit Compiler

No improvements here and pp512 regressions over compiling the AMX way or even defaults. Likely indicates a memory bandwidth bottle neck and not CPU performance...? Need flame graphs lol...

$ numactl -N 0 -b -m 0 -C 0-42,256-298 \
    ./build_intel/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 43,16,32,43,54,86

$ watch sudo numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 3266837 (llama-bench)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                        63.23            0.00            0.00            0.00            0.00            0.00           63.23
Stack                        0.09            0.00            0.00            0.00            0.00            0.00            0.09
Private                 223392.67            0.00            0.00            1.43            0.06            0.00       223394.16
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                   223455.99            0.00            0.00            1.43            0.06            0.00       223457.48

Results

model	size	params	backend	threads	test	t/s
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	16	pp512	4.08 ± 0.14
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	16	tg128	4.19 ± 0.01
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	32	pp512	5.31 ± 0.14
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	32	tg128	5.31 ± 0.04
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	43	pp512	5.42 ± 0.06
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	43	tg128	5.88 ± 0.01
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	54	pp512	5.37 ± 0.06
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	54	tg128	5.83 ± 0.07
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	86	pp512	5.43 ± 0.07
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	BLAS	86	tg128	5.39 ± 0.05

build: a800ae46 (4783)

Single Socket `UD-Q2_K_XL` Benchmarking

## Benchmark Q2 2.51bpw Quant on single CPU socket
## *NOTE*: Throw away first pp/tg runs as mmap() model data warming loading into page cache effects initial performance
$ numactl -N 0,1,2 -b -m 0,1,2 -C 0-127,256-383 \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 128,64,86,108,128

## (results below)

## Observe how model weights get distributed across NUMA nodes' memory
## Using -mmap 0, or non-optimal numactl / --numa options may not distribute optimally and hurt performance.
$ watch -n1 numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 3042851 (llama-bench)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                        35.32            3.41            0.00            0.00            0.00            0.00           38.72
Stack                        0.05            0.00            0.00            0.00            0.00            0.00            0.05
Private                  76389.92        73387.73        72033.31            1.72            0.14            0.00       221812.83
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                    76425.28        73391.14        72033.31            1.72            0.14            0.00       221851.60

Results
NOTE: I deleted the first pp/tg warm-up runs which used 128 threads to distribute memory across nodes.

model	size	params	backend	threads	test	t/s
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	64	pp512	35.46 ± 0.98
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	64	tg128	7.59 ± 0.42
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	86	pp512	40.57 ± 0.25
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	86	tg128	7.67 ± 0.02
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	108	pp512	43.63 ± 0.25
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	108	tg128	7.37 ± 0.04
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	128	pp512	44.12 ± 3.66
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	128	tg128	7.12 ± 0.02

build: a800ae46 (4783)

(more threads is worse in other similar benchmarks I've run so I didn't search that space)

NOTE: This screenshot was before I switched from --numa distribute to --numa numactl and you can see CPU cores active on the opposite socket as where memory was allocated causing even worse performance.

Single Socket `Q4_K_M` Benchmarking

$ numactl -N 0,1,2 -b -m 0,1,2 -C 0-127,256-383 \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 128,64,86,108,128

$ watch -n1 numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 3088631 (llama-bench)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                        28.45            0.27           10.02            0.00            0.00            0.00           38.73
Stack                        0.05            0.00            0.00            0.00            0.00            0.00            0.05
Private                 128077.27       127615.25       134876.96            1.72            0.14            0.00       390571.34
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                   128105.76       127615.51       134886.98            1.72            0.14            0.00       390610.12

Results

model	size	params	backend	threads	test	t/s
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	pp512	30.50 ± 5.27
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	tg128	6.36 ± 0.23
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	64	pp512	33.20 ± 0.53
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	64	tg128	6.56 ± 0.06
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	86	pp512	38.08 ± 0.70
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	86	tg128	6.93 ± 0.02
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	108	pp512	41.81 ± 0.62
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	108	tg128	6.81 ± 0.03
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	pp512	45.79 ± 1.75
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	tg128	6.71 ± 0.04

build: a800ae46 (4783)

Dual Socket `Q4_K_M` Benchmarking Take 1

Take 2 below turned out slightly faster.

$ numactl -N 0-5 -b -m 0-5 -C 0-511 \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 256,86,128,172,216,256,296


$ watch -n1 numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 3107662 (llama-bench)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                         0.05            9.89            0.40            0.04           28.47            0.02           38.87
Stack                        0.00            0.00            0.02            0.00            0.06            0.00            0.08
Private                  62916.21        65258.70        64395.89        64235.00        64886.67        64430.31       386122.79
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                    62916.26        65268.59        64396.31        64235.04        64915.20        64430.34       386161.73

Results

model	size	params	backend	threads	test	t/s
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	86	pp512	27.83 ± 0.23
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	86	tg128	5.02 ± 0.03
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	pp512	32.80 ± 0.31
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	tg128	4.84 ± 0.04
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	172	pp512	35.22 ± 1.21
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	172	tg128	4.24 ± 0.01
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	216	pp512	34.82 ± 2.49
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	216	tg128	3.95 ± 0.02
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	256	pp512	35.20 ± 0.29
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	256	tg128	3.81 ± 0.02
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	296	pp512	33.81 ± 0.71
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	296	tg128	3.37 ± 0.03

build: a800ae46 (4783)

Dual Socket `Q4_K_M` Benchmarking Take 2

This turned out slightly faster than Take 1 above.

# https://github.com/ggml-org/llama.cpp/pull/11580#issuecomment-2640861105
# disable numa balancing didn't improve it signficantly
# so leaving off for this run to keep this guide more uniform in methodology
# $ echo 0 | sudo tee /proc/sys/kernel/numa_balancing

# drop all caches
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

# throw away first "warm-up" run due to page caching being populated
numactl --interleave=all \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa distribute \
    --threads 256,86,128,172

$ watch -n1 numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 3234940 (llama-bench)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                         8.48            2.48            8.48            4.48           10.48            4.48           38.87
Stack                        0.01            0.02            0.02            0.01            0.01            0.01            0.08
Private                  64383.48        64378.84        64374.48        64382.78        64379.12        64378.23       386276.93
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                    64391.96        64381.33        64382.97        64387.27        64389.62        64382.73       386315.88

Results

model	size	params	backend	threads	test	t/s
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	86	pp512	28.24 ± 0.26
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	86	tg128	5.34 ± 0.04
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	pp512	34.28 ± 0.21
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	tg128	4.89 ± 0.01
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	172	pp512	36.40 ± 0.36
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	172	tg128	4.45 ± 0.01

Single Socket `Q8_0` Benchmarking

$ numactl -N 0,1,2 -b -m 0,1,2 -C 0-127,256-383 \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 128,64,86,108,128

$ watch -n1 numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 3091581 (llama-bench)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                        31.20            7.51            0.00            0.00            0.00            0.00           38.71
Stack                        0.03            0.02            0.00            0.00            0.00            0.00            0.05
Private                 235043.34       229996.21       228409.38            1.72            0.14            0.00       693450.79
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                   235074.57       230003.74       228409.38            1.72            0.14            0.00       693489.55

Results

model	size	params	backend	threads	test	t/s
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	64	pp512	34.91 ± 0.60
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	64	tg128	5.20 ± 0.02
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	86	pp512	40.66 ± 0.30
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	86	tg128	5.54 ± 0.03
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	108	pp512	46.72 ± 0.43
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	108	tg128	5.63 ± 0.02
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	128	pp512	52.54 ± 0.56
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	128	tg128	5.40 ± 0.01

build: a800ae46 (4783)

Dual Socket `Q8_0` Benchmarking

$ numactl -N 0-5 -b -m 0-5 -C 0-511 \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 256,128,172,216,256,296

$ watch -n1 numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 3101638 (llama-bench)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                         0.02            2.00            0.03            2.03           34.78            0.00           38.86
Stack                        0.00            0.00            0.00            0.01            0.07            0.00            0.08
Private                 109025.81       108656.96       109494.54       121461.80       119104.16       120311.99       688055.26
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                   109025.83       108658.96       109494.57       121463.85       119139.00       120311.99       688094.20

Results

model	size	params	backend	threads	test	t/s
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	128	pp512	40.14 ± 1.03
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	128	tg128	4.14 ± 0.02
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	172	pp512	43.21 ± 1.05
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	172	tg128	3.79 ± 0.03
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	216	pp512	44.86 ± 2.81
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	216	tg128	3.48 ± 0.02
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	256	pp512	44.41 ± 1.85
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	256	tg128	3.24 ± 0.02
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	296	pp512	40.30 ± 1.92
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	296	tg128	2.93 ± 0.06

build: a800ae46 (4783)

Intel oneAPI Base Toolkit

# https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html?packages=oneapi-toolkit&oneapi-toolkit-os=linux&oneapi-lin=apt
# Docker Image
# $ docker pull intel/oneapi-basekit:latest

# Manual Install

sudo apt-get update
sudo apt install -y gpg-agent wget
# download the key to system keyring
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
    | gpg --dearmor \
    | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null
# add signed entry to apt sources and configure the APT client to use Intel repository:
echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" \
    | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt-get update

sudo apt-get install intel-oneapi-base-toolkit
# Need to get 2,353 MB of archives.
# After this operation, 12.9 GB of additional disk space will be used.
# Do you want to continue? [Y/n]


cd llama.cpp
source /opt/intel/oneapi/setvars.sh

cmake -S . \
    -B ./build_intel \
    -G Ninja \
    -DCMAKE_BUILD_TYPE=Release \
    -DGGML_BLAS=ON \
    -DGGML_BLAS_VENDOR=Intel10_64lp \
    -DCMAKE_C_COMPILER=icx \
    -DCMAKE_CXX_COMPILER=icpx \
    -DGGML_NATIVE=ON \
    -DCMAKE_INSTALL_PREFIX=./build_intel/ \
    -DLLAMA_BUILD_TESTS=OFF \
    -DLLAMA_BUILD_EXAMPLES=ON \
    -DLLAMA_BUILD_SERVER=ON \
    GGML_CCACHE=OFF

cmake --build ./build_intel --config Release -j $(nproc)

RPC Experiment Take 1

I managed to use the experimental llama.cpp RPC backend feature. Essentially I started up a process on one CPU socket with 200GB RAM available. Then I started llama-server on the other CPU socket specifying the RPC endpoint as a --device so it acts like a GPU. Then you "offload" layers to the RPC device. This was nice in that I could keep all processing local to a single NUMA node on each CPU socket. However, the performance was much worse likely due to synchronous stages stalling and barely utilizing CPU cores as described in this github issue comment. It got maybe 2 tok/sec after fully warmed up... :oof:

UPDATE I realized the rpc-server uses GGML_DEFAULT_N_THREADS which was set to 4. Write and test patch to force CPU backend and specify threads.
UPDATE 2 I made the patches and tried more stuff over in this github issue comment but still no better...

# First start-up RPC backend which seems to malloc in model weights
numactl -N 3 -m 3 -C 128-170,384-426 \
    ./build_amx/bin/rpc-server \
    --mem 200000 \
    --host 127.0.0.1 \
    --port 50052"

# Next start-up llama-server and offload about half the layers to the above RPC backend "GPU" device
numactl -N 0 -m 0 -C 0-42,256-298 \
    ./build_amx/bin/llama-server \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --rpc 127.0.0.1:50052 \
    --n-gpu-layers 30 \
    --device "RPC[127.0.0.1:50052]" \
    --threads 43 \
    --host 127.0.0.1 \
    --port 8080

# I confirmed with numastat that memory is loading only into expected nodes 0, and 3...

# I confirmed with `bmon` that loopback device had about 2MiB/s of spiky traffic flowing back and forth.

RPC Experiment Take 2

Trying to keep MoE block processing and memory in single NUMA node, but this experiment got only 2 tok/sec.

# start the rpc-server backends
# mem 1 is fine as it allocates whatever you send it anyway and seems to ignore that value
echo "Starting llama.cpp rpc-server backend for each NUMA node except for node 0."
for node in {1..5}
do
    CMD="numactl -N $node -m $node \
        $RPC_SERVER \
        --mem 1 \
        --threads 16 \
        --host 127.0.0.1 \
        --port 5005$node"
    echo $CMD
    $CMD &
    sleep 0.25
done

# start the front end in node 0 and connect to backends
numactl -N 0 -m 0 \
    ./build_amx/bin/llama-server \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --ctx-size 2048 \
    --rpc 127.0.0.1:50051,127.0.0.1:50052,127.0.0.1:50053,127.0.0.1:50054,127.0.0.1:50055 \
    --device RPC[127.0.0.1:50051],RPC[127.0.0.1:50052],RPC[127.0.0.1:50053],RPC[127.0.0.1:50054],RPC[127.0.0.1:50055] \
    --n-gpu-layers 55 \
    --tensor-split 11,11,11,11,11 \
    --threads 16 \
    --numa numactl \
    --host 127.0.0.1 \
    --port 8080

$ watch numastat -p $(pidof llama-server)
Per-node process memory usage (in MBs) for PID 3445558 (llama-server)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                        57.59            0.00            0.00            0.00            0.00            0.00           57.59
Stack                        0.04            0.00            0.00            0.00            0.00            0.00            0.04
Private                   5351.27            0.04            0.00            1.43            0.37            0.00         5353.10
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                     5408.90            0.04            0.00            1.43            0.37            0.00         5410.73

$ watch numastat -m -v -z
Per-node system memory usage (in MBs):
                          Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                 --------------- --------------- --------------- --------------- --------------- --------------- ---------------
MemTotal               257688.18       258018.79       258019.46       258018.79       258018.79       257949.26      1547713.25
MemFree                 46505.82       214036.19       214673.23       214467.01       213637.14       214257.29      1117576.68
MemUsed                211182.36        43982.60        43346.22        43551.77        44381.64        43691.97       430136.56
SwapCached                  1.07            0.05            0.00            0.76            0.46            0.02            2.37
Active                   3285.04        42158.39        42133.50        42174.35        42236.99        42144.80       214133.07
Inactive               204611.55            2.41            0.36            7.46           17.27            0.00       204639.05
Active(anon)             3273.52        42133.48        42130.91        42149.64        42220.11        42142.60       214050.25
Inactive(anon)              6.47            0.47            0.00            0.00            0.28            0.00            7.22
Active(file)               11.52           24.91            2.59           24.71           16.89            2.20           82.82
Inactive(file)         204605.08            1.95            0.36            7.46           16.99            0.00       204631.83
Unevictable                33.43            1.52            0.00            0.00            0.41            0.00           35.36
Mlocked                    24.64            1.52            0.00            0.00            0.41            0.00           26.57
Dirty                       0.11            0.58            0.00            0.00            0.01            0.00            0.69
FilePages              204633.19           28.45            3.71           33.46           37.70            2.84       204739.36
Mapped                   2147.81           25.73            2.95           17.93           32.04            2.21         2228.68
AnonPages                3296.90        42133.88        42130.15        42148.52        42217.05        42138.66       214065.16
Shmem                       8.85            0.02            0.75            0.54            2.95            0.62           13.73
KernelStack                22.03           12.82           12.64           12.64           13.33           12.27           85.73
PageTables                 17.39           82.89           82.65           83.38           83.54           82.67          432.53
Slab                     2037.31          499.05          209.41          335.96          912.96          352.09         4346.80
SReclaimable              537.65           31.09           21.68           45.86           53.04           34.64          723.97
SUnreclaim               1499.66          467.96          187.73          290.11          859.92          317.45         3622.83
AnonHugePages            3226.00        42124.00        42120.00        42122.00        42192.00        42122.00       213906.00
KReclaimable              537.65           31.09           21.68           45.86           53.04           34.64          723.97

RPC Experiment Take 3

# start rpc-server backends and frontend for `Q8_0` and 42 threads per node
$ watch numastat -p $(pidof llama-server)
Per-node process memory usage (in MBs) for PID 3471371 (llama-server)
                           Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Huge                         0.00            0.00            0.00            0.00            0.00            0.00            0.00
Heap                        53.95            0.00            0.00            0.00            0.00            0.00           53.95
Stack                        0.16            0.00            0.00            0.00            0.00            0.00            0.16
Private                     55.26            0.04            0.00            1.43            0.37            0.00           57.09
----------------  --------------- --------------- --------------- --------------- --------------- --------------- ---------------
Total                      109.36            0.04            0.00            1.43            0.37            0.00          111.20

$ watch numastat -m -v -z
Per-node system memory usage (in MBs):
                          Node 0          Node 1          Node 2          Node 3          Node 4          Node 5           Total
                 --------------- --------------- --------------- --------------- --------------- --------------- ---------------{
MemTotal               257688.18       258018.79       258019.46       258018.79       258018.79       257949.26      1547713.25
MemFree                  6659.93       112630.79       114129.57       114020.59       124956.64       125683.71       598081.22
MemUsed                251028.24       145388.00       143889.89       143998.20       133062.15       132265.55       949632.02
SwapCached                  1.16            0.05            0.00            0.76            0.46            0.02            2.45
Active                    116.76       143464.25       142374.88       142412.77       130665.94       130701.54       689736.14
Inactive               246606.83            9.05            0.36           17.58           43.84            0.03       246677.69
Active(anon)              109.22       143429.32       142372.29       142388.15       130627.51       130699.34       689625.83
Inactive(anon)              6.48            0.47            0.00            0.00            0.28            0.00            7.22
Active(file)                7.54           34.93            2.59           24.62           38.43            2.20          110.31
Inactive(file)         246600.36            8.58            0.36           17.58           43.56            0.03       246670.47
Unevictable                33.43            1.52            0.00            0.00            0.41            0.00           35.36
Mlocked                    24.64            1.52            0.00            0.00            0.41            0.00           26.57
Dirty                       0.05            0.70            0.00            0.00            0.00            0.00            0.76
FilePages              246624.57           45.10            3.71           43.50           85.82            2.87       246805.55
Mapped                     12.80           32.32            2.95           17.95           31.93            2.21          100.15
AnonPages                 132.57       143429.78       142371.53       142387.06       130624.50       130696.25       689641.68
Shmem                       8.85            0.01            0.75            0.54            2.95            0.62           13.73
KernelStack                21.80           13.19           13.07           13.00           13.83           12.67           87.56
PageTables                  6.38          281.34          279.02          279.74          256.81          256.16         1359.46
Slab                     3134.58          499.95          210.12          331.91          920.80          354.58         5451.94
SReclaimable             1613.26           31.82           21.71           40.74           59.92           34.54         1801.97
SUnreclaim               1521.32          468.13          188.41          291.18          860.89          320.04         3649.96
AnonHugePages              64.00       143420.00       142360.00       142358.00       130598.00       130674.00       689474.00
KReclaimable             1613.26           31.82           21.71           40.74           59.92           34.54         1801.97

tensor blk.60.ffn_up_exps.weight buffer type overriden to RPC[127.0.0.1:50055]
load_tensors: tensor 'token_embd.weight' (q8_0) (and 54 others) cannot be used with preferred buffer type AMX, using CPU instead
load_tensors: offloading 61 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 62/62 layers to GPU
load_tensors: RPC[127.0.0.1:50051] model buffer size = 141274.62 MiB
load_tensors: RPC[127.0.0.1:50052] model buffer size = 140224.62 MiB
load_tensors: RPC[127.0.0.1:50053] model buffer size = 140224.62 MiB
load_tensors: RPC[127.0.0.1:50054] model buffer size = 128559.34 MiB
load_tensors: RPC[127.0.0.1:50055] model buffer size = 129015.79 MiB
load_tensors:   CPU_Mapped model buffer size =   938.98 MiB
....................................................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 2048
llama_init_from_model: n_ctx_per_seq = 2048
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 0.025
llama_init_from_model: n_ctx_per_seq (2048) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 2048, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 61, can_shift = 0
llama_kv_cache_init: layer 0: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 1: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 2: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 3: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 4: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 5: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 6: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 7: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
.
.
llama_kv_cache_init: layer 59: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: layer 60: n_embd_k_gqa = 24576, n_embd_v_gqa = 16384
llama_kv_cache_init: RPC[127.0.0.1:50051] KV buffer size =  2080.00 MiB
llama_kv_cache_init: RPC[127.0.0.1:50052] KV buffer size =  2080.00 MiB
llama_kv_cache_init: RPC[127.0.0.1:50053] KV buffer size =  2080.00 MiB
llama_kv_cache_init: RPC[127.0.0.1:50054] KV buffer size =  1920.00 MiB
llama_kv_cache_init: RPC[127.0.0.1:50055] KV buffer size =  1600.00 MiB
llama_init_from_model: KV self size  = 9760.00 MiB, K (f16): 5856.00 MiB, V (f16): 3904.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.49 MiB
llama_init_from_model: RPC[127.0.0.1:50051] compute buffer size =   670.00 MiB
llama_init_from_model: RPC[127.0.0.1:50052] compute buffer size =   670.00 MiB
llama_init_from_model: RPC[127.0.0.1:50053] compute buffer size =   670.00 MiB
llama_init_from_model: RPC[127.0.0.1:50054] compute buffer size =   670.00 MiB
llama_init_from_model: RPC[127.0.0.1:50055] compute buffer size =   670.00 MiB
llama_init_from_model:        CPU compute buffer size =    18.01 MiB
llama_init_from_model: graph nodes  = 5025
llama_init_from_model: graph splits = 42
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 2048
slot        reset: id  0 | task -1 |
main: model Loaded
.
.
prompt eval time =    6487.93 ms /    98 tokens (   66.20 ms per token,    15.10 tokens per second)
       eval time =  691454.45 ms /  1455 tokens (  475.23 ms per token,     2.10 tokens per second)
      total time =  697942.37 ms /  1553 tokens

`SNC=Disable` Mode

This will give us 1x NUMA Node per CPU Socket.

All following tests done on build: e0331326 (4696) which is essentially sl/custom-tensor-offload and ug/rpc-numa-cpu-backend compiled with RPC enabled, but not actually using any of those features.

Memory Benchmarks

Intel(R) Memory Latency Checker - v3.11b
Measuring idle latencies for sequential access (in ns)...
		Numa node
Numa node	     0	     1	
       0	 130.7	 449.2	
       1	 410.0	 129.4	

Measuring Peak Injection Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	1108235.0	
3:1 Reads-Writes :	972151.5	
2:1 Reads-Writes :	940099.8	
1:1 Reads-Writes :	928269.2	
Stream-triad like:	918997.2	

Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
		Numa node
Numa node	     0	     1	
       0	554843.5	247793.1	
       1	247281.1	552385.5	

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	357.28	1106966.8
 00002	362.94	1108392.3
 00008	363.07	1107547.6
 00015	360.97	1104844.6
 00050	359.09	1102679.2
 00100	307.11	1099803.6
 00200	320.42	1105411.1
 00300	231.07	1007100.3
 00400	188.93	 789261.0
 00500	174.05	 665122.5
 00700	158.95	 487463.0
 01000	150.90	 349530.7
 01300	148.47	 271576.2
 01700	146.67	 209392.6
 02500	144.40	 143857.9
 03500	142.66	 103386.9
 05000	140.57	  72810.8
 09000	139.24	  40768.0
 20000	138.79	  18002.4

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	179.7
Local Socket L2->L2 HITM latency	180.2
Remote Socket L2->L2 HITM latency (data address homed in writer socket)
			Reader Numa Node
Writer Numa Node     0	     1	
            0	     -	 433.3	
            1	 413.7	     -	
Remote Socket L2->L2 HITM latency (data address homed in reader socket)
			Reader Numa Node
Writer Numa Node     0	     1	
            0	     -	 425.0	
            1	 422.4	     -

1x CPU `UD-Q2_K_XL`

powerprofilesctl set performance
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
numactl -N 0 -m 0 \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 64,16,32,42,54,86,128,144

watch sudo numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 48011 (llama-bench)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                        38.71            0.00           38.71
Stack                        0.05            0.00            0.05
Private                 226265.53            2.56       226268.09
----------------  --------------- --------------- ---------------
Total                   226304.29            2.56       226306.85

Results

model	size	params	backend	ngl	threads	test	t/s
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	64	pp512	30.12 ± 6.05
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	64	tg128	8.21 ± 0.26
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	16	pp512	13.86 ± 0.14
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	16	tg128	5.05 ± 0.02
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	32	pp512	24.58 ± 0.21
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	32	tg128	6.01 ± 0.02
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	42	pp512	28.77 ± 0.33
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	42	tg128	7.51 ± 0.04
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	54	pp512	33.55 ± 0.41
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	54	tg128	7.61 ± 0.03
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	86	pp512	42.79 ± 0.37
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	86	tg128	8.89 ± 0.01
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	128	pp512	52.21 ± 1.17
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	128	tg128	8.13 ± 0.04
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	144	pp512	40.41 ± 0.39
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	144	tg128	6.93 ± 0.05

ik_llama.cpp

numactl -N 0 -m 0 \
./build/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    -ctk f16 -ctv f16 \
    -mla 2 -fa 1 \
    -amb 2048 \
    -fmoe 1 \
    -rtr 1 \
    --numa numactl \
    --threads 64,43,64,86,128,172

Results

model	size	params	backend	threads	fa	mla	amb	rtr	fmoe	test	t/s
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	64	1	2	2048	1	1	pp512	101.23 ± 0.11
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	64	1	2	2048	1	1	tg128	9.47 ± 0.01
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	43	1	2	2048	1	1	pp512	76.69 ± 0.14
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	43	1	2	2048	1	1	tg128	8.37 ± 0.00
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	64	1	2	2048	1	1	pp512	98.91 ± 0.19
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	64	1	2	2048	1	1	tg128	9.32 ± 0.01
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	86	1	2	2048	1	1	pp512	118.22 ± 0.55
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	86	1	2	2048	1	1	tg128	9.63 ± 0.00
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	128	1	2	2048	1	1	pp512	147.49 ± 12.00
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	128	1	2	2048	1	1	tg128	9.94 ± 0.00
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	172	1	2	2048	1	1	pp512	113.38 ± 0.68
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	CPU	172	1	2	2048	1	1	tg128	8.78 ± 0.00

2x CPU `UD-Q2_K_XL`

powerprofilesctl set performance
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL/DeepSeek-R1-UD-Q2_K_XL-00001-of-00005.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa distribute \
    --threads 128,86,128,172,215,256,288

watch sudo numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 53293 (llama-bench)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                         2.00           36.89           38.89
Stack                        0.00            0.08            0.08
Private                 109284.57       114241.20       223525.76
----------------  --------------- --------------- ---------------
Total                   109286.57       114278.17       223564.73

Results

model	size	params	backend	ngl	threads	test	t/s
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	128	pp512	29.45 ± 6.08
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	128	tg128	3.71 ± 0.07
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	86	pp512	32.53 ± 0.31
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	86	tg128	5.79 ± 0.03
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	128	pp512	37.18 ± 0.67
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	128	tg128	3.93 ± 0.01
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	172	pp512	39.52 ± 0.26
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	172	tg128	3.60 ± 0.05
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	215	pp512	40.92 ± 0.45
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	215	tg128	2.68 ± 0.05
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	256	pp512	40.69 ± 1.58
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	256	tg128	3.93 ± 0.02
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	288	pp512	37.61 ± 0.35
deepseek2 671B Q2_K - Medium	211.03 GiB	671.03 B	RPC	99	288	tg128	3.42 ± 0.01

1x CPU `Q4_K_M`

powerprofilesctl set performance
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
numactl -N 0 -m 0 \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 64,43,64,86,128

watch sudo numastat -p $(pidof llama-bench)
...

Results
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | -------------------: |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 64 | pp512 | 24.40 ± 4.96 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 64 | tg128 | 6.95 ± 0.33 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 43 | pp512 | 25.41 ± 0.70 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 43 | tg128 | 5.88 ± 0.04 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 64 | pp512 | 33.83 ± 0.50 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 64 | tg128 | 7.30 ± 0.02 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 86 | pp512 | 39.24 ± 0.94 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 86 | tg128 | 7.82 ± 0.02 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 128 | pp512 | 46.84 ± 1.46 |
| deepseek2 671B Q4_K - Medium | 376.65 GiB | 671.03 B | RPC | 99 | 128 | tg128 | 7.16 ± 0.02 |

ik_llama.cpp

numactl -N 0 -m 0 \
./build/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q4_K_M/DeepSeek-R1-Q4_K_M-00001-of-00009.gguf \
    -ctk f16 -ctv f16 \
    -mla 2 -fa 1 \
    -amb 2048 \
    -fmoe 1 \
    -rtr 1 \
    --numa numactl \
    --threads 64,86,128

Results

model	size	params	backend	threads	fa	mla	amb	rtr	fmoe	test	t/s
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	64	1	2	2048	1	1	pp512	93.08 ± 0.76
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	64	1	2	2048	1	1	tg128	10.02 ± 0.00
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	86	1	2	2048	1	1	pp512	114.34 ± 0.67
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	86	1	2	2048	1	1	tg128	9.87 ± 0.00
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	1	2	2048	1	1	pp512	143.04 ± 7.88
deepseek2 671B Q4_K - Medium	376.65 GiB	671.03 B	CPU	128	1	2	2048	1	1	tg128	9.07 ± 0.00

2x CPU `Q4_K_M`

powerprofilesctl set performance
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
TODO

watch sudo numastat -p $(pidof llama-bench)
TODO

Results
TODO

1x CPU `Q8_0`

powerprofilesctl set performance
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
numactl -N 0 -m 0 \
    ./build_amx/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
    --cache-type-k f16 \
    --cache-type-v f16 \
    --numa numactl \
    --threads 64,43,64,86,128

watch sudo numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 64078 (llama-bench)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                        38.72            0.00           38.72
Stack                        0.05            0.00            0.05
Private                 679564.49            2.59       679567.08
----------------  --------------- --------------- ---------------
Total                   679603.26            2.59       679605.86

Results

model	size	params	backend	ngl	threads	test	t/s
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	64	pp512	23.44 ± 7.74
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	64	tg128	5.41 ± 0.38
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	43	pp512	26.27 ± 0.70
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	43	tg128	4.91 ± 0.04
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	64	pp512	36.52 ± 0.65
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	64	tg128	5.88 ± 0.03
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	86	pp512	43.81 ± 0.87
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	86	tg128	6.15 ± 0.02
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	128	pp512	57.19 ± 2.20
deepseek2 671B Q8_0	664.29 GiB	671.03 B	RPC	99	128	tg128	5.86 ± 0.01

ik_llama.cpp

numactl -N 0 -m 0 \
./build/bin/llama-bench \
    --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
    -ctk f16 -ctv f16 \
    -mla 2 -fa 1 \
    -amb 2048 \
    -fmoe 1 \
    -rtr 1 \
    --numa numactl \
    --threads 43,64,86,128

watch sudo numastat -p $(pidof llama-bench)
Per-node process memory usage (in MBs) for PID 1487833 (llama-bench)
                           Node 0          Node 1           Total
                  --------------- --------------- ---------------
Huge                         0.00            0.00            0.00
Heap                        76.18            0.00           76.18
Stack                        0.06            0.00            0.06
Private                 681346.11            3.20       681349.30
----------------  --------------- --------------- ---------------
Total                   681422.36            3.20       681425.55

Results

model	size	params	backend	threads	fa	mla	amb	rtr	fmoe	test	t/s
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	43	1	2	2048	1	1	pp512	77.28 ± 0.14
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	43	1	2	2048	1	1	tg128	6.50 ± 0.00
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	64	1	2	2048	1	1	pp512	107.43 ± 6.55
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	64	1	2	2048	1	1	tg128	7.52 ± 0.00
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	86	1	2	2048	1	1	pp512	110.24 ± 4.70
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	86	1	2	2048	1	1	tg128	7.37 ± 0.00
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	128	1	2	2048	1	1	pp512	152.62 ± 6.02
deepseek2 671B Q8_0	664.29 GiB	671.03 B	CPU	128	1	2	2048	1	1	tg128	7.01 ± 0.00

2x CPU `Q8_0`

powerprofilesctl set performance
echo 0 | sudo tee /proc/sys/kernel/numa_balancing
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
TODO

watch sudo numastat -p $(pidof llama-bench)
TODO

Results
TODO

chrisjob1021 · 2025-02-26T20:24:09Z

chrisjob1021
Feb 26, 2025

@ubergarm may just want to note in the math that Bpw is converted to bytes from bits with your constant term (/8). Initially confused me. Otherwise great write up!

1 reply

ubergarm Feb 26, 2025
Author

Thanks! I added a note to make it more clear and made it simply bpw instead of subscript. My LaTeX foo is not so hot these days lol...

ice6 · 2025-02-27T01:46:02Z

ice6
Feb 27, 2025

@ubergarm great work you have done! I am living in china, eager to buy such a server. but I found no way to buy it. :(

it seems that the intel amx team are implementing fp8 support. if so, the hardware is really great for llm inference.

I am curious about the concurrent performance.

3 replies

ubergarm Feb 27, 2025
Author

Thanks, I wonder how much they cost compared to the MSRP sticker cost?

The best current performance would come from ktransformers with a one or two 4090D depending on how much context you need. You need enough RAM to hold the entire model twice if you go with a dual socket system.

I have benchmarked the best I could get with llama.cpp main branch today. Hopefully there are some easy optimizations that I missed. If the Intel AMX team implements fp8 right on the CPU that could get interesting for sure!

谢谢，我想知道它们的价格与厂商建议零售价相比如何？

目前最佳性能来自配备一个或两个4090D的ktransformers，具体取决于您所需的上下文长度。如果采用双插槽系统，您需要足够的内存来容纳整个模型两次。

目前我已经对llama.cpp主分支进行了基准测试，得到了目前我能获得的最佳结果。希望还有一些我遗漏的简单优化方法。如果英特尔AMX团队能在CPU上正确实现浮点8精度，那肯定会非常有趣！

KimmyGLM Mar 9, 2025

Thanks, I wonder how much they cost compared to the MSRP sticker cost?

The best current performance would come from ktransformers with a one or two 4090D depending on how much context you need. You need enough RAM to hold the entire model twice if you go with a dual socket system.

I have benchmarked the best I could get with llama.cpp main branch today. Hopefully there are some easy optimizations that I missed. If the Intel AMX team implements fp8 right on the CPU that could get interesting for sure!

谢谢，我想知道它们的价格与厂商建议零售价相比如何？

目前最佳性能来自配备一个或两个4090D的ktransformers，具体取决于您所需的上下文长度。如果采用双插槽系统，您需要足够的内存来容纳整个模型两次。

目前我已经对llama.cpp主分支进行了基准测试，得到了目前我能获得的最佳结果。希望还有一些我遗漏的简单优化方法。如果英特尔AMX团队能在CPU上正确实现浮点8精度，那肯定会非常有趣！

now
intel next-gen 7rd Xeon Dimond Rapids will support FP8 through new AMX.

ubergarm Mar 9, 2025
Author

有趣的是，CPU与GPU的指令集及内存支持技术似乎正呈现融合趋势。英特尔Diamond Rapids平台引入amx_fp8指令集，而近期传闻称Bolt的Zeus GPU将基于RISC-V小芯片组架构并搭载DDR5内存？此类技术演进确实值得关注！

interesting how there seems to be a convergence between CPU and GPU instruction sets and memory support. Intel Diamond Rapids with amx_fp8 and now supposedly Bolt's Zeus GPU with DDR5 RAM on RISC-V chiplets? interesting times!

bjodom · 2025-02-27T03:53:22Z

bjodom
Feb 27, 2025

Nice writeup, I too have access to a similar system for a bit. I did not have much luck getting deepseekR1 to convert to gguf or I would just hand you my numbers. However, have you tried building llama.cpp with oneAPI Intel compiler? It will use oneMKL as the BLAS backend.

Steps Are:

Get the Intel Base Kit
Once installed initialize the environment

source /opt/intel/oneapi/setvars.sh

I used the following cmake build line:

cmake -S . -B build -G Ninja  -DCMAKE_BUILD_TYPE=Release -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=Intel10_64lp -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DGGML_NATIVE=ON -DCMAKE_INSTALL_PREFIX=~/llama_build/ -DLLAMA_BUILD_TESTS=OFF -DLLAMA_BUILD_EXAMPLES=ON -DLLAMA_BUILD_SERVER=ON GGML_CCACHE=OFF

Build

cmake --build build --config Release

If all goes well you should have a highly optimized llama.cpp

3 replies

ubergarm Feb 27, 2025
Author

Great this is the kind of thing I figured existed somewhere, but I'm not super familiar with Intel ecosystem. I'll try this out and see if it helps!

UPDATE: I tried it an no gains and pp was much slower oddly. I show the instructions as well as the exact benchmark command if possibly there is something I missed. I only tested a single NUMA node hoping to minimize variability with NUMA stuff. The Intel Base Kit Compiler may actually be faster on some workloads but this particular one is likely memory bandwidth bottle-necked and not CPU performance.

jzz001 Mar 4, 2025

Can you guys share how to compile AMX on Windows?
大佬们能分享一下在windows上编译amx的方法吗？

dc740 Mar 4, 2025

I'm also going through these discussions and had the same results. OpenMP (the default cpu backend) performed exactly the same as the intel one. The bottleneck is not on the CPU (even though they go up to 100% when running)

fairydreaming · 2025-02-27T16:46:13Z

fairydreaming
Feb 27, 2025
Collaborator

Some time ago I started similar discussion with numbers for AMD Epyc: #11733

I identified several areas where performance is reduced because of NUMA and got discouraged by the effort needed to straighten this up (I guess I'm not the first one).

4 replies

ubergarm Feb 27, 2025
Author

I updated the original post with links to your amazing thread with interesting hacks like warming up with token generation first.

ubergarm Mar 1, 2025
Author

I tried having multiple CPU rpc-server backends in different NUMA nodes, but seems like they default to 4 threads GGML_DEFAULT_N_THREADS. I might try to update the rpc-server.cpp example and add an option to specify threads and force CPU backend. Then maybe start up N rpc-servers, one in each NUMA node lol...

Readon Mar 2, 2025

I tried having multiple CPU rpc-server backends in different NUMA nodes, but seems like they default to 4 threads GGML_DEFAULT_N_THREADS. I might try to update the rpc-server.cpp example and add an option to specify threads and force CPU backend. Then maybe start up N rpc-servers, one in each NUMA node lol...

I guess it could not help a lot, the existing splitting strategy is by layer, which could not speed things up on single request. may be on batch mode.

ubergarm Mar 2, 2025
Author

I have one more testing loading right now with -nkvo, --no-kv-offload disable KV offload as I realized the kv cache was on the RPC backend with the exps's... 🤞

Nope, no 🎲 ... Still really slow unfortunately. Updated notes in #11397 (comment)

dc740 · 2025-03-04T19:08:10Z

dc740
Mar 4, 2025

I've experienced similar results running Deepseek R1 (unsloth 2.22b). Regarding the comment about SMT being faster than having it disabled, this was a property of intel cpus and the opposite happened with AMD. For details on this check the paper "Placement of Virtual Containers on NUMA systems: A Practical and Comprehensive Model"

Now... I'd love to be able to have both numa nodes working, since using them both produces less tokens/s than using a single node with half the processes.

Right now I found the best performance disabling ALL cuda devices, using numactl to use a single node, turning off the cpu security mitigations, setting the k cache type to fp16, disabling autobalancing,(still didn´t try adding interleave). On my hardware I'm getting around 2.3 tokens/s. Today I've been having to also disable mmap, otherwise the IO halts the entire process (this wasn't the case yesterday. I need to investigate).
I'm also getting a VERY VERY bad t/s if I offload some layers to gpu. Maybe following these recommendations it could work better #11397 (comment)

My hardware:

Dell R730XD
2x Intel Xeon E5 2699 V4 (22 cores, 44 threads on each cpu. 88 threads total)
1TB DDR4 RAM
Nvidia Tesla P40 (24GB)

6 replies

ubergarm Mar 5, 2025
Author

@ice6

What is the output of numactl --hardware --cpu-compress | grep -v free for your system?

I assume it is a total of 4x NUMA nodes across 2x sockets? That would mean each node has 384.0 GB RAM which is barely enough to fit the entire int4 (~377 GiB) ?

I agree memory bandwidth is key. It would be nice if there were a way to take advantage of the aggregate bandwidth across all NUMA nodes for single user generation situation. The only viable solution of which I know today is to use AMD Epyc with BIOS set to NPS0.

On the larger dual socket 6980P there are 6x total NUMA nodes, each has only 256GB in my 1.5TB RAM setup so I can't even fit that quant into a single node haha...

We'll see how the tensor parallel, parallel pipeline, and specialized distributed MoE "data parallel" features pan out across the different LLM inference engines.

ubergarm Mar 7, 2025
Author

@dc740 @ice6

I just learned about Intel Xeon SNC vs HEX/UMA mode. There is a benchmark on phoronix about it and the default is SNC which gives either 2 or 3 NUMA nodes per CPU socket.

Given none of the inference engines seem optimized for multiple NUMA right now, I want to try to get my test system BIOS set to SNC=Disable for 1x NUMA node per CPU socket. This seems to be roughly equivalent of AMD Epyc NPS1.

At least the next couple months will probably be the best option for now in combination with loading the entire model weights into RAM twice (once per socket). There is a very experimental proposal for it here

Eventually Tensor Parallel / Data Parallel + MoE Parallel / Pipeline Parallel type implementations might allow for better performance optimizing across more smaller NUMA nodes.

Maybe you and everyone already knows about this and had it disabled already, but I just learned about it! 😅

dc740 Mar 7, 2025

that's too new for my aging system xD
I only have 1 numa per cpu socket. But thanks for the tip! I didn't know about it either

ice6 Mar 8, 2025

@ice6

What is the output of numactl --hardware --cpu-compress | grep -v free for your system?

I assume it is a total of 4x NUMA nodes across 2x sockets? That would mean each node has 384.0 GB RAM which is barely enough to fit the entire int4 (~377 GiB) ?

I agree memory bandwidth is key. It would be nice if there were a way to take advantage of the aggregate bandwidth across all NUMA nodes for single user generation situation. The only viable solution of which I know today is to use AMD Epyc with BIOS set to NPS0.

On the larger dual socket 6980P there are 6x total NUMA nodes, each has only 256GB in my 1.5TB RAM setup so I can't even fit that quant into a single node haha...

We'll see how the tensor parallel, parallel pipeline, and specialized distributed MoE "data parallel" features pan out across the different LLM inference engines.

numactl --hardware --cpu-compress | grep -v free
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
node 0 size: 773440 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63
node 1 size: 645060 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

heart broken, in the node 1, two memory slots breaks :(

ubergarm Mar 9, 2025
Author

@ice6
node 1 size: 645060 MB oof... 😢

KimmyGLM · 2025-03-08T14:36:48Z

KimmyGLM
Mar 8, 2025

hello:
you can try Meituan new INT8 Deepseek model with your 6980p amx , INT8 is in the intel amx support list. .

https://huggingface.co/meituan/DeepSeek-R1-Block-INT8

8 replies

ubergarm Mar 10, 2025
Author

@KimmyGLM

here is my benchmark (8592es*2+768G 5600D5 ）

Thanks for sharing! Look like your best speed on the unsloth Q4_K_M is roughly 6.2 tok/sec generation. Just a little bit slower than the 6980P.

A couple questions:

I assume you were using a single CPU socket configured in a single NUMA node e.g. numactl -N 0 -m 0 llama-bench --numa numactl ... ?
I saw your post over on meituan. afaict you can't actually run that 128x128 block tiled int8 on llama.cpp? I think it is for CUDA backend maybe? I don't think llama.cpp backend has an optimized distributed gemm (generalized matrix multiply) that uses the amx_tile instruction (it is limited to 16x64 byte tile size). Do you know of any inference engine that supports this amx_tile optimization? Even intel themselves does not give much info on TMUL and is no longer supporting that demo code...

Thanks!

感谢分享！您在unsloth Q4_K_M配置下的最佳生成速度约为6.2 tokens/秒，略低于6980P处理器的表现。以下有几个问题请教：

您是否在使用单CPU插槽配置于单一NUMA节点？例如通过numactl -N 0 -m 0 llama-bench --numa numactl ...这类指令实现的配置？
在美团社区注意到您的讨论。据分析，目前llama.cpp似乎无法运行您提到的128x128分块处理的int8矩阵运算？这可能仅限于CUDA后使用。推测llama.cpp后端尚未集成基于amx_tile指令的分布式通用矩阵乘法（GEMM）优化方案（当前AMX指令受限于16x64字节的块尺寸）。您是否了解支持此类amx_tile优化的推理引擎？值得注意的是，英特尔官方对TMUL的代码示例披露有限，且已停止维护相关演示代码...

感谢！

KimmyGLM Mar 11, 2025

1.numactl --interleave=all \ , and I also test disabled NUMA in BIOS but benchmark is a little bit lower than above setting . ( 6token/s)
In my option, default numa setting and one cpu socket maybe the best selection to achieve about 5-6token/s in Chat (not benchmark)

2.Meituan INT8 model seems not compatible to popular Tool like llama.cpp and Vllm, just limited to SLANG; well ,i also try to connect with meituan in other approches.( like Chinese Model site and Wechat) But still no responese currenty.

I'd like to test this project, it has 8token/s in Linkedin,very intesting~but flash moe tool has no more info or details to present.
Do you have any details to share?
https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/llamacpp_portable_zip_gpu_quickstart.md#flashmoe-for-deepseek-v3r1

https://www.linkedin.com/posts/jasondai_with-the-latest-ipex-llm-llamacpp-portable-activity-7303194182729244673-FcxL/

KimmyGLM Mar 11, 2025

@KimmyGLM

here is my benchmark (8592es*2+768G 5600D5 ）

Thanks for sharing! Look like your best speed on the unsloth Q4_K_M is roughly 6.2 tok/sec generation. Just a little bit slower than the 6980P.

A couple questions:

I assume you were using a single CPU socket configured in a single NUMA node e.g. numactl -N 0 -m 0 llama-bench --numa numactl ... ?

I saw your post over on meituan. afaict you can't actually run that 128x128 block tiled int8 on llama.cpp? I think it is for CUDA backend maybe? I don't think llama.cpp backend has an optimized distributed gemm (generalized matrix multiply) that uses the amx_tile instruction (it is limited to 16x64 byte tile size). Do you know of any inference engine that supports this amx_tile optimization? Even intel themselves does not give much info on TMUL and is no longer supporting that demo code...

Update my combination test: Dual CPU Socket for Q4_k_m

numactl --interleave=all \ + distribute \ is the best one;
numactl -N 0 -m 0 \ or -N 0 -b -m 0 -C 0-63\ is the worst,only 1.8token/s
Disable BIOS Numa had little influence to 6token/s;
AMX seems has no acceleration )LOL

BTW, how to disable one CPU in BIOS? just like your Single Socket Test.

ice6 Mar 11, 2025

maybe amx is not transparent to developer, developer have to optimize for it?

ubergarm Mar 11, 2025
Author

@KimmyGLM @ice6

AMX seems has no acceleration )LOL

maybe amx is not transparent to developer, developer have to optimize for it?

Yeah, I believe that llama.cpp does make use of amx_int8 CPU flag depending on the exact quantization (q8_0 can use more):

load_tensors: tensor 'token_embd.weight' (q4_K) (and 850 others) cannot be used with preferred buffer type AMX, using CPU instead
load_tensors: tensor 'token_embd.weight' (q4_K) (and 850 others) cannot be used with preferred buffer type AMX, using CPU instead
load_tensors: tensor 'token_embd.weight' (q8_0) (and 54 others) cannot be used with preferred buffer type AMX, using CPU instead

...
# Q8_0
load_tensors:          AMX model buffer size = 18214.39 MiB
load_tensors:   CPU_Mapped model buffer size = 45565.90 MiB
load_tensors:   CPU_Mapped model buffer size = 46661.11 MiB
.
.
.

However, I don't believe it makes use of amx_tile and would likely require a developer to create an GeMM algorithm function optimized exactly to the hardware capabilities, cache size, CPU flags, NUMA, etc. My guess is it doesn't matter that much as the bottle neck on many systems is memory bandwidth. There is one guy who has a special unreleased version doing exactly this for his AMD Epyc hardware here: #11397 (comment) but he has issues with OpenMP threads not working with llama-server here: #12171

@KimmyGLM

how to disable one CPU in BIOS? just like your Single Socket Test.

I only know for Intel Xeon 6th generation. Go into BIOS and set SNC=Disable and you will get 1x NUMA node per CPU socket. This is equal to AMD NPS1. Intel Xeon does not seem to support any mode similar to AMD Epyc NPS0 mode unfortunately.

I updated my benchmarks above and setting 1x "big" NUMA node per CPU is faster than --interleave ... --distribute in my testing. Of course my test CPU had 3x NUMA nodes per CPU socket before for a total of 6 which is even less optimized haha...

Thanks for sharing info!

aubreyli · 2025-03-12T00:55:50Z

aubreyli
Mar 12, 2025

There are some OS kernel tweaks that might be worth trying:

Check if deep c-states (C6 and C6P) are enabled on your system, the theory here is that if you keep SMT enabled and let it enter deep sleep, then its sibling on the same core will use its power budget, allowing it to reach a higher frequency. turbostat/powertop is the tool to check.
Spectre / Meltdown Mitigation has significant performance penalties, it's worth turning it off and trying.
There are some Linux kernel scheduler tweaks, /sys/kernel/debug/sched/migration_cost_ns is the one most worth trying, setting a larger value can prevent the kernel from frequently migrating llama.cpp threads between different CPU Cores.

0 replies

aubreyli · 2025-03-12T03:01:08Z

aubreyli
Mar 12, 2025

AMX tile config is here in llama.cpp
And AMX MUL_MAT is here

If the tensor OP type is GGML_OP_MUL_MAT, it will be invoked on Intel AMX supported platform.

0 replies

KimmyGLM · 2025-03-17T11:42:50Z

KimmyGLM
Mar 17, 2025

hello:

    here is my recent tests in Ktransformers. Very interesting:

"export NUMA=1 "must be set before building KT project otherwise it will not be configured in KT;
NUMA=1 setting can use the total 768G D5 , Token speed about 10.5 token/S;but pulling the whole LLM into DRAM will take quite long time to run (double usage , double pull time )
Compared with NUMA=1, I also test the normal setting just just use about half D5 (400G around); Token speed about 8-9token/S
Not so much gap~;

what's more, llama.cpp seems more sophisticated in DRAM setting, it's prefere using Dram Cache instead of direct DRAM injecting.
I like this method because it still be there if i shutdown the CHAT, which save a lot of time .

0 replies

Intel Xeon performance on R1 671B quants? #12088

Intel Xeon performance on R1 671B quants?

tl;dr;

Overview

6980P Benchmarks

Related Issues

ktransformers

Theory

Discussion

Conclusion

Methodology and Notes

Methodology and Notes

System Information

Memory Benchmarks

Compile AMX Extensions

Single NUMA Node UD-Q2_K_XL Benchmarking Stock Compiler

Single NUMA Node UD-Q2_K_XL Benchmarking Intel Base Kit Compiler

Single Socket UD-Q2_K_XL Benchmarking

Single Socket Q4_K_M Benchmarking

Dual Socket Q4_K_M Benchmarking Take 1

Dual Socket Q4_K_M Benchmarking Take 2

Single Socket Q8_0 Benchmarking

Dual Socket Q8_0 Benchmarking

Intel oneAPI Base Toolkit

RPC Experiment Take 1

RPC Experiment Take 2

RPC Experiment Take 3

SNC=Disable Mode

Memory Benchmarks

1x CPU UD-Q2_K_XL

2x CPU UD-Q2_K_XL

1x CPU Q4_K_M

2x CPU Q4_K_M

1x CPU Q8_0

2x CPU Q8_0

Replies: 9 comments · 25 replies

ubergarm Feb 26, 2025 Author

ubergarm Feb 27, 2025 Author

ubergarm Mar 9, 2025 Author

ubergarm Feb 27, 2025 Author

fairydreaming Feb 27, 2025 Collaborator

ubergarm Feb 27, 2025 Author

ubergarm Mar 1, 2025 Author

ubergarm Mar 2, 2025 Author

ubergarm Mar 5, 2025 Author

ubergarm Mar 7, 2025 Author

ubergarm Mar 9, 2025 Author

ubergarm Mar 10, 2025 Author

ubergarm Mar 11, 2025 Author

Single NUMA Node `UD-Q2_K_XL` Benchmarking Stock Compiler

Single NUMA Node `UD-Q2_K_XL` Benchmarking Intel Base Kit Compiler

Single Socket `UD-Q2_K_XL` Benchmarking

Single Socket `Q4_K_M` Benchmarking

Dual Socket `Q4_K_M` Benchmarking Take 1

Dual Socket `Q4_K_M` Benchmarking Take 2

Single Socket `Q8_0` Benchmarking

Dual Socket `Q8_0` Benchmarking

`SNC=Disable` Mode

1x CPU `UD-Q2_K_XL`

2x CPU `UD-Q2_K_XL`

1x CPU `Q4_K_M`

2x CPU `Q4_K_M`

1x CPU `Q8_0`

2x CPU `Q8_0`

Replies: 9 comments 25 replies

ubergarm Feb 26, 2025
Author

ubergarm Feb 27, 2025
Author

ubergarm Mar 9, 2025
Author

ubergarm Feb 27, 2025
Author

fairydreaming
Feb 27, 2025
Collaborator

ubergarm Feb 27, 2025
Author

ubergarm Mar 1, 2025
Author

ubergarm Mar 2, 2025
Author

ubergarm Mar 5, 2025
Author

ubergarm Mar 7, 2025
Author

ubergarm Mar 9, 2025
Author

ubergarm Mar 10, 2025
Author

ubergarm Mar 11, 2025
Author