Accelerate CPU-Only inference #12047
ChienYiChi asked this question in Q&A (unanswered)
Replies: 1 comment
From experience, it's best to let llama.cpp pick the number of threads itself. If that doesn't work, tell it to use as many threads as you have physical cores, which from what I can tell is 128 on your machine, so try --threads 128 instead of --threads 32. Using more threads than you have physical cores actually runs slower, because llama.cpp does not benefit from hyperthreading.
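As a concrete illustration of that advice, here is a minimal sketch, assuming a Linux machine and a llama-cli binary in the current directory; the model path and prompt are placeholders:

```bash
# Count physical cores (unique core/socket pairs), ignoring hyperthread siblings.
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)

# Run generation with one thread per physical core (model path is a placeholder).
./llama-cli -m ./DeepSeek-R1-UD-IQ1_S.gguf \
  -p "Why is the sky blue?" -n 128 \
  --threads "$PHYS_CORES"
```

On a 128-core machine this resolves to --threads 128, matching the suggestion above.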
The original question from ChienYiChi: How can I improve the CPU inference speed? On my current machine configuration, DeepSeek-R1 1.58-bit runs at 1.8 tokens/s, but I've seen others achieve 4.4 tokens/s with the same setup. Has anyone else run CPU inference benchmarks? (A benchmarking sketch follows the spec list below.)
CPU Spec:
Memory Spec:
Model: DeepSeek-R1-1.58bit
Command line:
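One way to get comparable numbers across machines is llama.cpp's bundled llama-bench tool, which reports tokens per second and accepts comma-separated parameter lists, so several thread counts can be swept in one run. A minimal sketch, assuming the binary is built in the current directory and using a placeholder model path:

```bash
# Benchmark prompt processing (-p) and generation (-n) at several thread counts.
# The model path is a placeholder; -t takes a comma-separated list of values.
./llama-bench -m ./DeepSeek-R1-UD-IQ1_S.gguf -p 512 -n 128 -t 32,64,128
```

Comparing the reported tokens/s across the -t values shows whether thread count alone explains the gap between 1.8 and 4.4 tokens/s.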