Accelerate CPU-Only inference #12047
ChienYiChi asked this question in Q&A (unanswered)
Replies: 1 comment
From experience, it's best to let llama.cpp pick the number of threads itself. If that doesn't work, tell it to use as many threads as you have physical cores, which from what I can tell is 128 on your machine, so try --threads 128 instead of --threads 32. Using more threads than you have physical cores actually runs slower, because llama.cpp does not benefit from hyperthreading.
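As a concrete illustration of that advice, here is a minimal sketch, assuming a Linux machine and a llama-cli binary in the current directory; the model path and prompt are placeholders:

```bash
# Count physical cores (unique core/socket pairs), ignoring hyperthread siblings.
PHYS_CORES=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)

# Run generation with one thread per physical core (model path is a placeholder).
./llama-cli -m ./DeepSeek-R1-UD-IQ1_S.gguf \
  -p "Why is the sky blue?" -n 128 \
  --threads "$PHYS_CORES"
```

On a 128-core machine this resolves to --threads 128, matching the suggestion above.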
The original question from ChienYiChi: How can I improve the CPU inference speed? On my current machine configuration, DeepSeek-R1 1.58-bit runs at 1.8 tokens/s, but I've seen others achieve 4.4 tokens/s with the same setup. Has anyone else run CPU inference benchmarks? (A benchmarking sketch follows the spec list below.)
CPU Spec:
Memory Spec:
Model: DeepSeek-R1-1.58bit
Command line:
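One way to get comparable numbers across machines is llama.cpp's bundled llama-bench tool, which reports tokens per second and accepts comma-separated parameter lists, so several thread counts can be swept in one run. A minimal sketch, assuming the binary is built in the current directory and using a placeholder model path:

```bash
# Benchmark prompt processing (-p) and generation (-n) at several thread counts.
# The model path is a placeholder; -t takes a comma-separated list of values.
./llama-bench -m ./DeepSeek-R1-UD-IQ1_S.gguf -p 512 -n 128 -t 32,64,128
```

Comparing the reported tokens/s across the -t values shows whether thread count alone explains the gap between 1.8 and 4.4 tokens/s.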