-
I hope this is the right place to ask, otherwise please advise where to post it. Hard to believe the M3 at 30 tokens/s is 2x faster than the Xeon. Is Apple Silicon simply better optimized, or are there parameters to tweak on the Xeon? (I see threads defaults to 4, and it's a 32-thread CPU.) Thanks for any help
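One knob worth checking is the thread count. A minimal sketch, assuming a 32-thread (16-core) Xeon and the standard `-t`/`--threads` flag of llama.cpp; token generation usually scales best up to the number of physical cores, not logical threads, and the model path below is a placeholder:

```shell
# Count logical CPUs (Linux: nproc, macOS: sysctl), then assume 2-way
# hyper-threading to get the physical core count.
LOGICAL=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
PHYSICAL=$((LOGICAL / 2))
[ "$PHYSICAL" -ge 1 ] || PHYSICAL=1
# Sketch of the invocation (echoed rather than executed here):
echo ./llama-cli -m model.gguf -t "$PHYSICAL" -p "hello"
```

Raising `-t` beyond the physical core count often hurts rather than helps, since generation is memory-bound.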
-
I jumped onto Performance of llama.cpp on Apple Silicon M-series, which has at least one (eye-watering) result for a dual(!) Xeon Platinum on Ubuntu 22.
-
From what I understood, Apple processors have the DRAM chips soldered very close to the SoC and running with extremely low latency (and high bandwidth). Memory performance is critical for LLMs, since the weights don't fit in cache, so that very likely counts for a lot.
> I jumped onto Performance of llama.cpp on Apple Silicon M-series, which has at least one (eye-watering) result for a dual(!) Xeon Platinum on Ubuntu 22.
OMG!
I will try to run the test CPU-only first (2x Xeon with AVX-512, 256 GB DDR5), and secondly hope the model fits into the 40 GB of GDDR6 (X?) across the two 20 GB RTX 4000 cards.
I guess the test can fall back to partially offloading layers if the model does not fit on the GPUs.
It would be nice if things fit into ONE GPU, to avoid the overhead of sharding across the GPUs' PCIe 4 links.
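For the partial-offload part, a minimal sketch using llama-bench's `-ngl` (number of GPU layers) flag; the model path and layer counts below are placeholders, not values from this thread:

```shell
# -ngl controls how many layers are offloaded to the GPU(s);
# layers that don't fit stay on the CPU (partial offload).
NGL=99    # "effectively all layers" if VRAM allows; lower it to offload partially
# Sketch of the invocation (echoed rather than executed here):
echo ./llama-bench -m model.gguf -t 16 -ngl "$NGL"
```

Sweeping `-ngl` from 0 upward should show where the CPU/GPU crossover sits for this model and hardware.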