-
I hope this is the right place to ask, otherwise please advise where to post it. Hard to believe the M3 at 30 tokens/s is 2x faster than the Xeon. Is Apple Silicon simply better optimized, or are there parameters to tweak on the Xeon? (I see threads defaults to 4, and it's a 32-thread CPU.) Thanks for any help
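One knob worth checking is the thread count. A minimal sketch, assuming a 32-thread (16-core) Xeon and the standard `-t`/`--threads` flag of llama.cpp; token generation usually scales best up to the number of physical cores, not logical threads, and the model path below is a placeholder:

```shell
# Count logical CPUs (Linux: nproc, macOS: sysctl), then assume 2-way
# hyper-threading to get the physical core count.
LOGICAL=$(nproc 2>/dev/null || sysctl -n hw.ncpu)
PHYSICAL=$((LOGICAL / 2))
[ "$PHYSICAL" -ge 1 ] || PHYSICAL=1
# Sketch of the invocation (echoed rather than executed here):
echo ./llama-cli -m model.gguf -t "$PHYSICAL" -p "hello"
```

Raising `-t` beyond the physical core count often hurts rather than helps, since generation is memory-bound.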
-
I jumped onto Performance of llama.cpp on Apple Silicon M-series, which has at least one (eye-watering) result for a dual(!) Xeon Platinum on Ubuntu 22.
-
From what I understood, Apple processors have the DRAM chips soldered very close to the SoC and running with extremely low latency (and high bandwidth). Memory performance is critical for LLMs, since the weights don't fit in cache, so that very likely counts for a lot.
> I jumped onto Performance of llama.cpp on Apple Silicon M-series, which has at least one (eye-watering) result for a dual(!) Xeon Platinum on Ubuntu 22.
OMG!
I will try to run the test CPU-only first (2x Xeon with AVX-512, 256 GB DDR5), and secondly hope the model fits into the 40 GB of GDDR6 (X?) across the two 20 GB RTX 4000 cards.
I guess the test can fall back to partially offloading layers if the model does not fit on the GPUs.
It would be nice if things fit into ONE GPU, to avoid the overhead of sharding across the GPUs' PCIe 4 links.
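For the partial-offload part, a minimal sketch using llama-bench's `-ngl` (number of GPU layers) flag; the model path and layer counts below are placeholders, not values from this thread:

```shell
# -ngl controls how many layers are offloaded to the GPU(s);
# layers that don't fit stay on the CPU (partial offload).
NGL=99    # "effectively all layers" if VRAM allows; lower it to offload partially
# Sketch of the invocation (echoed rather than executed here):
echo ./llama-bench -m model.gguf -t 16 -ngl "$NGL"
```

Sweeping `-ngl` from 0 upward should show where the CPU/GPU crossover sits for this model and hardware.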