Replies: 6 comments 9 replies
-
Not any time soon. The code is written to run extremely fast on CPUs, specifically x86_64 and ARM64 systems, not GPUs. The M1/M2 GPUs probably do not have enough memory to load the models anyway, but check out FB's original LLaMA inference code for their reference engine, and DeepSpeed for techniques to distribute models across both CPU and GPU memory. This can get very involved.
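To make "distribute models across both CPU and GPU memory" concrete, here is a minimal sketch using Hugging Face Accelerate's `device_map` feature, a simpler alternative to DeepSpeed's offloading (the checkpoint name and memory budgets are assumptions for illustration):

```python
# Sketch: split LLaMA weights across GPU memory and CPU RAM.
# Uses transformers + accelerate; not the DeepSpeed API itself,
# but the same layer-placement/offload idea the comment refers to.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "huggyllama/llama-7b"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                        # fill the GPU first, spill to CPU
    max_memory={0: "8GiB", "cpu": "24GiB"},   # assumed per-device budgets
    torch_dtype="auto",
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Layers that spill to CPU RAM are streamed to the GPU on demand, which is why this approach trades speed for capacity.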
-
I haven't had an Apple PC since 1987. See: Run LLaMA inference on Apple Silicon GPUs.
-
Thanks everyone for the answers. As I realized from Philip Turner's answer, there is no reason to run the model on the M1/M2 GPU; in fact, the CPU inference speed in my case is satisfactory. Moreover, although I had read a lot about garbage outputs, I managed to get pretty good results, at least comparable to GPT-3, on the 7B model. I also realized that there will be no way to fine-tune the model, as it requires too many resources. My goal is actually to try this model for paraphrasing tasks. I've been trying various prompts, but unlike summarization, the model seemingly can't handle this task out of the box. On the other hand, it has enough potential to do so, and, given that it can be run on a CPU, it could be used in an online tool. I suppose it would significantly outperform the T5 model.
-
It is a great apples-to-apples comparison. People are finding int4 on CPU to be similar in quality, with a significant reduction in memory requirements. I wonder if the tokens per Wh will improve when the memory I/O is reduced by roughly a factor of 4?
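For a sense of scale, a minimal sketch of the weight-size arithmetic behind that factor of 4 (scale/zero-point overhead of the quantization format is ignored here):

```python
# Back-of-the-envelope weight footprint of a 7B-parameter model.
# Each generated token streams the full weight set through memory,
# so bytes-per-parameter is a rough proxy for memory I/O per token.
params = 7e9
for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = params * bits / 8 / 2**30
    print(f"{name}: ~{gib:.1f} GiB")
# fp16 -> int4: ~13.0 GiB -> ~3.3 GiB, i.e. roughly 4x less data moved.
```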
-
Could Apple's Neural Engine Transformers implementation be of any help for GPU acceleration?
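For anyone curious what that would look like, here is a minimal sketch of pointing a traced PyTorch transformer layer at the Neural Engine with coremltools (a stock `nn.TransformerEncoderLayer` stands in for the ANE-optimized layers from Apple's ml-ane-transformers; shapes and names are assumptions):

```python
# Sketch: convert a traced transformer layer to Core ML and let the
# runtime schedule eligible ops on the Apple Neural Engine.
import torch
import coremltools as ct

layer = torch.nn.TransformerEncoderLayer(
    d_model=512, nhead=8, batch_first=True
).eval()
example = torch.rand(1, 128, 512)   # (batch, seq_len, d_model), assumed shape
traced = torch.jit.trace(layer, example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="hidden", shape=example.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.CPU_AND_NE,  # prefer the ANE, fall back to CPU
)
mlmodel.save("encoder_ane.mlpackage")
```

Note this targets the ANE rather than the GPU; `ComputeUnit.ALL` would let Core ML use the GPU as well.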
-
Running on the ANE or GPU, regardless of speed, is possibly more energy efficient, which is a big enough reason on its own when considering mobile devices.
-
Thank you for making it possible to run the model on my MacBook Air M1. I've been testing various parameters and I'm happy even with the 7B model. However, do you plan to utilize the GPU of the M1/M2 chip? Thank you in advance.