Vulkan Acceleration #50
I've managed to upgrade it to do a dot product between two matrices of x * x. As a test I did a dot product between two 2048 * 2048 matrices, using floats for each element. Next I want to upgrade it to handle matrices that are not a multiple of 32 and that are not square, and then try to build a compute shader that can do attention with softmax on the QKV matrices. The only really tricky thing is building the compute shader correctly; integrating it wouldn't be that hard. The workers/root node could selectively offload certain calculations to be processed in the compute shader by the GPU, and you can account for GPU memory restrictions by splitting the load up into sequential calls to the GPU. As before, here is a copy of the code as it is now: |
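(The code referred to above was attached in the original comment and isn't reproduced here. As a rough illustration of the chunking idea mentioned, splitting the work into sequential GPU calls when the whole weight matrix won't fit in GPU memory, a minimal sketch might look like the following; `runMatmulChunkOnGpu` is a hypothetical callback, not a function that exists in this project.)

```cpp
// Minimal sketch: split a large matmul into row chunks so each chunk fits
// within a GPU memory budget, then run the compute shader once per chunk.
#include <cstdint>
#include <functional>

void matmulInChunks(
    uint32_t rows, uint32_t cols,
    uint64_t gpuMemoryBudgetBytes,
    const std::function<void(uint32_t rowStart, uint32_t rowCount)>& runMatmulChunkOnGpu)
{
    // Assumes FP32 weights; quantized formats would need fewer bytes per row.
    const uint64_t bytesPerRow = (uint64_t)cols * sizeof(float);
    uint32_t rowsPerChunk = (uint32_t)(gpuMemoryBudgetBytes / bytesPerRow);
    if (rowsPerChunk == 0) rowsPerChunk = 1;

    for (uint32_t rowStart = 0; rowStart < rows; rowStart += rowsPerChunk) {
        uint32_t rowCount = (rows - rowStart < rowsPerChunk) ? (rows - rowStart) : rowsPerChunk;
        runMatmulChunkOnGpu(rowStart, rowCount); // upload chunk, dispatch, read back
    }
}
```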
Great! I was wondering about CUDA as the first accelerator, but for Raspberry Pi, Vulkan may be a better choice. Please check the llama.cpp repository; they have implemented the matrix multiplication already. |
Yeah, Vulkan is nice because it has a wide range of support, not just for SBCs but also for computers with AMD or Nvidia cards. I've basically got dot product multiplication done, but there are some nitty-gritty issues that I still need to figure out. Also, yeah, not a bad idea; I'm busy looking at these shaders right now. I've at least gained most of the knowledge I need to actually utilize their shaders. |
https://ai.google.dev/edge/lite/microcontrollers/python for Raspberry Pi? https://www.tensorflow.org/guide/distributed_training for distributed TensorFlow? |
I've been working on how I'd integrate it into distributed-llama and I think I have a decent idea of how to go about it. I am working towards getting at least the 6 matmul functions offloaded to Vulkan, then I'll submit a pull request for it; will have to see in practice how well it performs. Ideally I'd have 1 compute shader for all 6, but for simplicity's sake I'm going to use 6 different compute shaders, one for each: matmulF32, matmulF16, matmulQ40, matmulQ80, matmulQ40vQ80 & matmulQ80vQ80 |
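(A sketch of how those six variants could map onto six compute pipelines chosen at dispatch time. The `MatmulKind` enum and `VulkanPipeline` type are hypothetical illustrations, not distributed-llama's actual types.)

```cpp
// One compute pipeline per weight/input type combination, selected at runtime.
enum class MatmulKind {
    F32,      // matmulF32
    F16,      // matmulF16
    Q40,      // matmulQ40
    Q80,      // matmulQ80
    Q40vQ80,  // matmulQ40vQ80
    Q80vQ80   // matmulQ80vQ80
};

struct VulkanPipeline; // wraps VkPipeline + layout, created once per shader at startup

VulkanPipeline* pickPipeline(MatmulKind kind, VulkanPipeline* pipelines[6]) {
    // A single "uber" shader covering all six cases would also work,
    // but keeping one shader per case is simpler to get right.
    return pipelines[static_cast<int>(kind)];
}
```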
This is my in-progress branch @b4rtaz. Getting there... |
Well, I actually managed to get Vulkan acceleration working! ./vulkan-test Only matmulF32 at the moment; I want to do these next. Once I've got them done as well, I'll do some speed tests to see what kind of an uplift this has. |
Need to figure out what Vulkan extensions I need to enable to support the Q40 and Q80 data types in the compute shader. The actual shader implementation isn't that complicated luckily, but I need to use specific data types. |
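(Since the exact extensions aren't named in the thread, here is a hedged sketch of how support for 8-bit/16-bit storage and float16/int8 arithmetic can be queried before a shader relies on them; the relevant extensions are VK_KHR_8bit_storage, VK_KHR_16bit_storage and VK_KHR_shader_float16_int8, promoted to core in later Vulkan versions.)

```cpp
// Query whether the device exposes the storage/arithmetic features that
// Q40/Q80 shaders typically need (vkGetPhysicalDeviceFeatures2 requires Vulkan 1.1+).
#include <vulkan/vulkan.h>
#include <cstdio>

void checkQuantFeatures(VkPhysicalDevice physicalDevice) {
    VkPhysicalDeviceShaderFloat16Int8Features f16i8{};
    f16i8.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_FLOAT16_INT8_FEATURES;

    VkPhysicalDevice8BitStorageFeatures storage8{};
    storage8.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_8BIT_STORAGE_FEATURES;
    storage8.pNext = &f16i8;

    VkPhysicalDevice16BitStorageFeatures storage16{};
    storage16.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_16BIT_STORAGE_FEATURES;
    storage16.pNext = &storage8;

    VkPhysicalDeviceFeatures2 features2{};
    features2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2;
    features2.pNext = &storage16;

    vkGetPhysicalDeviceFeatures2(physicalDevice, &features2);

    printf("shaderFloat16=%u shaderInt8=%u\n", f16i8.shaderFloat16, f16i8.shaderInt8);
    printf("storageBuffer8BitAccess=%u\n", storage8.storageBuffer8BitAccess);
    printf("storageBuffer16BitAccess=%u\n", storage16.storageBuffer16BitAccess);
}
// The same structs can be chained into VkDeviceCreateInfo::pNext (with the
// extensions added to ppEnabledExtensionNames) to actually enable the features.
```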
I'm very close now, just need to get the compute shader code to work correctly. |
I've successfully gotten matmulQ40Q80 to run via compute shader on Vulkan 🔥 and I get results back that are nearly 1:1 with the CPU-calculated results. One last thing I need to figure out: if I run it in inference mode with more than 1 thread running at a time, it bugs out; it's not yet multithread capable, but I will sort that out soon. In the meantime, I can check how fast it is compared to the CPU on a single matmul pass. Just printed the first 32 floats coming from each:
CPU Results: 0.00731812 0.00678142 0.00680221 0.00671218 0.00693316 0.00689348 0.00708914 0.00695994 0.00664946 0.00704365 0.00665891 0.00735391 0.00661354 0.00689124 0.00729823 0.0068318 0.00696582 0.00684787 0.00673844 0.0071383 0.00692065 0.00697429 0.00682781 0.00695222 0.0068927 0.00702631 0.00696984 0.00717608 0.00726813 0.00741034 0.00734587 0.00691799
Vulkan Results: 0.00737184 0.0069693 0.00689848 0.00662084 0.00692127 0.00690478 0.00713998 0.00661354 0.00675865 0.00720126 0.00705041 0.00735674 0.00685534 0.00662042 0.00709984 0.00699035 0.00676422 0.00702015 0.00673056 0.00712065 0.00695963 0.00703757 0.00696171 0.00701772 0.00682509 0.00709918 0.00704923 0.00718651 0.00713319 0.00719681 0.00736942 0.00710319
✅ matmulQ40Q80 |
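(Since the quantized path will never be bit-exact, a relative-tolerance check is the usual way to compare the two outputs. A small sketch, with a tolerance value picked by hand rather than taken from the project:)

```cpp
// Compare CPU and Vulkan outputs element-wise using a relative error tolerance.
#include <cmath>
#include <cstdio>

bool outputsMatch(const float* cpu, const float* gpu, int n, float relTol = 0.05f) {
    for (int i = 0; i < n; i++) {
        float denom = std::fabs(cpu[i]) > 1e-9f ? std::fabs(cpu[i]) : 1e-9f;
        float relErr = std::fabs(cpu[i] - gpu[i]) / denom;
        if (relErr > relTol) {
            printf("mismatch at %d: cpu=%f gpu=%f (rel err %f)\n", i, cpu[i], gpu[i], relErr);
            return false;
        }
    }
    return true;
}
```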
I set up Linux on another partition so that I could get native int8 GPU functionality without WSL ruining the party. Then I ran some speed tests of a single pass. The shader runs fairly well; at very low matrix dimensions the CPU is actually faster, but I haven't yet been able to test large enough dimensions due to a weird bug that's causing a segfault I haven't figured out yet. n: number of rows
n = 512, d = 256
n = 512, d = 1024
n = 512, d = 3072 |
Raspberry Pi 3B+, Ubuntu Server 22.04, Vulkan information:
|
Nice, so the Raspberry Pi should be able to support it as well. From what I see, I will need to adjust the workgroup size, to a max of 3 I think. |
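(A hedged sketch of how the per-device compute limits can be queried, so the shader's local workgroup size can be adapted to the Pi's driver instead of being hard-coded.)

```cpp
// Print the device's compute workgroup limits; the shader's local_size should
// stay within maxComputeWorkGroupSize / maxComputeWorkGroupInvocations.
#include <vulkan/vulkan.h>
#include <cstdio>

void printComputeLimits(VkPhysicalDevice physicalDevice) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(physicalDevice, &props);
    const VkPhysicalDeviceLimits& lim = props.limits;
    printf("maxComputeWorkGroupInvocations: %u\n", lim.maxComputeWorkGroupInvocations);
    printf("maxComputeWorkGroupSize: %u x %u x %u\n",
           lim.maxComputeWorkGroupSize[0], lim.maxComputeWorkGroupSize[1], lim.maxComputeWorkGroupSize[2]);
    printf("maxComputeWorkGroupCount: %u x %u x %u\n",
           lim.maxComputeWorkGroupCount[0], lim.maxComputeWorkGroupCount[1], lim.maxComputeWorkGroupCount[2]);
}
```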
sudo nice -n -20 ./main inference --model ..dllama_original_q40.bin --tokenizer ..dllama-llama3-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 1 Not quite there yet it seems |
It looks like the best approach would be to determine how many layers of weights can be loaded into GPU memory; if a layer is in GPU memory, then process it in Vulkan, else via CPU. This is what my compute shader for Q40Q80 looks like right now. The best result so far for a 4096 x 4096 weight matrix and a 1 x 4096 input matrix has been 2 ms, which is about the same as the CPU matmul. There is a lot I still have to figure out, but I'm making good headway. |
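(One way to make that layer-count decision is to sum the device-local memory heaps and divide by the per-layer weight size. A minimal sketch, assuming a hypothetical `bytesPerLayer` figure supplied by the caller:)

```cpp
// Estimate how many transformer layers fit in device-local (GPU) memory.
#include <vulkan/vulkan.h>
#include <cstdint>

uint64_t deviceLocalBytes(VkPhysicalDevice physicalDevice) {
    VkPhysicalDeviceMemoryProperties memProps;
    vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);
    uint64_t total = 0;
    for (uint32_t i = 0; i < memProps.memoryHeapCount; i++) {
        if (memProps.memoryHeaps[i].flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)
            total += memProps.memoryHeaps[i].size;
    }
    return total;
}

uint32_t layersThatFit(VkPhysicalDevice physicalDevice, uint64_t bytesPerLayer, uint32_t nLayers) {
    // Leave ~10% headroom for staging buffers and intermediate results.
    uint64_t budget = deviceLocalBytes(physicalDevice) / 10 * 9;
    uint64_t fits = bytesPerLayer ? budget / bytesPerLayer : 0;
    return (uint32_t)(fits < nLayers ? fits : nLayers);
}
```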
Probably I'm doing something wrong, but I wanted to compare the performance of the CPU with Vulkan on my Mac. llama.cpp has already implemented it, so: CPU:
Vulkan:
Additionally, I get some weird characters in the response, so maybe something is broken. @DifferentialityDevelopment could you observe any speedup with Vulkan on llama.cpp? |
Do you have one of those with the unified memory architecture? For me Vulkan is much faster than just CPU inference, as it makes use of my RTX 3060. Also, I know llama.cpp just had a patch that supposedly fixes some issues with Vulkan, so it might be that you just had an older version? |
I've been having a rough time getting Vulkan to work properly and efficiently with distributed-llama. I'm not sure exactly what I'm doing wrong yet. The tests I've run indicate that the Vulkan inference functions are within the margin of error when compared to the CPU matmul functions. However, when I use the main inference loop, the results are significantly different once the QKV calculation goes through the Vulkan implementation of matmulQ40_Q80. So, I'm not quite sure what's going on. My plan is to offload as many layers as possible to the GPU at startup (possibly configurable manually through a setting). During inference, it can then use the weights already in GPU memory. It will probably take me a month or more to get it working correctly. I will try to keep the Vulkan branch up to date with the main branch as much as possible so that it will be easier to merge when I eventually make a pull request. It already offloads the layers to GPU memory on startup (sort of), but the computation results are still not correct. |
Yeah, maybe CPU is too fast.
This is how it works in llama.cpp, there is |
Can I help with testing this? I have some GPUs I want to throw into the mix that are not ROCm/CUDA capable, plus I want to try my Pis. |
After some time I achieved a tiny bit of progress: I have a shader that is faster than a single M1 core:
Unfortunately the same shader on Raspberry Pi:
🤯 The weird thing is that I noticed
And for both devices I have the same speed. |
Something I have learned is that just copying the data into Vulkan buffers isn't the whole picture; there is an additional step of moving it into device-local GPU memory, which is where it's much faster. |
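(A hedged sketch of that staging step: write into a host-visible staging buffer, then record a vkCmdCopyBuffer into a DEVICE_LOCAL buffer so the shader reads fast GPU memory. Buffer and memory creation, submission and synchronization are omitted.)

```cpp
// Upload host data into a device-local buffer via a host-visible staging buffer.
#include <vulkan/vulkan.h>
#include <cstring>

void uploadToDeviceLocal(
    VkDevice device, VkCommandBuffer cmd,
    VkDeviceMemory stagingMemory, VkBuffer stagingBuffer,
    VkBuffer deviceLocalBuffer, const void* data, VkDeviceSize size)
{
    // 1. Copy from host RAM into the mapped staging buffer.
    void* mapped = nullptr;
    vkMapMemory(device, stagingMemory, 0, size, 0, &mapped);
    memcpy(mapped, data, (size_t)size);
    vkUnmapMemory(device, stagingMemory);

    // 2. Record a GPU-side copy into the device-local buffer.
    VkBufferCopy region{};
    region.srcOffset = 0;
    region.dstOffset = 0;
    region.size = size;
    vkCmdCopyBuffer(cmd, stagingBuffer, deviceLocalBuffer, 1, &region);
    // The command buffer still has to be submitted and waited on before dispatching the shader.
}
```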
Also, the warp size and the local/global workgroup sizes are very important for fully utilizing the GPU. Thinking about it, the Raspberry Pi has no dedicated GPU memory; it uses the system RAM. That kind of explains why your M1 was faster on GPU: the GPU cores can process the data in parallel much more efficiently than the CPU can, and both the GPU and CPU access memory at the same speed since it uses the unified memory architecture. Still, I'm sure the Raspberry Pi's GPU should be able to do the computations faster than its CPU can; I just wonder how to make it happen. I don't have a Raspberry Pi to test with myself, but I'm soon going to be able to upgrade to a 4-node setup (PCs). |
Some progress: 🫣
|
What size matrices are you testing it with? |
This requires around 229448 kB in memory (total size of …). I'm trying to implement matrix x vector multiplication. The size is basically taken from the Llama model. |
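(For reference, that figure is consistent with an FP32 matrix-vector multiply at Llama 3 feed-forward dimensions, which is an assumption on my part since the comment doesn't spell out the shapes: a 14336 x 4096 weight matrix plus a 4096-element input vector and a 14336-element output vector gives (14336 * 4096 + 4096 + 14336) * 4 B = 234,954,752 B, i.e. exactly 229,448 kB when dividing by 1024.)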
https://github.com/LostRuins/koboldcpp/tree/318d5b87fc1602ef16d8271bfdd937ef416a8182/include/vulkan
https://github.com/Const-me/Cgml might offer some insight, although it primarily targets Direct3D
https://github.com/CNugteren/CLBlast seems to be the consensus on embedded and AMD hardware |
The reason why Vulkan is slow is here:
Vulkan drivers for the Raspberry Pi lack the arithmetic support for 16-bit floating point and 8-bit integers. |
http://raspbian.raspberrypi.com/raspbian/pool/main/c/clblast/ I asked about hard-float RPi earlier, not realizing that hf in this community means Hugging Face. RPi in 32-bit/armhf might be more capable? https://cdimage.ubuntu.com/releases/22.04.4/release/ https://launchpad.net/ubuntu/jammy/+source/clblast natively available in 64-bit and 32-bit distributions https://en.wikipedia.org/wiki/ARM_architecture_family#Floating-point_(VFP):
https://en.wikipedia.org/wiki/IEEE_754 interesting comparison: |
The first version is implemented; it will probably require some adjustments. I cannot observe any acceleration on my Mac, but on strong GPUs it may be visible. I added a description of how to try running it here. |
Awesome! |
Had to adjust a couple of things to get it to compile on Windows. Installing the Vulkan SDK on Windows sets the environment variable VK_SDK_PATH, but you still need to make use of it in the Makefile so that #include <vulkan/vulkan.h> can correctly find the header files, and also so that the linker can link to the Vulkan libraries:
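(The original comment included the actual Makefile change; as a hedged illustration only, a Windows-aware adjustment could look roughly like this, with the exact variable names depending on distributed-llama's Makefile.)

```make
# Use the Vulkan SDK path set by the Windows installer; fall back to the system
# libvulkan on Linux. (Sketch; variable names are assumptions.)
ifdef VK_SDK_PATH
CXXFLAGS += -I"$(VK_SDK_PATH)/Include"
LDFLAGS  += -L"$(VK_SDK_PATH)/Lib" -lvulkan-1
else
LDFLAGS  += -lvulkan
endif
```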
With that I'm able to compile it, though I haven't tested it yet; going to do that in the morning. |
Hi @b4rtaz
I was tinkering a bit over the weekend and figured it might be possible to create a version of the worker/main that accelerates inference by offloading some work to the GPU.
I've never really worked with compute shaders, or Vulkan for that matter, but I put together a simple demo that successfully ran a compute shader using Vulkan.
The compute shader currently just takes an input buffer and copies the data to an output buffer.
This is what I have so far
compute-shader-example.zip
My next step is to upgrade it to do a matmul on two matrices, do the same operation on the CPU, and compare the results. I'm hopeful that I could utilize the worker/root node's dedicated/integrated GPU to do the heavy lifting.
I'll do some experiments on my fork on integrating it once I have a matmul compute shader working and let you know how it goes.
Right now I just want to get something working where I give it two matrices and it computes the resulting matmul output.