Replies: 2 comments 5 replies
-
What kind of documentation are you thinking about? What the flag does is very straightforward: it lets you choose where to store the tensors of a model. I don't think this needs a lot of explanation. Beyond that, my view is that it is up to the community to find interesting ways to take advantage of this. Of course, that requires you to understand what tensors are in a model and where they may be stored. This is inherently a very low-level feature and most people probably shouldn't be using it, but maybe it would be nice to have some documentation about typical use cases, and maybe in the future we can add simpler flags to enable them once it is clear what the applications are. I am not aware of any cases where using this flag causes the model to produce incorrect results. That should not happen; if you find a case, open an issue about it.
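A minimal illustration of the syntax, assuming the usual regex=buffer-type form of the flag (the pattern and model path here are only illustrative):

    # keep the MoE routed-expert tensors in host memory instead of VRAM
    llama-server -m model.gguf -ngl 99 --override-tensor "ffn_.*_exps=CPU"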
-
I took another swing at this override stuff with Llama 4 Scout and think I finally made some progress. On my first couple of tries I was not selecting the right tensors to send to the CPU. To find the set of tensor names, do a dummy server run along the lines of the sketch below.
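A plausible form of that dummy run (the ".*" pattern and the exact flags are my assumptions, not necessarily the original command):

    # route every tensor to CPU just to get the full list of tensor names in the load log
    llama-server -m Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf -ngl 99 -ot ".*=CPU"

The load log should then report each tensor as its buffer type is overridden, which gives you the full set of names.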
This will override every tensor in the model to the CPU so you can see the names. For Scout, the important tensors break down as follows.
There are 3 tensors per layer for the experts and 3 tensors per layer for the shared expert. The expert tensors need to be offloaded to CPU for all layers:
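In current GGUF conversions the per-layer routed-expert tensors are typically named as follows (N is the layer index; exact names may vary by conversion):

    blk.N.ffn_gate_exps.weight
    blk.N.ffn_up_exps.weight
    blk.N.ffn_down_exps.weight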
These experts contain the majority of the parameters of the model. Scout has about 6G/expert * 16 experts = 96G parameters in them. The shared expert tensors are:
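The corresponding per-layer shared-expert tensors typically follow this naming pattern (again, names may vary by conversion):

    blk.N.ffn_gate_shexp.weight
    blk.N.ffn_up_shexp.weight
    blk.N.ffn_down_shexp.weight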
These shared experts run on every token and need to be kept entirely on GPU for efficiency. Across the entire model the shared experts take about 6G of parameters and can easily fit on a GPU. So experts (96G) + shared experts (6G) bring the model parameters to 102G. The rest of the tensors add another 6G, bringing the whole thing to 108G. Thus shared + rest = 12G of parameters, which offloads easily to a 12GB GPU with a lot of room left over when quantizing to <4 bits/param. A command along the lines of the one below will pattern-match all the experts and send them to CPU:
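A plausible form of that override (the regex is an assumption; the intent is to match the routed ffn_gate_exps / ffn_up_exps / ffn_down_exps tensors while leaving the shexp shared-expert tensors on GPU):

    # "_exps" matches the routed experts only; the shared-expert "shexp" tensors do not match
    -ot "ffn_.*_exps=CPU"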
Now load the model with -ngl 99, and the 6G of shared-expert parameters plus the 6G of other parameters should easily fit on a single GPU. On my 4070 with a q8_0 KV cache there is enough VRAM left over for a 45k-token KV cache (using my Q2_K_H hybrid quant https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF/resolve/main/Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf to really push down the size of the model itself). Using this I get 8.9 t/s with a single GPU, about 1 to 2 t/s slower than if I offload layers to 3 RPC machines. The huge advantage of this approach is that I don't need RPC at all: I have a 108B-parameter model running on one commodity 4070 and supporting a 45k KV cache. The bottleneck in generation speed for me is that I only have a 9900K CPU with DDR4 memory, so it doesn't have a lot of memory bandwidth, but it only needs to effectively evaluate 6G of parameters per token on the CPU, which is not too bad. So I think this is actually the best way to run the model, instead of trying to offload it to a bunch of GPUs with RPC or otherwise.
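Putting it together, the full invocation would look something like this (the context size, KV-cache flags, and flash attention flag are my assumptions, not a copy of the original command):

    llama-server \
        -m Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf \
        -ngl 99 \
        -ot "ffn_.*_exps=CPU" \
        -c 45056 \
        -fa \
        -ctk q8_0 -ctv q8_0
    # routed experts stay on CPU; shared experts, attention, and other tensors go to the GPU
    # q8_0 KV cache (with flash attention) to fit a ~45k context in the remaining 12GB of VRAM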
-
llama-server / llama-cli version b5184 (still applies as of the latest build, b5205)
The override-tensor flag has become increasingly popular as a cost-effective way to run Mixture of Experts models, because it enables single-user hybrid CPU/GPU inference on modest hardware (particularly with DeepSeek-style models, whose shared expert can very reliably be targeted at the GPU for a low VRAM overhead).
As an example, on a Ryzen 9950X (192GB 4400MHz RAM) with an RTX 4000 SFF and an RTX 2000 Ada, I'm able to run Maverick at anywhere between a q4 and q6 quant at roughly 10 tokens per second (around 33 t/s prompt processing with --ubatch-size 4) by explicitly leaving all conditional experts on CPU; a sketch of such a command follows below. If I wanted to run a model that feels roughly as intelligent, I'd be looking at probably a 70B Llama 3.3 finetune, which runs at about 1.7 tokens per second (even with an optimized speculative decoding setup). I suspect this may even become something of a meta, similar to the P40 of early consumer LLM deployments, where hobbyists would pick up four P40s for less than the price of a 3090 for single-user local LLM use. A major issue at the moment is that the flag is wildly undocumented, and the only documentation was around the PR of the feature itself:
#11397
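For reference, a minimal sketch of the kind of command described above (the model filename is a placeholder and the regex is an assumption; it keeps the routed experts on CPU while everything else goes to the GPUs):

    llama-server \
        -m Llama-4-Maverick-17B-128E-Instruct-Q4_K_M.gguf \
        -ngl 99 \
        -ot "ffn_.*_exps=CPU" \
        --ubatch-size 4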
I'd like to put forward the argument that, given the release of recent high-profile MoE models such as Scout and Maverick (which will likely become more of a platform than a pair of models, much as Llama 3.x became a platform for finetunes), as well as the Ling MoE and presumably the Qwen 3 MoE series which recently leaked slightly ahead of launch, we're likely to see a lot more cases where somebody wants to be very intentional about the placement of tensors on their available hardware, and understanding how -ot works is a major part of that strategy. To that end, I'm hoping to foster some discussion about this in this thread. I'm not sure it's appropriate to file this as a formal issue, so I've started with a discussion.
Additionally, prior to the fixes for the Llama 4 series there was another major issue with the -ot flag: it broke multi-GPU setups, and left me in a weird situation where my primary GPU would see 19.5GB of usage while my secondary would only see 5GB (possibly it was only being used for prompt processing?). Recent updates have fixed this, but having no documentation on the functionality and usage of the flag makes it difficult to troubleshoot issues like this, and I currently have no idea which change actually fixed multi-GPU usage. I think leaving edge-case functionality to luck is not the best situation for end users, so I kindly hope that contributors will pitch in to this discussion.
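For anyone experimenting with explicit multi-GPU placement, here is a hedged sketch of splitting the routed experts across two GPUs with -ot (this assumes the flag can be repeated, that earlier patterns take precedence, and that CUDA0/CUDA1 are the backend buffer names on your system; the layer split is arbitrary):

    # experts of layers 0-23 on the first GPU, remaining layers' experts on the second
    llama-server -m model.gguf -ngl 99 \
        -ot "blk\.(1?[0-9]|2[0-3])\.ffn_.*_exps=CUDA0" \
        -ot "blk\..*\.ffn_.*_exps=CUDA1"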