Replies: 2 comments 5 replies
-
What kind of documentation are you thinking about? What the flag does is very straightforward: it lets you choose where to store the tensors of a model. I don't think this needs a lot of explanation. Beyond that, my view is that it is up to the community to find interesting ways to take advantage of this. Of course, that requires you to understand what tensors are in a model and where they may be stored. This is inherently a very low-level feature and most people probably shouldn't be using it, but maybe it would be nice to have some documentation about typical use cases, and maybe in the future we can add simpler flags to enable them once it is clear what the applications are. I am not aware of any cases where using this flag causes the model to produce incorrect results. That should not happen; if you find a case, open an issue about it.
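A minimal illustration of the syntax, assuming the usual regex=buffer-type form of the flag (the pattern and model path here are only illustrative):

    # keep the MoE routed-expert tensors in host memory instead of VRAM
    llama-server -m model.gguf -ngl 99 --override-tensor "ffn_.*_exps=CPU"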
-
I took another swing at this override stuff with Llama 4 Scout and think I finally made some progress. On my first couple of tries I was not selecting the right tensors to send to the CPU. To find the set of tensor names, do a dummy server run along the lines of the sketch below.
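A plausible form of that dummy run (the ".*" pattern and the exact flags are my assumptions, not necessarily the original command):

    # route every tensor to CPU just to get the full list of tensor names in the load log
    llama-server -m Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf -ngl 99 -ot ".*=CPU"

The load log should then report each tensor as its buffer type is overridden, which gives you the full set of names.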
This will override every tensor in the model to the CPU so you can see the names. For Scout, the important tensors break down as follows.
There are 3 tensors per layer for the experts and 3 tensors per layer for the shared expert. The expert tensors need to be offloaded to CPU for all layers:
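In current GGUF conversions the per-layer routed-expert tensors are typically named as follows (N is the layer index; exact names may vary by conversion):

    blk.N.ffn_gate_exps.weight
    blk.N.ffn_up_exps.weight
    blk.N.ffn_down_exps.weight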
These experts contain the majority of the parameters of the model. Scout has about 6G/expert * 16 experts = 96G parameters in them. The shared expert tensors are:
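The corresponding per-layer shared-expert tensors typically follow this naming pattern (again, names may vary by conversion):

    blk.N.ffn_gate_shexp.weight
    blk.N.ffn_up_shexp.weight
    blk.N.ffn_down_shexp.weight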
These shared experts run on every token and need to be kept entirely on GPU for efficiency. Across the entire model the shared experts take about 6G of parameters and can easily fit on a GPU. So experts (96G) + shared experts (6G) bring the model parameters to 102G. The rest of the tensors add another 6G, bringing the whole thing to 108G. Thus shared + rest = 12G of parameters, which offloads easily to a 12GB GPU with a lot of room left over when quantizing to <4 bits/param. A command along the lines of the one below will pattern-match all the experts and send them to CPU:
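A plausible form of that override (the regex is an assumption; the intent is to match the routed ffn_gate_exps / ffn_up_exps / ffn_down_exps tensors while leaving the shexp shared-expert tensors on GPU):

    # "_exps" matches the routed experts only; the shared-expert "shexp" tensors do not match
    -ot "ffn_.*_exps=CPU"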
Now load the model with -ngl 99, and the 6G of shared-expert parameters plus the 6G of other parameters should easily fit on a single GPU. On my 4070 with a q8_0 KV cache there is enough VRAM left over for a 45k-token KV cache (using my Q2_K_H hybrid quant https://huggingface.co/steampunque/Llama-4-Scout-17B-16E-Instruct-GGUF/resolve/main/Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf to really push down the size of the model itself). Using this I get 8.9 t/s with a single GPU, about 1 to 2 t/s slower than if I offload layers to 3 RPC machines. The huge advantage of this approach is that I don't need RPC at all: I have a 108B-parameter model running on one commodity 4070 and supporting a 45k KV cache. The bottleneck in generation speed for me is that I only have a 9900K CPU with DDR4 memory, so it doesn't have a lot of memory bandwidth, but it only needs to effectively evaluate 6G of parameters per token on the CPU, which is not too bad. So I think this is actually the best way to run the model, instead of trying to offload it to a bunch of GPUs with RPC or otherwise.
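Putting it together, the full invocation would look something like this (the context size, KV-cache flags, and flash attention flag are my assumptions, not a copy of the original command):

    llama-server \
        -m Llama-4-Scout-17B-16E-Instruct.Q2_K_H.gguf \
        -ngl 99 \
        -ot "ffn_.*_exps=CPU" \
        -c 45056 \
        -fa \
        -ctk q8_0 -ctv q8_0
    # routed experts stay on CPU; shared experts, attention, and other tensors go to the GPU
    # q8_0 KV cache (with flash attention) to fit a ~45k context in the remaining 12GB of VRAM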
-
llama-server / llama-cli version b5184 (still applies as of the latest build, b5205)
The override-tensor flag has become increasingly popular as a cost-effective way to run Mixture of Experts models, because it enables single-user hybrid CPU/GPU inference on modest hardware (particularly with DeepSeek-style models, whose shared expert can very reliably be targeted at the GPU for a low VRAM overhead).
As an example, on a Ryzen 9950X (192GB 4400MHz RAM) with an RTX 4000 SFF and an RTX 2000 Ada, I'm able to run Maverick at anywhere between a q4 and q6 quant at roughly 10 tokens per second (around 33 t/s prompt processing with --ubatch-size 4) by explicitly leaving all conditional experts on CPU; a sketch of such a command follows below. If I wanted to run a model that feels roughly as intelligent, I'd be looking at probably a 70B Llama 3.3 finetune, which runs at about 1.7 tokens per second (even with an optimized speculative decoding setup). I suspect this may even become something of a meta, similar to the P40 of early consumer LLM deployments, where hobbyists would pick up four P40s for less than the price of a 3090 for single-user local LLM use. A major issue at the moment is that the flag is wildly undocumented, and the only documentation was around the PR of the feature itself:
#11397
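For reference, a minimal sketch of the kind of command described above (the model filename is a placeholder and the regex is an assumption; it keeps the routed experts on CPU while everything else goes to the GPUs):

    llama-server \
        -m Llama-4-Maverick-17B-128E-Instruct-Q4_K_M.gguf \
        -ngl 99 \
        -ot "ffn_.*_exps=CPU" \
        --ubatch-size 4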
I'd like to put forward the argument that, given the release of recent high-profile MoE models such as Scout and Maverick (which will likely become more of a platform than a pair of models, much as Llama 3.x became a platform for finetunes), as well as the Ling MoE and presumably the Qwen 3 MoE series which recently leaked slightly ahead of launch, we're likely to see a lot more cases where somebody wants to be very intentional about the placement of tensors on their available hardware, and understanding how -ot works is a major part of that strategy. To that end, I'm hoping to foster some discussion about this in this thread. I'm not sure it's appropriate to file this as a formal issue, so I've started with a discussion.
Additionally, prior to the fixes for the Llama 4 series there was another major issue with the -ot flag: it broke multi-GPU setups, and left me in a weird situation where my primary GPU would see 19.5GB of usage while my secondary would only see 5GB (possibly it was only being used for prompt processing?). Recent updates have fixed this, but having no documentation on the functionality and usage of the flag makes it difficult to troubleshoot issues like this, and I currently have no idea which change actually fixed multi-GPU usage. I think leaving edge-case functionality to luck is not the best situation for end users, so I kindly hope that contributors will pitch in to this discussion.
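For anyone experimenting with explicit multi-GPU placement, here is a hedged sketch of splitting the routed experts across two GPUs with -ot (this assumes the flag can be repeated, that earlier patterns take precedence, and that CUDA0/CUDA1 are the backend buffer names on your system; the layer split is arbitrary):

    # experts of layers 0-23 on the first GPU, remaining layers' experts on the second
    llama-server -m model.gguf -ngl 99 \
        -ot "blk\.(1?[0-9]|2[0-3])\.ffn_.*_exps=CUDA0" \
        -ot "blk\..*\.ffn_.*_exps=CUDA1"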