Quantize: specify each major tensor quant in CLI for common LLMs
This PR simply replicates the per-tensor custom quantization CLI feature introduced by Ikawrakow for the token embeddings and output tensors in ggml-org#6239, extending it to:
- attn_q.weight
- attn_k.weight
- attn_v.weight
- attn_qkv.weight
- attn_output.weight
- ffn_gate
- ffn_down
- ffn_up
This is meant to let llama.cpp users easily tailor their chosen quant strategy to their needs, but also to let GPU users easily requantize a quant that is "a bit too big" for their VRAM.
For example, a nice Miqu 70b Q5_K_M (for which no FP16 weights are available beyond dequants of the Q5_K_M) falls just short of fitting in the VRAM of a pair of 3090s.
And if one is French, like me, Miqu is likely one of their main local models.
Requantizing the Q5_K_M into... Q5_K_M, but with all the ffn_down and attn_v.weight tensors specified as Q5_K and the attn_q.weight specified as Q4_K, can save approximately 1.5GB without degrading quality too much.
That means 1.3-1.4GB for additional context (yummy with FA and KV cache) and, say, 100-200MB of additional compute cache with a reasonable BLAS batch size in MMQ.
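Concretely, with the new flags and hypothetical file names (the imatrix being optional here since the source is already quantized), that recipe could look like: `llama-quantize --allow-requantize --attn-q-type q4_K --attn-v-type q5_K --ffn-down-type q5_K miqu-1-70b-Q5_K_M.gguf miqu-1-70b-custom-Q5_K_M.gguf Q5_K_M`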
Moreover, the unspecified tensors won't be requantized at all: llama.cpp simply copies a tensor rather than requantizing it when the type picked by the chosen strategy is the same as the source type (a simplified sketch of that decision follows below).
So one keeps Miqu's original quant of those tensors rather than a dequant/requant.
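For reference, a minimal sketch of that behavior (assuming the current llama.cpp quantization loop, not verbatim code):

```cpp
#include "ggml.h"

// Simplified sketch, not the verbatim implementation: a tensor only needs to be
// re-encoded when the type chosen by the quant strategy differs from its current
// type; otherwise llama-quantize copies its data into the output file unchanged.
static bool needs_requant(const struct ggml_tensor * tensor, enum ggml_type new_type) {
    return tensor->type != new_type;   // same type -> plain copy, no quality loss
}
```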
And that's just an example.
I think that many llama.cpp users could enjoy this feature for their own needs, even though it remains quite basic:
This PR doesn't support hybrid quantization of a tensor (for example, putting a fraction of the layers, from layer 0 onwards, in the higher quant, or the "more_bits" calculation devised by Ikawrakow to create intervals of different quants, e.g. 1 layer out of every 3 quantized with the superior type).
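For illustration only (this is not implemented by the PR, and the helper below is hypothetical), such an interval-based hybrid scheme could look roughly like this:

```cpp
#include "ggml.h"

// Hypothetical helper, NOT part of this PR: pick a per-layer type so that
// 1 layer out of every 3 gets the superior quant and the rest keep the base quant.
static enum ggml_type hybrid_layer_type(int i_layer, enum ggml_type base, enum ggml_type upper) {
    return (i_layer % 3 == 0) ? upper : base;
}

// Usage idea: hybrid_layer_type(il, GGML_TYPE_Q5_K, GGML_TYPE_Q6_K) for the ffn_down layers.
```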
CLI example: `llama-quantize --allow-requantize --imatrix Q:\iMatrix\Sheared\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.iMatrix_Wiki_c32_ch500.dat --output-tensor-type q4_0 --token-embedding-type q4_0 --attn-q-type q4_0 --attn-k-type q4_0 --attn-v-type q4_0 --attn-output-type q4_0 --ffn-gate-type q4_0 --ffn-down-type q4_0 --ffn-up-type q4_0 D:\text-generation-webui\models\Q8_0\princeton-nlp_Sheared-LLaMA-2.7B-AR-b1924-Q8_0.gguf D:\text-generation-webui\models\princeton-nlp_Sheared-LLaMA-2.7B-AR-b228N.iMatrix_Wiki_c32_ch500-Q5_K_M.gguf Q5_K_M` — a full q4_0 quant equivalent to a pure quant, but specified tensor by tensor.
printf(" --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit\n");
100
100
printf(" --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing\n");
101
101
printf(" --pure: Disable k-quant mixtures and quantize all tensors to the same type\n");
102
102
printf(" --imatrix file_name: use data in file_name as importance matrix for quant optimizations\n");
103
103
printf(" --include-weights tensor_name: use importance matrix for this/these tensor(s)\n");
104
-
printf(" --exclude-weights tensor_name: use importance matrix for this/these tensor(s)\n");
105
-
printf(" --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor\n");
106
-
printf(" --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor\n");
107
-
printf(" --keep-split: will generate quatized model in the same shards as input");
104
+
printf(" --exclude-weights tensor_name: use importance matrix for this/these tensor(s)\n\n");
105
+
printf(" Optional specific tensor quantization types to amend the selected quantization strategy type:\n");
106
+
printf(" --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor.\n");
107
+
printf(" --token-embedding-type ggml_type: use this ggml_type for the token_embd.weight tensor.\n");
108
+
printf(" --attn-q-type ggml_type: use this ggml_type for the attn_q.weight tensor.\n");
109
+
printf(" --attn-k-type ggml_type: use this ggml_type for the attn_k.weight tensor.\n");
110
+
printf(" --attn-v-type ggml_type: use this ggml_type for the attn_v.weight tensor.\n");
111
+
printf(" --attn-qkv-type ggml_type: use this ggml_type for the attn_qkv.weight tensor.\n");
112
+
printf(" --attn-output-type ggml_type: use this ggml_type for the attn_output.weight tensor.\n");
113
+
printf(" --ffn-gate-type ggml_type: use this ggml_type for the ffn_gate tensor.\n");
114
+
printf(" --ffn-down-type ggml_type: use this ggml_type for the ffn_down tensor.\n");
115
+
printf(" --ffn-up-type ggml_type: use this ggml_type for the ffn_up tensor.\n\n");
116
+
printf(" --keep-split: will generate quatized model in the same shards as input\n");
108
117
printf(" --override-kv KEY=TYPE:VALUE\n");
109
-
printf(" Advanced option to override model metadata by key in the quantized model. May be specified multiple times.\n");
118
+
printf(" Advanced option to override model metadata by key in the quantized model. May be specified multiple times.\n\n");
110
119
printf("Note: --include-weights and --exclude-weights cannot be used together\n");
120
+
printf("Note: The token embeddings tensor is loaded in system RAM, even in case of full GPU/VRAM offload.\n");
121
+
printf("Note: The recommanded type for the output tensor is q6_K for the ffn types > iq3_xxs and < q8_0.\n");
122
+
printf("Note: Usually, attn-q-type can be one type below the chosen ffn type, and attn-v-type should be one type above.\n");
123
+
printf("Note: --attn-qkv-type replaces the types attn-q, attn-k, and attn-v on some models.\n");
124
+
printf("Note: Write the specific tensor legacy quants as qN_N, the K-Quants as qN_K, the IQ-Quants as iqN_xx.\n");
125
+
//TODO: - eventually - harmonize the CAPS writing of the FTYPEs, and non CAPS writing of the GGML_TYPEs.