printf(" --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit\n");
112
113
printf(" --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing\n");
113
114
printf(" --pure: Disable k-quant mixtures and quantize all tensors to the same type\n");
114
115
printf(" --imatrix file_name: use data in file_name as importance matrix for quant optimizations\n");
115
116
printf(" --include-weights tensor_name: use importance matrix for this/these tensor(s)\n");
116
117
printf(" --exclude-weights tensor_name: use importance matrix for this/these tensor(s)\n");
117
-
printf(" --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor\n");
118
-
printf(" --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor\n");
118
+
printf(" --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor.\n");
119
+
printf(" --token-embedding-type ggml_type: use this ggml_type for the token_embd.weight tensor.\n\n");
120
+
printf("Additional specific tensor quantization types used in the custom quant scheme 'CQS (default is Q2_K):\n");
121
+
printf(" --attn-q-type ggml_type: use this ggml_type for the attn_q.weight tensor.\n");
122
+
printf(" --attn-k-type ggml_type: use this ggml_type for the attn_k.weight tensor.\n");
123
+
printf(" --attn-v-type ggml_type: use this ggml_type for the attn_v.weight tensor.\n");
124
+
printf(" --attn-qkv-type ggml_type: use this ggml_type for the attn_qkv.weight tensor.\n");
125
+
printf(" --attn-output-type ggml_type: use this ggml_type for the attn_output.weight tensor.\n");
126
+
printf(" --ffn-gate-type ggml_type: use this ggml_type for the ffn_gate tensor.\n");
127
+
printf(" --ffn-down-type ggml_type: use this ggml_type for the ffn_down tensor.\n");
128
+
printf(" --ffn-up-type ggml_type: use this ggml_type for the ffn_up tensor.\n\n");
    printf(" --keep-split: will generate quantized model in the same shards as input\n");
    printf(" --override-kv KEY=TYPE:VALUE\n");
-   printf("     Advanced option to override model metadata by key in the quantized model. May be specified multiple times.\n");
+   printf("     Advanced option to override model metadata by key in the quantized model. May be specified multiple times.\n\n");
    printf("Note: --include-weights and --exclude-weights cannot be used together\n");
+   printf("Note: The token embeddings tensor is loaded in system RAM, even in the case of full GPU/VRAM offload.\n");
+   printf("Note: The recommended type for the output tensor is q6_K for the ffn types > iq3_xxs and < q8_0.\n\n");
+   printf("Note for the Custom Quant Scheme FTYPE:\n");
+   printf(" Write the specific tensor legacy quants as qN_N, the K-Quants as qN_K, the IQ-Quants as iqN_xx.\n");
+   printf(" Usually, attn-q-type can be one type below the chosen ffn type, and attn-v-type should be one type above.\n");
+   printf(" attn-qkv-type replaces the types attn-q, attn-k and attn-v on some models.\n");
+   // TODO: eventually harmonize the CAPS writing of the FTYPEs and the non-CAPS writing of the GGML_TYPEs.
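For reference, a minimal invocation sketch using the flags documented above, following the notes (output tensor at q6_K, attn-v one type above the chosen ffn type). The binary name, file names, thread count, the chosen ggml_types, and passing CQS as the positional ftype are illustrative assumptions, not taken from this change:

    # Illustrative only: paths, type choices, and the CQS positional ftype are assumptions.
    ./llama-quantize --imatrix imatrix.dat \
        --attn-v-type q5_K --ffn-down-type q4_K --ffn-up-type q4_K --output-tensor-type q6_K \
        model-f16.gguf model-CQS.gguf CQS 8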