CLI - Specify GGML_TYPE to quantize for the main tensors. (#91)
To complement the existing token_embd.weight and output.weight overrides, a quantization type can now be specified for each of the following tensors (an example invocation follows the list):
attn_q.weight
attn_k.weight
attn_v.weight
attn_output.weight
attn_qkv.weight
ffn_gate
ffn_down
ffn_up
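For example, a custom-scheme run could look like this (a hedged sketch: the binary name, file names, and argument order follow the usual llama.cpp quantize tool that this PR extends, and are placeholders rather than output taken from the PR):

llama-quantize --output-tensor-type q6_K --attn-v-type q5_K --ffn-gate-type q4_K --ffn-down-type q4_K --ffn-up-type q4_K model-f16.gguf model-CQS.gguf CQS

Here q6_K for output.weight and an attn_v type one step above the q4_K ffn types follow the recommendations printed in the help text below.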
printf(" --allow-requantize: Allows requantizing tensors that have already been quantized. Warning: This can severely reduce quality compared to quantizing from 16bit or 32bit\n");
114
114
printf(" --leave-output-tensor: Will leave output.weight un(re)quantized. Increases model size but may also increase quality, especially when requantizing\n");
115
115
printf(" --pure: Disable k-quant mixtures and quantize all tensors to the same type\n");
116
116
printf(" --imatrix file_name: use data in file_name as importance matrix for quant optimizations\n");
117
117
printf(" --include-weights tensor_name: use importance matrix for this/these tensor(s)\n");
118
118
printf(" --exclude-weights tensor_name: use importance matrix for this/these tensor(s)\n");
119
-
printf(" --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor\n");
120
-
printf(" --token-embedding-type ggml_type: use this ggml_type for the token embeddings tensor\n");
121
-
printf(" --keep-split: will generate quatized model in the same shards as input");
119
+
printf(" --output-tensor-type ggml_type: use this ggml_type for the output.weight tensor.\n");
120
+
printf(" --token-embedding-type ggml_type: use this ggml_type for the token_embd.weight tensor.\n\n");
121
+
printf("Additional specific tensor quantization types used in the custom quant scheme 'CQS (default is Q2_K):\n");
122
+
printf(" --attn-q-type ggml_type: use this ggml_type for the attn_q.weight tensor.\n");
123
+
printf(" --attn-k-type ggml_type: use this ggml_type for the attn_k.weight tensor.\n");
124
+
printf(" --attn-v-type ggml_type: use this ggml_type for the attn_v.weight tensor.\n");
125
+
printf(" --attn-qkv-type ggml_type: use this ggml_type for the attn_qkv.weight tensor.\n");
126
+
printf(" --attn-output-type ggml_type: use this ggml_type for the attn_output.weight tensor.\n");
127
+
printf(" --ffn-gate-type ggml_type: use this ggml_type for the ffn_gate tensor.\n");
128
+
printf(" --ffn-down-type ggml_type: use this ggml_type for the ffn_down tensor.\n");
129
+
printf(" --ffn-up-type ggml_type: use this ggml_type for the ffn_up tensor.\n\n");
130
+
printf(" --keep-split: will generate quantized model in the same shards as input\n");
122
131
printf(" --override-kv KEY=TYPE:VALUE\n");
123
-
printf(" Advanced option to override model metadata by key in the quantized model. May be specified multiple times.\n");
132
+
printf(" Advanced option to override model metadata by key in the quantized model. May be specified multiple times.\n\n");
124
133
printf("Note: --include-weights and --exclude-weights cannot be used together\n");
134
+
printf("Note: The token embeddings tensor is loaded in system RAM, even in case of full GPU/VRAM offload.\n");
135
+
printf("Note: The recommanded type for the output tensor is q6_K for the ffn types > iq3_xxs and < q8_0.\n\n");
136
+
printf("Note for the Custom Quant Scheme FTYPE:\n");
137
+
printf(" Write the specific tensor legacy quants as qN_N, the K-Quants as qN_K, the IQ-Quants as iqN_xx.\n");
138
+
printf(" Usually, attn-q-type can be one type below the chosen ffn type, and attn-v-type should be one type above.\n");
139
+
printf(" attn-qkv-type replaces the types attn-q, attn-k and attn-v on some models.\n");
140
+
//TODO: - eventually - harmonize the CAPS writing of the FTYPEs, and non CAPS writing of the GGML_TYPEs.
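The help text asks users to spell type names as qN_N, qN_K, or iqN_xx. A minimal sketch of how such a name can be resolved to a ggml_type, assuming only ggml's public ggml_type_name() and the GGML_TYPE_COUNT enumerator (the helper name parse_ggml_type_name and its sentinel convention are illustrative, not necessarily the PR's code):

#include <strings.h> // strcasecmp
#include "ggml.h"

// Illustrative helper: map a user-supplied name such as "q6_K" or "iq3_xxs"
// to the corresponding ggml_type by scanning ggml's own type-name table.
// Returns GGML_TYPE_COUNT as a sentinel for an unknown name, so the caller
// can print the usage text and abort.
static enum ggml_type parse_ggml_type_name(const char * str) {
    for (int i = 0; i < GGML_TYPE_COUNT; ++i) {
        const char * name = ggml_type_name((enum ggml_type) i);
        // case-insensitive compare, so "q6_K", "Q6_K" and "q6_k" all match;
        // this also sidesteps the CAPS/non-CAPS inconsistency noted in the TODO
        if (name && strcasecmp(name, str) == 0) {
            return (enum ggml_type) i;
        }
    }
    return GGML_TYPE_COUNT;
}

Each --attn-*-type / --ffn-*-type flag would store the parsed type, and the quantizer would consult it when it reaches the matching tensor name.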