@@ -132,22 +132,10 @@ GGUF model with the option `--load-gguf ${MODELNAME}.gguf`. Presently,
the F16, F32, Q4_0, and Q6_K formats are supported and converted into
native torchchat models.

-You may also dequantize GGUF models with the GGUF quantize tool, and
-then load and requantize with torchchat native quantization options.
-
| GGUF Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
| -----| --------| -------| -----| -----| -----| -----|
| llama-2-7b.Q4_0.gguf | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 |

-You may also dequantize GGUF models with the GGUF quantize tool, and
-then load and requantize with torchchat native quantization options.
-
-**Please note that quantizing and dequantizing is a lossy process, and
-you will get the best results by starting with the original
-unquantized model checkpoint, not a previously quantized and then
-dequantized model.**
-
-
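As a quick illustration of the GGUF loading path described above, the following sketch assumes the `generate` entry point and placeholder file names; only the `--load-gguf` option and the supported formats are taken from this section:

```bash
# Placeholder model name; any of the supported GGUF formats
# (F16, F32, Q4_0, Q6_K) can be loaded this way.
MODELNAME=llama-2-7b.Q4_0

# Load the GGUF checkpoint, convert it to a native torchchat model,
# and run generation (entry point and prompt are illustrative).
python3 torchchat.py generate \
  --load-gguf ${MODELNAME}.gguf \
  --prompt "Hello, my name is"
```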
## Conventions used in this document

We use several variables in this example, which may be set as a
@@ -232,7 +220,7 @@ submission guidelines.)

Torchchat supports several devices. You may also let torchchat use
heuristics to select the best device from those available using
-torchchat's virtual device named `fast`.
+torchchat's virtual device named `fast`.

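For example, a sketch of selecting the virtual device; the `generate` subcommand, model alias, and prompt are illustrative assumptions, while the `fast` device name comes from the paragraph above:

```bash
# Ask torchchat to pick the best available backend (CPU, GPU, etc.)
# through the virtual device named `fast`. Alias and prompt are placeholders.
python3 torchchat.py generate llama3 \
  --device fast \
  --prompt "Write a haiku about quantization"
```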
Torchchat supports execution using several floating-point datatypes.
Please note that the selection of execution floating point type may
@@ -398,9 +386,9 @@ linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |

## Model precision (dtype precision setting)
-On top of quantizing models with quantization schemes mentioned above, models can be converted
-to lower precision floating point representations to reduce the memory bandwidth requirement and
-take advantage of higher density compute available. For example, many GPUs and some of the CPUs
+On top of quantizing models with quantization schemes mentioned above, models can be converted
+to lower precision floating point representations to reduce the memory bandwidth requirement and
+take advantage of higher density compute available. For example, many GPUs and some of the CPUs
have good support for bfloat16 and float16. This can be taken advantage of via `--dtype arg` as shown below.

[skip default]: begin
@@ -439,30 +427,6 @@ may dequantize them using GGUF tools, and then load the model into
torchchat to quantize with torchchat's quantization workflow.)


-## Loading unsupported GGUF formats in torchchat
-
-GGUF formats not presently supported natively in torchchat may be
-converted to one of the supported formats with GGUF's
-`${GGUF}/quantize` utility to be loaded in torchchat. If you convert
-to the FP16 or FP32 formats with GGUF's `quantize` utility, you may
-then requantize these models with torchchat's quantization workflow.
-
-**Note that quantizing and dequantizing is a lossy process, and you will
-get the best results by starting with the original unquantized model
-checkpoint, not a previously quantized and then dequantized
-model.** Thus, while you can convert your q4_1 model to FP16 or FP32
-GGUF formats and then requantize, you might get better results if you
-start with the original FP16 or FP32 GGUF format.
-
-To use the quantize tool, install the GGML tools at ${GGUF}. Then,
-you can, for example, convert a quantized model to f16 format:
-
-[end default]: end
-```
-${GGUF}/quantize --allow-requantize your_quantized_model.gguf fake_unquantized_model.gguf f16
-```
-
-
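For reference, a sketch of the dequantize-then-requantize flow noted above. The first command mirrors the `quantize` invocation removed in this change and assumes the GGML tools are installed at `${GGUF}`; the torchchat step and its `--quantize` configuration are illustrative assumptions:

```bash
# Dequantize an unsupported GGUF format to F16 (lossy; best results come
# from starting with the original unquantized checkpoint).
${GGUF}/quantize --allow-requantize your_quantized_model.gguf fake_unquantized_model.gguf f16

# Hypothetical follow-up: load the F16 GGUF into torchchat and requantize
# with torchchat's native quantization options (flag values are placeholders).
python3 torchchat.py generate \
  --load-gguf fake_unquantized_model.gguf \
  --quantize quant_config.json \
  --prompt "Hello"
```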
## Optimizing your model for server, desktop and mobile devices

While we have shown the export and execution of a small model on CPU