@@ -132,22 +132,10 @@ GGUF model with the option `--load-gguf ${MODELNAME}.gguf`. Presently,
the F16, F32, Q4_0, and Q6_K formats are supported and converted into
native torchchat models.

-You may also dequantize GGUF models with the GGUF quantize tool, and
-then load and requantize with torchchat native quantization options.
-
| GGUF Model | Tested | Eager | torch.compile | AOT Inductor | ExecuTorch | Fits on Mobile |
| -----| --------| -------| -----| -----| -----| -----|
| llama-2-7b.Q4_0.gguf | 🚧 | 🚧 | 🚧 | 🚧 | 🚧 |

-You may also dequantize GGUF models with the GGUF quantize tool, and
-then load and requantize with torchchat native quantization options.
-
-**Please note that quantizing and dequantizing is a lossy process, and
-you will get the best results by starting with the original
-unquantized model checkpoint, not a previously quantized and then
-dequantized model.**
-
-
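As a quick illustration of the GGUF loading path described above, the following sketch assumes the `generate` entry point and placeholder file names; only the `--load-gguf` option and the supported formats are taken from this section:

```bash
# Placeholder model name; any of the supported GGUF formats
# (F16, F32, Q4_0, Q6_K) can be loaded this way.
MODELNAME=llama-2-7b.Q4_0

# Load the GGUF checkpoint, convert it to a native torchchat model,
# and run generation (entry point and prompt are illustrative).
python3 torchchat.py generate \
  --load-gguf ${MODELNAME}.gguf \
  --prompt "Hello, my name is"
```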
## Conventions used in this document

We use several variables in this example, which may be set as a
@@ -232,7 +220,7 @@ submission guidelines.)

Torchchat supports several devices. You may also let torchchat use
heuristics to select the best device from those available using
-torchchat's virtual device named `fast`.
+torchchat's virtual device named `fast`.

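For example, a sketch of selecting the virtual device; the `generate` subcommand, model alias, and prompt are illustrative assumptions, while the `fast` device name comes from the paragraph above:

```bash
# Ask torchchat to pick the best available backend (CPU, GPU, etc.)
# through the virtual device named `fast`. Alias and prompt are placeholders.
python3 torchchat.py generate llama3 \
  --device fast \
  --prompt "Write a haiku about quantization"
```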
Torchchat supports execution using several floating-point datatypes.
Please note that the selection of execution floating point type may
@@ -398,9 +386,9 @@ linear operator (asymmetric) with GPTQ | n/a | 4b (group) | n/a |
linear operator (asymmetric) with HQQ | n/a | work in progress | n/a |

## Model precision (dtype precision setting)
-On top of quantizing models with quantization schemes mentioned above, models can be converted
-to lower precision floating point representations to reduce the memory bandwidth requirement and
-take advantage of higher density compute available. For example, many GPUs and some of the CPUs
+On top of quantizing models with quantization schemes mentioned above, models can be converted
+to lower precision floating point representations to reduce the memory bandwidth requirement and
+take advantage of higher density compute available. For example, many GPUs and some of the CPUs
have good support for bfloat16 and float16. This can be taken advantage of via `--dtype arg` as shown below.

[skip default]: begin
@@ -439,30 +427,6 @@ may dequantize them using GGUF tools, and then load the model into
torchchat to quantize with torchchat's quantization workflow.)


-## Loading unsupported GGUF formats in torchchat
-
-GGUF formats not presently supported natively in torchchat may be
-converted to one of the supported formats with GGUF's
-`${GGUF}/quantize` utility to be loaded in torchchat. If you convert
-to the FP16 or FP32 formats with GGUF's `quantize` utility, you may
-then requantize these models with torchchat's quantization workflow.
-
-**Note that quantizing and dequantizing is a lossy process, and you will
-get the best results by starting with the original unquantized model
-checkpoint, not a previously quantized and then dequantized
-model.** Thus, while you can convert your q4_1 model to FP16 or FP32
-GGUF formats and then requantize, you might get better results if you
-start with the original FP16 or FP32 GGUF format.
-
-To use the quantize tool, install the GGML tools at ${GGUF}. Then,
-you can, for example, convert a quantized model to f16 format:
-
-[end default]: end
-```
-${GGUF}/quantize --allow-requantize your_quantized_model.gguf fake_unquantized_model.gguf f16
-```
-
-
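For reference, a sketch of the dequantize-then-requantize flow noted above. The first command mirrors the `quantize` invocation removed in this change and assumes the GGML tools are installed at `${GGUF}`; the torchchat step and its `--quantize` configuration are illustrative assumptions:

```bash
# Dequantize an unsupported GGUF format to F16 (lossy; best results come
# from starting with the original unquantized checkpoint).
${GGUF}/quantize --allow-requantize your_quantized_model.gguf fake_unquantized_model.gguf f16

# Hypothetical follow-up: load the F16 GGUF into torchchat and requantize
# with torchchat's native quantization options (flag values are placeholders).
python3 torchchat.py generate \
  --load-gguf fake_unquantized_model.gguf \
  --quantize quant_config.json \
  --prompt "Hello"
```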
## Optimizing your model for server, desktop and mobile devices

While we have shown the export and execution of a small model on CPU