Mention ICML accept and PV-Tuning (Vahe1994#100)
This PR makes several changes to the README.md file:
* mentions AQLM's acceptance at ICML'2024
* mentions PV-Tuning in the header and references the pv-tuning code
* lists several pre-quantized AQLM models fine-tuned with PV-Tuning
* fixes several minor typos
justheuristic authored Jun 6, 2024

Official PyTorch implementation for [Extreme Compression of Large Language Models via Additive Quantization](https://arxiv.org/pdf/2401.06118.pdf)

**[2024.05]** AQLM was accepted to [ICML'2024](https://icml.cc/Conferences/2024)! If you're attending, meet us around [this poster](https://icml.cc/virtual/2024/poster/34964).

**[2024.06]** There is a more effective way to tune quantized models with [PV-Tuning](https://arxiv.org/abs/2405.14852). We are releasing PV-tuned AQLM models [**in this collection**](https://huggingface.co/collections/ISTA-DASLab/aqlmpv-66564dff5d84f00a893ba93f), and the code is in the [pv-tuning branch](https://github.com/Vahe1994/AQLM/tree/pv-tuning). We will merge the pv-tuning code into main after several technical improvements.

## Inference

### Demo
The models reported below use **full model fine-tuning** as described in the appendix.

We provide a number of prequantized models:

| Model | AQLM scheme | WikiText-2 PPL | MMLU (5-shot) FP16→AQLM | Model size, GB | Hub link |
|------------|-------------|----------------|---------------|----------------|--------------------------------------------------------------------------|
| Llama-3-8b | 1x16 | - | 0.65→0.56 | 4.1 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-AQLM-2Bit-1x16) |
| Llama-3-8b-Instruct | 1x16 | - | 0.66→0.59 | 4.1 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16) |
| gemma-2b | 1x16 | - | - | 1.7 | [Link](https://huggingface.co/ISTA-DASLab/gemma-2b-AQLM-2Bit-1x16-hf)|
| gemma-2b | 2x8 | - | - | 1.6 | [Link](https://huggingface.co/ISTA-DASLab/gemma-2b-AQLM-2Bit-2x8-hf)|

You can also download AQLM models tuned via PV-tuning:

| Model | AQLM scheme | WikiText-2 PPL | Model size, GB | Hub link |
|------------|-------------|----------------|----------------|--------------------------------------------------------------------------|
| Llama-2-7b | 1x16g8 | 5.68 | 2.4 | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-1x16-hf) |
| Llama-2-7b | 2x8g8 | 5.90 | 2.2 | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-2x8-hf) |
| Llama-2-7b | 1x16g16 | 9.21 | 1.7 | [Link](https://huggingface.co/justheuristic/Llama-2-7b-AQLM-PV-1Bit-1x16-hf) |
| Llama-2-13b| 1x16g8 | 5.05 | 4.1 | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-13b-AQLM-PV-2Bit-1x16-hf)|
| Llama-2-70b| 1x16g8 | 3.78 | 18.8 | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-70b-AQLM-PV-2Bit-1x16-hf)|
| Meta-Llama-3-8B | 1x16g8 | 6.99 | 4.1 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-AQLM-PV-2Bit-1x16) |
| Meta-Llama-3-8B | 1x16g16 | 9.43 | 3.9 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-AQLM-PV-1Bit-1x16) |
| Meta-Llama-3-70B | 1x16g8 | 4.57 | 21.9 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16)|
| Meta-Llama-3-70B | 1x16g16 | 8.67 | 13 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16)|
| Mistral-7B-v0.1 | 1x16g8 | 5.22 | 2.51 | [Link](https://huggingface.co/ISTA-DASLab/Mistral-7B-v0.1-AQLM-PV-2Bit-1x16-hf) |
| Phi-3-mini-4k-instruct | 1x16g8 | 6.63 | 1.4 | [Link](https://huggingface.co/ISTA-DASLab/Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf) |



Note that models with "g16" in their scheme require the `aqlm` inference library v1.1.6 or newer: `pip install "aqlm[gpu,cpu]>=1.1.6"`
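
For convenience, here is a minimal loading sketch using `transformers`. It assumes `aqlm` and a recent `transformers` release are installed, and the model id below is just one example checkpoint from the tables above:

```python
# Minimal loading sketch (assumes `pip install "aqlm[gpu]>=1.1.6"` and a recent transformers).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-1x16-hf"  # any checkpoint from the tables above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("The key idea of additive quantization is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```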

The perplexity above is evaluated at a **4k** context length for Llama 2 models and **8k** for Mistral/Mixtral and Llama 3.
Please also note that token-level perplexity can only be compared within the same model family; it should not be compared between models that use different vocabularies.
Mistral has a lower perplexity than Llama 3 8B, but this does not mean Mistral is better: Llama 3's perplexity is computed over a much larger vocabulary, which inflates its per-token perplexity.

For more evaluation results and detailed explanations, please see our papers: [Egiazarian et al. (2024)](https://arxiv.org/abs/2401.06118) for pure AQLM and [Malinovskii et al. (2024)](https://arxiv.org/abs/2405.14852) for PV-Tuned models.
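
For illustration, the following is a minimal sketch of how a fixed-context-length WikiText-2 perplexity of this kind can be computed with `transformers` and `datasets`. It is not the repository's exact evaluation script, and the model id and context length are example choices:

```python
# Sketch of non-overlapping fixed-context WikiText-2 perplexity (illustrative only).
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-1x16-hf"  # example checkpoint from the table above
context_len = 4096  # 4k for Llama 2; use 8192 for Llama 3 / Mistral / Mixtral

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
model.eval()

# Concatenate the raw test split into one long string, as is common for WikiText-2 PPL.
test_text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(test_text, return_tensors="pt").input_ids

nll_sum, token_count = 0.0, 0
with torch.no_grad():
    # Non-overlapping chunks of `context_len` tokens; each chunk yields len-1 next-token predictions.
    for start in range(0, ids.shape[1] - 1, context_len):
        chunk = ids[:, start : start + context_len].to(model.device)
        loss = model(chunk, labels=chunk).loss  # mean next-token NLL over this chunk
        n_predictions = chunk.shape[1] - 1
        nll_sum += loss.item() * n_predictions
        token_count += n_predictions

print(f"WikiText-2 perplexity: {math.exp(nll_sum / token_count):.2f}")
```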

### Inference kernels

To load it, use:
```python
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="Vahe1994/AQLM", filename="data/name.pth", repo_type="dataset")
```

To use the downloaded data from HF, place it in the `data` folder (optional) and set the correct path to it via the `--dataset` argument of `main.py`.
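
As a small illustration, `hf_hub_download` returns the local cache path of the downloaded file, and that path is what goes into `--dataset` (the filename below is the same placeholder as in the snippet above):

```python
# Sketch: hf_hub_download returns the local path of the cached file; that path
# (or a copy placed under ./data) can be passed to main.py via --dataset.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(repo_id="Vahe1994/AQLM", filename="data/name.pth", repo_type="dataset")
print(local_path)  # pass this path via --dataset when running main.py
```
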
There are additional hyperparameters available. Run `python main.py --help` for more details.

### Finetuning

**Note:** this code will only fine-tune continuous parameters. To fine-tune both continuous and discrete parameters, please switch to the [pv-tuning](https://github.com/Vahe1994/AQLM/tree/pv-tuning) branch and follow the instructions in its README.

The accuracy of the quantized model can be further improved via block finetuning. First, the logits
of the float16/bfloat16 model are cached in RAM. Then the differentiable parameters of the quantized model
are optimized to minimize KL divergence with the teacher logits. Typically, we use the same calibration data that was used for model quantization.
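
The following is a minimal sketch of that distillation objective, not the repository's actual fine-tuning code; the tensors are random stand-ins for the quantized model's logits and the cached teacher logits:

```python
# Illustrative sketch of the KL-to-teacher loss used for block finetuning.
import torch
import torch.nn.functional as F


def kl_to_teacher(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over all token positions.

    Both tensors have shape [batch, seq_len, vocab_size].
    """
    vocab_size = student_logits.shape[-1]
    return F.kl_div(
        F.log_softmax(student_logits.reshape(-1, vocab_size), dim=-1),
        F.log_softmax(teacher_logits.reshape(-1, vocab_size), dim=-1),
        log_target=True,
        reduction="batchmean",  # sums over vocab, averages over token positions
    )


# Toy demonstration with random logits (stand-ins for real model outputs).
student = torch.randn(2, 8, 32000, requires_grad=True)
teacher = torch.randn(2, 8, 32000)
loss = kl_to_teacher(student, teacher)
loss.backward()  # gradients flow into the student logits (in practice, into the quantized model's continuous parameters)
print(f"KL loss: {loss.item():.4f}")
```
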
Main CLI arguments:
- `--finetune_dtype` - which dtype should be used for fine-tuning. By default `float32`.
- `--amp` - whether to use AMP during fine-tuning. Requires `--finetune_dtype=float32`.

For larger models, one would need multi-GPU training. At the moment, FSDP training is not implemented, and the model is fine-tuned in a single process with parameters sharded across the available devices.


### Zero-shot benchmarks via LM Evaluation Harness

If you found this work useful, please consider citing:
```
@misc{egiazarian2024extreme,
title={Extreme Compression of Large Language Models via Additive Quantization},
author={Vage Egiazarian and Andrei Panferov and Denis Kuznedelev and Elias Frantar and Artem Babenko and Dan Alistarh},
year={2024},
eprint={2401.06118},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{malinovskii2024pvtuning,
title={PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression},
author={Vladimir Malinovskii and Denis Mazur and Ivan Ilin and Denis Kuznedelev and Konstantin Burlachenko and Kai Yi and Dan Alistarh and Peter Richtarik},
year={2024},
eprint={2405.14852},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
