Mention ICML accept and PV-Tuning (Vahe1994#100)
This PR makes several changes to the README.md file:
* mentions AQLM's acceptance at ICML'2024
* mentions PV-Tuning in the header and references the pv-tuning code
* lists several pre-quantized AQLM models fine-tuned with PV-Tuning
* fixes several minor typos
justheuristic authored Jun 6, 2024

Official PyTorch implementation for [Extreme Compression of Large Language Models via Additive Quantization](https://arxiv.org/pdf/2401.06118.pdf)

**[2024.05]** AQLM was accepted to [ICML'2024](https://icml.cc/Conferences/2024)! If you're attending, meet us around [this poster](https://icml.cc/virtual/2024/poster/34964).

**[2024.06]** There is a more effective way to tune quantized models with [PV-Tuning](https://arxiv.org/abs/2405.14852). We are releasing PV-tuned AQLM models [**in this collection**](https://huggingface.co/collections/ISTA-DASLab/aqlmpv-66564dff5d84f00a893ba93f), and the code is in the [pv-tuning branch](https://github.com/Vahe1994/AQLM/tree/pv-tuning). We will merge the pv-tuning code into main after several technical improvements.

## Inference

### Demo
The models reported below use **full model fine-tuning** as described in the appendix.

We provide a number of prequantized models:

| Model | AQLM scheme | WikiText-2 PPL | MMLU (5-shot) FP16→AQLM | Model size, GB | Hub link |
|------------|-------------|----------------|---------------|----------------|--------------------------------------------------------------------------|
| Llama-3-8b | 1x16 | - | 0.65→0.56 | 4.1 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-AQLM-2Bit-1x16) |
| Llama-3-8b-Instruct | 1x16 | - | 0.66→0.59 | 4.1 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16) |
| gemma-2b | 1x16 | - | - | 1.7 | [Link](https://huggingface.co/ISTA-DASLab/gemma-2b-AQLM-2Bit-1x16-hf)|
| gemma-2b | 2x8 | - | - | 1.6 | [Link](https://huggingface.co/ISTA-DASLab/gemma-2b-AQLM-2Bit-2x8-hf)|

You can also download AQLM models tuned via PV-tuning:

| Model | AQLM scheme | WikiText-2 PPL | Model size, GB | Hub link |
|------------|-------------|----------------|----------------|--------------------------------------------------------------------------|
| Llama-2-7b | 1x16g8 | 5.68 | 2.4 | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-1x16-hf) |
| Llama-2-7b | 2x8g8 | 5.90 | 2.2 | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-2x8-hf) |
| Llama-2-7b | 1x16g16 | 9.21 | 1.7 | [Link](https://huggingface.co/justheuristic/Llama-2-7b-AQLM-PV-1Bit-1x16-hf) |
| Llama-2-13b| 1x16g8 | 5.05 | 4.1 | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-13b-AQLM-PV-2Bit-1x16-hf)|
| Llama-2-70b| 1x16g8 | 3.78 | 18.8 | [Link](https://huggingface.co/ISTA-DASLab/Llama-2-70b-AQLM-PV-2Bit-1x16-hf)|
| Meta-Llama-3-8B | 1x16g8 | 6.99 | 4.1 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-AQLM-PV-2Bit-1x16) |
| Meta-Llama-3-8B | 1x16g16 | 9.43 | 3.9 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-8B-AQLM-PV-1Bit-1x16) |
| Meta-Llama-3-70B | 1x16g8 | 4.57 | 21.9 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16)|
| Meta-Llama-3-70B | 1x16g16 | 8.67 | 13 | [Link](https://huggingface.co/ISTA-DASLab/Meta-Llama-3-70B-AQLM-PV-2Bit-1x16)|
| Mistral-7B-v0.1 | 1x16g8 | 5.22 | 2.51 | [Link](https://huggingface.co/ISTA-DASLab/Mistral-7B-v0.1-AQLM-PV-2Bit-1x16-hf) |
| Phi-3-mini-4k-instruct | 1x16g8 | 6.63 | 1.4 | [Link](https://huggingface.co/ISTA-DASLab/Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf) |



Note that models with "g16" in their scheme require the `aqlm` inference library v1.1.6 or newer: `pip install "aqlm[gpu,cpu]>=1.1.6"`
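
For convenience, here is a minimal loading sketch using `transformers`. It assumes `aqlm` and a recent `transformers` release are installed, and the model id below is just one example checkpoint from the tables above:

```python
# Minimal loading sketch (assumes `pip install "aqlm[gpu]>=1.1.6"` and a recent transformers).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-1x16-hf"  # any checkpoint from the tables above

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("The key idea of additive quantization is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```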

The perplexity above is evaluated at a **4k** context length for Llama 2 models and **8k** for Mistral/Mixtral and Llama 3.
Please also note that token-level perplexity can only be compared within the same model family; it should not be compared between models that use different vocabularies.
Mistral has a lower perplexity than Llama 3 8B, but this does not mean Mistral is better: Llama 3's perplexity is computed over a much larger vocabulary, which inflates its per-token perplexity.

For more evaluation results and detailed explanations, please see our papers: [Egiazarian et al. (2024)](https://arxiv.org/abs/2401.06118) for pure AQLM and [Malinovskii et al. (2024)](https://arxiv.org/abs/2405.14852) for PV-Tuned models.
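
For illustration, the following is a minimal sketch of how a fixed-context-length WikiText-2 perplexity of this kind can be computed with `transformers` and `datasets`. It is not the repository's exact evaluation script, and the model id and context length are example choices:

```python
# Sketch of non-overlapping fixed-context WikiText-2 perplexity (illustrative only).
import math

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ISTA-DASLab/Llama-2-7b-AQLM-PV-2Bit-1x16-hf"  # example checkpoint from the table above
context_len = 4096  # 4k for Llama 2; use 8192 for Llama 3 / Mistral / Mixtral

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
model.eval()

# Concatenate the raw test split into one long string, as is common for WikiText-2 PPL.
test_text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(test_text, return_tensors="pt").input_ids

nll_sum, token_count = 0.0, 0
with torch.no_grad():
    # Non-overlapping chunks of `context_len` tokens; each chunk yields len-1 next-token predictions.
    for start in range(0, ids.shape[1] - 1, context_len):
        chunk = ids[:, start : start + context_len].to(model.device)
        loss = model(chunk, labels=chunk).loss  # mean next-token NLL over this chunk
        n_predictions = chunk.shape[1] - 1
        nll_sum += loss.item() * n_predictions
        token_count += n_predictions

print(f"WikiText-2 perplexity: {math.exp(nll_sum / token_count):.2f}")
```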

### Inference kernels

To load it, use:
```python
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="Vahe1994/AQLM", filename="data/name.pth", repo_type="dataset")
```

To use the downloaded data from HF, place it in the `data` folder (optional) and set the correct path to it via the `--dataset` argument of `main.py`.
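
As a small illustration, `hf_hub_download` returns the local cache path of the downloaded file, and that path is what goes into `--dataset` (the filename below is the same placeholder as in the snippet above):

```python
# Sketch: hf_hub_download returns the local path of the cached file; that path
# (or a copy placed under ./data) can be passed to main.py via --dataset.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(repo_id="Vahe1994/AQLM", filename="data/name.pth", repo_type="dataset")
print(local_path)  # pass this path via --dataset when running main.py
```
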
There are additional hyperparameters available. Run `python main.py --help` for more details.

### Finetuning

**Note:** this code will only fine-tune continuous parameters. To fine-tune both continuous and discrete parameters, please switch to the [pv-tuning](https://github.com/Vahe1994/AQLM/tree/pv-tuning) branch and follow the instructions in its README.

The accuracy of the quantized model can be further improved via block finetuning. First, the logits
of the float16/bfloat16 model are cached in RAM. Then the differentiable parameters of the quantized model
are optimized to minimize KL divergence with the teacher logits. Typically, we use the same calibration data that was used for model quantization.
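
The following is a minimal sketch of that distillation objective, not the repository's actual fine-tuning code; the tensors are random stand-ins for the quantized model's logits and the cached teacher logits:

```python
# Illustrative sketch of the KL-to-teacher loss used for block finetuning.
import torch
import torch.nn.functional as F


def kl_to_teacher(student_logits: torch.Tensor, teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student), averaged over all token positions.

    Both tensors have shape [batch, seq_len, vocab_size].
    """
    vocab_size = student_logits.shape[-1]
    return F.kl_div(
        F.log_softmax(student_logits.reshape(-1, vocab_size), dim=-1),
        F.log_softmax(teacher_logits.reshape(-1, vocab_size), dim=-1),
        log_target=True,
        reduction="batchmean",  # sums over vocab, averages over token positions
    )


# Toy demonstration with random logits (stand-ins for real model outputs).
student = torch.randn(2, 8, 32000, requires_grad=True)
teacher = torch.randn(2, 8, 32000)
loss = kl_to_teacher(student, teacher)
loss.backward()  # gradients flow into the student logits (in practice, into the quantized model's continuous parameters)
print(f"KL loss: {loss.item():.4f}")
```
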
Main CLI arguments:
- `--finetune_dtype` - which dtype should be used for fine-tuning. By default `float32`.
- `--amp` - whether to use AMP during fine-tuning. Requires `--finetune_dtype=float32`.

For larger models, one would need multi-GPU training. At the moment, FSDP training is not implemented, and the model is fine-tuned in a single process with parameters sharded across the available devices.


### Zero-shot benchmarks via LM Evaluation Harness

If you found this work useful, please consider citing:
```
@misc{egiazarian2024extreme,
title={Extreme Compression of Large Language Models via Additive Quantization},
author={Vage Egiazarian and Andrei Panferov and Denis Kuznedelev and Elias Frantar and Artem Babenko and Dan Alistarh},
year={2024},
eprint={2401.06118},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
@misc{malinovskii2024pvtuning,
title={PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression},
author={Vladimir Malinovskii and Denis Mazur and Ivan Ilin and Denis Kuznedelev and Konstantin Burlachenko and Kai Yi and Dan Alistarh and Peter Richtarik},
year={2024},
eprint={2405.14852},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
```
