Transformers save/load compatibility and inference kernels #3
Merged
111 commits
e159c7f
Added 1x16 CUDA kernel
efrantar 9ec6b70
Conversion script
395a9f6
black, src, indent
8e80100
pack_int_data
4cabbc3
deprecated double quant, scales
2c97a9f
estimate_nbits_per_parameter
7864596
scales fix
3f66363
First triton kernel
07986d9
black, isort
56359db
Quantization refactoring started
6ad6a6b
restored double quant
46be31d
estimate_nbits_per_parameter
6ee4353
less diff aq_engine
9e917f1
bias processing
e281962
removed debug prints
8c35e89
additional kwargs in config
a795ee6
removed matmul kernel
10c0a7e
packing and unpacking integers
c41f09e
packs and unpacks
ed1430f
undoing
eb89781
FinalizedQuantizedLinear
96301f7
tied up and saving
c19a5f6
fixed saving
0013d8e
removed unsupported kwargs
89d6085
triton kernel again
444f788
bias in triton
0f5483a
renamed smt
a9e65bb
new configuration ides
27e3855
inference file copying
0f37ea6
separated saving
e3b623f
Fixed cloning
fe9a9f2
skernel
164c5d1
better saving
db0d5a2
isort
0d7e1af
removed unnecessary dependencies
9f83593
lm_eval tokenizer trust remote code
990d5b2
llama tokenizer
b0b59f5
Deleted llama tokenizers
fd8395f
faster triton kernel
e519193
has_bias tl constexpr
9e89f61
cpp_kernel benchmarks
e673e0b
better order
0d69cf6
better compile flags
59e1f3c
removed unnecessary pragmas
e98ed7c
fixed stuff
c245c5e
removed test function
BlackSamorez 147ed79
icpx
BlackSamorez a5ae331
inference_lib
BlackSamorez 11e5dfd
inference lib done
e5e9ee9
Correct modeling_llama.py
48d6ddf
new version and fixed path
51d664b
undoing src and main
8979ae1
Merge remote-tracking branch 'origin/cuda-kernel' into transformers_cuda
4f31eae
cuda kernel
c9cf936
cuda kernel integration
d0f6ed4
removed cpp kernel
6cc2756
removed src changes
7476833
rmd testing notebook
eb8c2cd
dev3
1db5115
include nonpython files
7f7e853
benchmarks (temp)
5b3a5d2
test update
1804499
Some fixes and added 2x8 kernel
efrantar 07f72b6
Merge remote-tracking branch 'origin/cuda-kernel' into transformers
bf0880f
new kernels
d7c4561
kernel asserts fix
823db17
numba kernel
22a7994
cleaner benchmark
b906bfd
handling flash-attn
6a6ebd3
no cuda import
BlackSamorez 3937640
numba kernel working
BlackSamorez c643fec
black isort
d67d119
newer matmul benchmark
BlackSamorez c31d532
Merge branch 'transformers' of github.com:Vahe1994/AQLM into transfor…
BlackSamorez 2d0cae8
fixed transposes
3deeab2
updated benchmarks
BlackSamorez aca05dd
removed extra benchmarks
cfa5e4a
less diff
9498bf3
benchmarks
7c6d234
Merge branch 'transformers' of github.com:Vahe1994/AQLM into transfor…
426a7b6
numba parallel and style
78cc9a8
cuda moved
1278164
moved cuda kernel
2dbd188
moved numba kernel
935347e
removed unnecessary functions
7b8faf8
dev7
33b0464
updated manifest
ead1c00
dev9
88d9a93
Update transformers/llama/modeling_llama_aqlm.py
BlackSamorez 28d70f8
Update benchmark/generate_benchmark.py
BlackSamorez b31a3fc
Update benchmark/generate_benchmark.py
BlackSamorez d9f6b25
Update inference_lib/setup.cfg
BlackSamorez 503ff40
correct authors
26ff8b0
cpp 1x16
09a7810
2x8 matmat cpp
c434d42
dev10
788c289
colab example
9fdf0a6
black
f2ef38b
colab example notebook
7342655
dev11 fix from Elias
989d5d8
dev12 __CUDA_ARCH__
5d4f4f3
much stuff
2a32c0a
readme, demo, req
f019b4e
more readme
d90c43b
dtype asserts
e06a789
black
098363a
installation
d7b6dfa
1.0.0
4bd67b9
1.0.0 for colab
d44c29d
deleted output
79706d0
mistral and mixtral
@@ -0,0 +1,110 @@
import argparse
import os

os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"
import time
import warnings

warnings.filterwarnings("ignore")

import torch

torch.set_num_threads(8)
from torch import nn

from transformers import AutoConfig, AutoModelForCausalLM

if __name__ == "__main__":
    parser = argparse.ArgumentParser(add_help=True)
    parser.add_argument(
        "--model",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--num_codebooks",
        type=int,
        default=None,
    )
    parser.add_argument(
        "--in_group_size",
        type=int,
        default=None,
    )
    parser.add_argument(
        "--nbits_per_codebook",
        type=int,
        default=None,
    )
    parser.add_argument(
        "--warmup_iters",
        type=int,
        default=1,
        help="Number of warmup iterations.",
    )
    parser.add_argument(
        "--benchmark_iters",
        type=int,
        default=3,
        help="Number of benchmark iterations.",
    )
    parser.add_argument(
        "--input_length",
        type=int,
        default=1,
        help="Input length.",
    )
    parser.add_argument(
        "--output_length",
        type=int,
        default=128,
        help="Output length.",
    )
    args = parser.parse_args()

    device = "cpu"

    config = AutoConfig.from_pretrained(args.model, trust_remote_code=True, torch_dtype=torch.float32)
    if args.num_codebooks is not None:
        config.aqlm["num_codebooks"] = args.num_codebooks
    if args.in_group_size is not None:
        config.aqlm["in_group_size"] = args.in_group_size
    if args.nbits_per_codebook is not None:
        config.aqlm["nbits_per_codebook"] = args.nbits_per_codebook

    real_num_layers = config.num_hidden_layers
    if "meta-llama" in args.model:
        config.num_hidden_layers = 1
    aqlm_model = AutoModelForCausalLM.from_config(config, trust_remote_code=True, torch_dtype=torch.float32)

    if "meta-llama" in args.model:
        aqlm_model.config.num_hidden_layers = real_num_layers
        layer = aqlm_model.model.layers[0]
        aqlm_model.model.layers = nn.ModuleList([])
        for i in range(real_num_layers):
            another_layer = type(layer)(config, i)

            another_layer.self_attn.q_proj.weight.data = layer.self_attn.q_proj.weight.data
            another_layer.self_attn.k_proj.weight.data = layer.self_attn.k_proj.weight.data
            another_layer.self_attn.v_proj.weight.data = layer.self_attn.v_proj.weight.data
            another_layer.self_attn.o_proj.weight.data = layer.self_attn.o_proj.weight.data
            another_layer.mlp.up_proj.weight.data = layer.mlp.up_proj.weight.data
            another_layer.mlp.down_proj.weight.data = layer.mlp.down_proj.weight.data
            another_layer.mlp.gate_proj.weight.data = layer.mlp.gate_proj.weight.data

            another_layer.self_attn.layer_idx = i
            aqlm_model.model.layers.append(another_layer)

        aqlm_model.model.config.num_hidden_layers = real_num_layers

    prompt = torch.randint(low=0, high=aqlm_model.config.vocab_size, size=(1, args.input_length), device=device)

    for i in range(args.warmup_iters + args.benchmark_iters):
        aqlm_model.generate(prompt, min_new_tokens=args.output_length, max_new_tokens=args.output_length)
        if i == args.warmup_iters - 1:
            t_s = time.perf_counter()
    t_e = time.perf_counter()

    tokens_per_second = args.benchmark_iters * args.output_length / (t_e - t_s)
    print(f"<Tokens per second> = {tokens_per_second:.3f}")
@@ -0,0 +1,107 @@
import argparse
import os
import time
import warnings

warnings.filterwarnings("ignore")
import torch
import torch.nn as nn
from tqdm import trange

from transformers import AutoConfig, AutoModelForCausalLM

if __name__ == "__main__":
    assert torch.cuda.is_available()
    device = torch.device("cuda")

    parser = argparse.ArgumentParser(add_help=True)

    parser.add_argument(
        "--model",
        type=str,
        required=True,
    )
    parser.add_argument(
        "--warmup_iters",
        type=int,
        default=1,
        help="Number of warmup iterations.",
    )
    parser.add_argument(
        "--benchmark_iters",
        type=int,
        default=10,
        help="Number of benchmark iterations.",
    )
    parser.add_argument(
        "--input_length",
        type=int,
        default=1,
        help="Input length.",
    )
    parser.add_argument(
        "--output_length",
        type=int,
        default=128,
        help="Output length.",
    )
    parser.add_argument(
        "--real_model",
        action="store_true",
    )
    parser.add_argument(
        "--low_cpu_mem_usage",
        action="store_true",
    )

    args = parser.parse_args()


def load_model(model_name, device="cuda"):
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        trust_remote_code=True,
        torch_dtype="auto",
    ).to(device)


def load_shared_model(model_name, device="cuda"):
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
    num_layers = config.num_hidden_layers
    config.num_hidden_layers = 1
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True, torch_dtype=torch.float16).to(device)
    layer = model.model.layers[0]
    for i in trange(1, num_layers, desc="Copying block parameters"):
        new_layer = type(layer)(model.config, i).to(device)
        for new_layer_param, layer_param in zip(new_layer.parameters(), layer.parameters()):
            new_layer_param.data = layer_param.data
        new_layer.self_attn.layer_idx = i
        model.model.layers.append(new_layer)
    return model


if __name__ == "__main__":
    assert torch.cuda.is_available()
    device = torch.device("cuda")

    parser = argparse.ArgumentParser(add_help=True)

    config = AutoConfig.from_pretrained(args.model, trust_remote_code=True)

    if args.real_model:
        aqlm_model = load_model(args.model, device)
    else:
        aqlm_model = load_shared_model(args.model, device)

    prompt = torch.randint(low=0, high=aqlm_model.config.vocab_size, size=(1, args.input_length), device=device)

    for i in range(args.warmup_iters + args.benchmark_iters):
        output = aqlm_model.generate(prompt, min_new_tokens=args.output_length, max_new_tokens=args.output_length)
        if i == args.warmup_iters - 1:
            torch.cuda.synchronize(device)
            t_s = time.perf_counter()
    torch.cuda.synchronize(device)
    t_e = time.perf_counter()

    tokens_per_second = args.benchmark_iters * args.output_length / (t_e - t_s)
    print(f"<Tokens per second> = {tokens_per_second:.2f}")
Review comment: I can't find an else statement here. Consider adding one, e.g. else: raise NotImplementedError(...)?
Reply: It's more complicated: the check is meant to differentiate between quantized and unquantized models that way.
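To make the reply concrete, below is a minimal, self-contained sketch (not the PR's actual code) of the pattern being described: whether a linear layer becomes a quantized AQLM layer or stays a plain nn.Linear is decided by whether the model config carries an aqlm section, which is why no else/NotImplementedError branch is needed. MyQuantizedLinear is only a stand-in for the PR's FinalizedQuantizedLinear, and make_linear is a hypothetical helper.

import torch.nn as nn


class MyQuantizedLinear(nn.Module):
    # Stand-in for an AQLM quantized linear layer; the real kernels live in the PR's inference lib.
    def __init__(self, in_features, out_features, bias=True, **aqlm_kwargs):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.aqlm_kwargs = aqlm_kwargs  # e.g. num_codebooks, in_group_size, nbits_per_codebook


def make_linear(config, in_features, out_features, bias=True):
    # Quantized checkpoints carry an `aqlm` dict in their config (see the benchmark
    # scripts above); plain, unquantized checkpoints do not.
    aqlm_params = getattr(config, "aqlm", None)
    if aqlm_params is not None:
        return MyQuantizedLinear(in_features, out_features, bias=bias, **aqlm_params)
    # Unquantized model: keep a regular dense layer instead of raising NotImplementedError.
    return nn.Linear(in_features, out_features, bias=bias)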