Compressed Llama

"Making LLaMA super small to run on low-power edge devices!"

This project presents a comprehensive framework that integrates multiple compression techniques applied to a distilled LLaMA 3.2 1B Instruct model. To mitigate the inherent performance degradation, recovery methods including LoRA and EoRA are also implemented throughout the pipeline. The resulting small language models (SLMs) are optimized for deployment on resource-constrained devices.
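
As a rough illustration of the recovery step, the snippet below is a minimal sketch of attaching a LoRA adapter to a Llama-style causal LM with the Hugging Face peft library. The model id, rank, and target modules are illustrative assumptions, not the project's actual configuration.

# Minimal sketch (not the project's pipeline code): attach a LoRA adapter with peft.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

lora_cfg = LoraConfig(
    r=16,                                 # assumed low-rank dimension of the adapter
    lora_alpha=32,                        # assumed scaling factor for the adapter output
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the adapter weights are trainable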

The framework represents several months of research focused on identifying compression and recovery approaches that balance computational efficiency with effectiveness, and it is essentially a follow-up to the Franken-Llama project.

You can find the full master's thesis, which covers the inner workings of LLMs, this project, and the development process, at this repo.

Setup

Since GPTQModel support on Windows is spotty at best, I strongly suggest sticking to Linux when running this code.

Create a Python environment with venv and install the required dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

How to use

The following sections show example commands for the compression and evaluation pipelines. The complete list of options provided by these two main scripts can be found in the documentation.

Compression Pipeline

Provided that you've already activated the Python environment, running the following command will execute every stage of the pipeline and generate a compressed model:

python compression_pipeline.py --num_layers_to_prune 2 --prune_n 2 --prune_m 4 --model_tag "depth_ppl_2_width_2_4"

This will apply the optimizations in the following order:

  1. Depth Pruning (pruning 2 layers using PPL importance)
  2. Width Pruning (using ratio 2:4)
  3. LoRA
  4. Quantization
  5. EoRA

The results will be saved with the tag "depth_ppl_2_width_2_4".
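
The sketch below is only meant to illustrate what the 2:4 width-pruning pattern in step 2 looks like: n:m semi-structured pruning keeps 2 nonzero weights in every group of 4. It scores weights by plain magnitude for brevity, whereas a Wanda-style pruner weighs magnitudes by calibration activation norms; nm_prune_ is a hypothetical helper, not the project's code.

# Minimal sketch of n:m semi-structured (2:4) pruning by weight magnitude.
import torch

def nm_prune_(weight: torch.Tensor, n: int = 2, m: int = 4) -> None:
    """In-place n:m pruning: keep the n largest-magnitude weights in every
    group of m consecutive columns of each row (hypothetical helper)."""
    rows, cols = weight.shape
    assert cols % m == 0, "in_features must be divisible by m"
    scores = weight.abs().reshape(rows, cols // m, m)            # blocks of m columns
    _, drop = torch.topk(scores, m - n, dim=-1, largest=False)   # smallest entries per block
    mask = torch.ones_like(scores)
    mask.scatter_(-1, drop, 0.0)                                 # zero out the dropped slots
    weight.mul_(mask.reshape(rows, cols))

W = torch.randn(8, 16)
nm_prune_(W, n=2, m=4)
print((W.reshape(8, 4, 4) != 0).sum(dim=-1))                     # every group of 4 keeps 2 nonzeros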

Evaluation Pipeline

Provided that you've already activated the Python environment, you can test the model on the WikiText and TriviaQA benchmarks with the following command:

python evaluate.py --model_path [path_to_model]
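
As a rough picture of what the WikiText part of the evaluation measures, the snippet below is a minimal perplexity sketch assuming the compressed model loads through Hugging Face transformers. The model path, context window, and dataset split are assumptions; the project's evaluate.py may compute this differently.

# Minimal sketch: WikiText-2 perplexity over non-overlapping windows.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/compressed-model"   # hypothetical path to an output of the pipeline
tok = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path).eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

nlls, window = [], 2048                   # assumed context window
with torch.no_grad():
    for i in range(0, ids.size(1) - window, window):
        chunk = ids[:, i : i + window]
        # passing labels == input_ids makes the model return the mean cross-entropy loss
        nlls.append(model(chunk, labels=chunk).loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())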

Documentation

A full explanation of the modules that make up this project is provided here. In addition, the functionality of each Bash script is listed here.

Folder structure

compressed-llama/
├── .devcontainer/
├── evaluation_results/
├── scripts/
│   ├── analyze_from_list.sh
│   ├── analyze_script.sh
│   ├── benchmark_models.sh
│   ├── generate_configurations.sh
│   ├── get_logs.sh
│   ├── lora_from_list.sh
│   ├── lora_list_experiment.sh
│   └── model_analyzer.py
├── src/
│   ├── depth/
│   │   ├── depth_pruner.py
│   │   └── layer_importance_calculator.py
│   ├── wanda/
│   │   ├── wanda_pruner.py
│   │   └── wrappers.py
│   ├── data_utils.py
│   ├── evaluation.py
│   ├── logging.py
│   ├── lorafy.py
│   ├── metrics.py
│   ├── quantize.py
│   └── script_utils.py
├── compression_pipeline.py
├── evaluate.py
├── README.md
└── requirements.txt

Notebooks

The demos folder contains a series of notebooks with self-contained examples of the compression and performance-recovery techniques employed in this project.

Results

A discussion and review of the results is provided in the final thesis.

Main References

  • ModelCloud, GPTQModel, URL: https://github.com/ModelCloud/GPTQModel

  • Arpan Suravi Prasad, Moritz Scherer, Francesco Conti, Davide Rossi, Alfio Di Mauro, Manuel Eggimann, Jorge Tómas Gómez, Ziyun Li, Syed Shakib Sarwar, Zhao Wang, Barbara De Salvo, Luca Benini, Siracusa: A 16 nm Heterogenous RISC-V SoC for Extended Reality with At-MRAM Neural Engine, 2023. arXiv: 2312.14750. URL: https://arxiv.org/abs/2312.14750

  • Bo-Kyeong Kim, Geonmin Kim, Tae-Ho Kim, Thibault Castells, Shinkook Choi, Junho Shin, Hyoung-Kyu Song, Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods, 2024. arXiv: 2402.02834. URL: https://arxiv.org/abs/2402.02834

  • Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter, A Simple and Effective Pruning Approach for Large Language Models, 2023. arXiv: 2306.11695. URL: https://arxiv.org/abs/2306.11695

  • Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh, GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers, 2022. arXiv: 2210.17323. URL: https://arxiv.org/abs/2210.17323

  • Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, LoRA: Low-Rank Adaptation of Large Language Models, 2021. arXiv: 2106.09685. URL: https://arxiv.org/abs/2106.09685

  • Shih-Yang Liu, Maksim Khadkevich, Nai Chit Fung, Charbel Sakr, Chao-Han Huck Yang, Chien-Yi Wang, Saurav Muralidharan, Hongxu Yin, Kwang-Ting Cheng, Jan Kautz, Yu-Chiang Frank Wang, Pavlo Molchanov, Min-Hung Chen, EoRA: Fine-tuning-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation, 2024. arXiv: 2410.21271. URL: https://arxiv.org/abs/2410.21271

  • Llama Team, AI @ Meta, Llama-3.2-1B, Hugging Face Model, 2024. URL: https://huggingface.co/meta-llama/Llama-3.2-1B
