
[ICML 2024] LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models


Introduction

[LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models] [ArXiv]
Guangyan Li, Yongqiang Tang, Wensheng Zhang
Institute of Automation, Chinese Academy of Sciences

Supported LLMs: LLaMA (e.g., LLaMA-7B)

Table of Contents

  • Installation
  • Minimal Example
  • Compression Instruction
  • Model Evaluation
  • Acknowledgement
  • Citation

Installation

Instructions for setting up the model compression environment can be found in INSTALL.md.

The evaluation environment is consistent with LLM-Pruner; see requirement.txt.

Minimal Example

bash llama_7b.sh

This script compresses the LLaMA-7B model by 20% of its parameters with LoRAP.

Compression Instruction

To compress LLaMA-7B by ~20% of its parameters:

python main.py \
    --model decapoda-research/llama-7b-hf \
    --dataset bookcorpus \
    --sparsity_ratio 0.2 \
    --para_allocate 3 \
    --mlp_compress_method prune \
    --deco_method AWSVD \
    --sublayer self_attn,mlp \
    --save_model "compressed_model/lorap_0.2/" \
    --real_com False

Arguments:

  • --model: The identifier of the LLaMA model on the Hugging Face model hub. The name is passed to AutoModelForCausalLM.from_pretrained to load the pre-trained LLM. For example, to use LLaMA with 7 billion parameters, pass decapoda-research/llama-7b-hf to --model.
  • --dataset: The calibration dataset, chosen from [c4, PTB, wikitext2, bookcorpus]. The default is bookcorpus.
  • --sparsity_ratio: The proportion of model parameters to remove.
  • --para_allocate: The parameter ratio of (Wv + Wo) : (Wq + Wk).
  • --mlp_compress_method: The compression method for MLP sub-layers, chosen from [prune, decom]. The default is prune.
  • --deco_method: The matrix factorization method, chosen from [AWSVD, AFM, SVD]. The default is AWSVD (see the sketch after this list).
  • --sublayer: The sub-layers to compress. The default is [self_attn, mlp]; you can also pass self_attn or mlp alone.
  • --save_model: The directory where the compressed model will be stored.
  • --real_com: Whether to physically compress the model (see the note after this list).
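
For intuition, here is a minimal sketch of what an activation-weighted SVD (the AWSVD option) can look like: each input channel of a weight matrix is scaled by an activation statistic gathered on calibration data, a truncated SVD is taken, and the scaling is undone in the right factor. The function name awsvd_factorize, the clamping constant, and the exact scaling scheme are illustrative assumptions, not the code in this repository.

import torch

def awsvd_factorize(weight: torch.Tensor, act_norm: torch.Tensor, rank: int):
    # weight: (out_features, in_features); act_norm: per-input-channel activation scale
    # (in_features,) collected on calibration data. Assumption: channels with larger
    # activations matter more, so they are up-weighted before the SVD and the scaling
    # is removed afterwards.
    scale = act_norm.clamp(min=1e-6)              # avoid dividing by zero below
    u, s, vh = torch.linalg.svd(weight * scale.unsqueeze(0), full_matrices=False)
    a = u[:, :rank] * s[:rank]                    # (out_features, rank)
    b = vh[:rank, :] / scale.unsqueeze(0)         # (rank, in_features), scaling undone
    return a, b                                   # a @ b approximates weight

A rank-r pair stores r * (in_features + out_features) parameters instead of in_features * out_features, which is roughly how a target sparsity ratio and the (Wv + Wo) : (Wq + Wk) allocation translate into per-matrix ranks.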

After compression, we follow LLM-Pruner for LoRA fine-tuning and evaluation; the latest evaluation version is lm-evaluation-harness. Since LoRA fine-tuning only supports torch.nn.Linear and Conv1D, the model is not physically compressed during the compression step. Instead, after fine-tuning, the model is decomposed once again with After_tune.py.
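
As a rough illustration of that last step, the sketch below shows how a fine-tuned dense nn.Linear could be swapped for a low-rank pair once its factors are known; the helper name replace_with_low_rank is hypothetical and is not the actual contents of After_tune.py.

import torch.nn as nn

def replace_with_low_rank(linear: nn.Linear, a, b) -> nn.Sequential:
    # Replace a dense layer computing x @ W^T + bias with two smaller linears,
    # where a (out_features x rank) and b (rank x in_features) satisfy a @ b ~= W.
    rank = a.shape[1]
    down = nn.Linear(linear.in_features, rank, bias=False)
    up = nn.Linear(rank, linear.out_features, bias=linear.bias is not None)
    down.weight.data.copy_(b)                     # (rank, in_features)
    up.weight.data.copy_(a)                       # (out_features, rank)
    if linear.bias is not None:
        up.bias.data.copy_(linear.bias.data)
    return nn.Sequential(down, up)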

Model Evaluation

The performance of the compressed model on language modeling and zero-shot tasks is summarized in the paper; more results can be found there.
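
For a quick sanity check outside that pipeline, a generic perplexity measurement on WikiText-2 might look like the following; the checkpoint path, sequence length, and dtype are assumptions, and the numbers reported in the paper come from the LLM-Pruner / lm-evaluation-harness setup, not from this snippet.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "compressed_model/lorap_0.2/"          # assumed: the --save_model directory
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float16, device_map="auto").eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids

seq_len, nlls = 2048, []
for i in range(0, ids.size(1) - seq_len, seq_len):
    chunk = ids[:, i : i + seq_len].to(model.device)
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss)   # causal LM loss over the chunk
print("wikitext2 perplexity:", torch.exp(torch.stack(nlls).mean()).item())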

Acknowledgement

This repository builds on LLM-Pruner and lm-evaluation-harness.

Citation

If you find this project useful, please cite:

@misc{li2024lorap,
      title={LoRAP: Transformer Sub-Layers Deserve Differentiated Structured Compression for Large Language Models}, 
      author={Guangyan Li and Yongqiang Tang and Wensheng Zhang},
      year={2024},
      eprint={2404.09695},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
