Efficient fine-tuning of LLaMA2 7B on a single GPU

A minimal implementation for running instruction tuning and inference on LLaMA2 7B with a single NVIDIA A100 GPU. Applied techniques include Low-Rank Adaptation (LoRA), automatic mixed-precision (AMP) training with gradient scaling, gradient accumulation, and gradient checkpointing.
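Low-rank adaptation freezes the pretrained weights and trains only a pair of small rank-r matrices per adapted projection, which is how the trainable-parameter count in the results below drops from roughly 1.9 B to about 2.1 M. A minimal sketch of the idea in PyTorch (illustrative only; the class, layer choice, and hyperparameters are assumptions, not this repository's code):

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen nn.Linear plus a trainable low-rank update: y = W x + (alpha/r) * B (A x)."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False            # pretrained weight and bias stay frozen
            self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
            self.scaling = alpha / r

        def forward(self, x):
            return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

    # e.g. wrap one 4096x4096 attention projection (4096 is the LLaMA2 7B hidden size)
    layer = LoRALinear(nn.Linear(4096, 4096), r=8, alpha=16)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # 2 * 8 * 4096 = 65,536 trainable parameters for this projection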

Installation

Model

To download the LLaMA2 weights and tokenizer, visit the Meta website, accept the license, and follow the download instructions provided there.

Environment

Tested on (example setup commands are sketched after this list)

  • gcc/11.3.0
  • cuda/11.8.0
  • python/3.9.12
  • pytorch/2.1.0
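One possible way to reproduce a matching environment (the commands below are assumptions, not part of the repository):

    conda create -n llama-lora python=3.9 -y
    conda activate llama-lora
    pip install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118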

Usage

  • Inference

    Change model_path, tokenizer_path, and lora_weights_path in inference.py (placeholder values are sketched after this list), then run

    python inference.py
    
  • Fine-tuning

    python finetune.py
    
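The three paths mentioned in the Inference step live near the top of inference.py; the values below are placeholders for illustration, not the repository's defaults:

    model_path = "weights/llama-2-7b"              # hypothetical path to the downloaded LLaMA2 checkpoint
    tokenizer_path = "weights/tokenizer.model"     # hypothetical path to the SentencePiece tokenizer
    lora_weights_path = "output/lora_weights.pth"  # hypothetical path to LoRA weights saved by finetune.py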

Results

Memory Usage

For n_layers = 8 (number of transformer blocks; default = 32) and epochs = 5:

| Configuration | Trainable Parameters | GPU Memory Usage (MiB) | Training Time (seconds) |
| --- | --- | --- | --- |
| Original | 1,881,214,976 | 38,401 | / |
| + Low-Rank Adaptation | 2,097,152 | 10,377 | 70.31 |
| + Automatic Mixed-Precision Training & Gradient Scaling | 2,097,152 | 13,079 | 25.96 |
| + Gradient Accumulation | 2,097,152 | 13,089 | 25.12 |
| + Gradient Checkpointing | 2,097,152 | 9,409 | 45.98 |
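Each "+" row adds one technique on top of the previous configuration. A minimal sketch of how these pieces typically combine in a single PyTorch training step, using a toy model on an assumed CUDA device (not the repository's finetune.py):

    import torch
    import torch.nn as nn
    from torch.cuda.amp import autocast, GradScaler
    from torch.utils.checkpoint import checkpoint

    device = "cuda"                                # assumes a CUDA GPU, e.g. the A100 used above
    blocks = nn.ModuleList([nn.Linear(256, 256) for _ in range(4)]).to(device)  # stand-ins for transformer blocks
    head = nn.Linear(256, 10).to(device)
    optimizer = torch.optim.AdamW(list(blocks.parameters()) + list(head.parameters()), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    scaler = GradScaler()                          # gradient scaling keeps small fp16 gradients from underflowing
    accum_steps = 4                                # gradient accumulation factor (illustrative)

    for step in range(16):
        x = torch.randn(8, 256, device=device)
        y = torch.randint(0, 10, (8,), device=device)
        with autocast(dtype=torch.float16):        # automatic mixed precision for the forward pass
            h = x
            for block in blocks:
                # gradient checkpointing: discard activations now, recompute them during backward
                h = checkpoint(block, h, use_reentrant=False)
            loss = loss_fn(head(h), y) / accum_steps  # divide so accumulated gradients average correctly
        scaler.scale(loss).backward()              # scaled backward pass; gradients accumulate across steps
        if (step + 1) % accum_steps == 0:
            scaler.step(optimizer)                 # unscales gradients and skips the step on inf/nan
            scaler.update()
            optimizer.zero_grad(set_to_none=True)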
