A minimal implementation for running instruction tuning and inference on LLaMA2 with a single NVIDIA A100 GPU. Applied techniques include Low-rank Adaptation (LoRA), Automatic Mixed-Precision (AMP) Training, Gradient Scaling, Gradient Accumulation, and Gradient Checkpointing.
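For orientation, here is a minimal sketch of the LoRA idea: the pretrained weight is frozen, and a trainable low-rank update scaled by `alpha / r` is added on top of it. The `LoRALinear` class below is illustrative, not necessarily the implementation used in this repo.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative sketch: a frozen nn.Linear plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pretrained weight
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # base(x) + scaling * (x @ A^T @ B^T); only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```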
To download the LLaMA2 weights and tokenizer, please visit the Meta website and accept the license.

## Instructions
Tested on:
- gcc/11.3.0
- cuda/11.8.0
- python/3.9.12
- pytorch/2.1.0

- Inference: change `model_path`, `tokenizer_path`, and `lora_weights_path` in `inference.py`, then run `python inference.py` (a sketch of the decoding step follows this list)
- Finetuning: run `python finetune.py` (see the training-step sketch after this list)
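Below is a sketch of the decoding step referenced in the inference item. The path values and the `model`/`tokenizer` interfaces are hypothetical illustrations; `inference.py` defines the actual loading logic.

```python
import torch

# Hypothetical path values; set these to your local copies as described above.
model_path = "llama-2-7b/consolidated.00.pth"
tokenizer_path = "tokenizer.model"
lora_weights_path = "lora_weights.pth"

@torch.inference_mode()
def generate(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy decoding sketch; assumes a causal LM returning (batch, seq, vocab) logits."""
    tokens = torch.tensor([tokenizer.encode(prompt)], device="cuda")
    for _ in range(max_new_tokens):
        logits = model(tokens)  # (1, seq_len, vocab_size)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tokens = torch.cat([tokens, next_token], dim=-1)
    return tokenizer.decode(tokens[0].tolist())
```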
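And a minimal sketch of one training step combining AMP, gradient scaling, and gradient accumulation, as measured in the table below. The toy model, optimizer, data, and `accum_steps` value are stand-ins; `finetune.py` trains the LoRA parameters of LLaMA2 instead.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Toy stand-ins so the sketch runs end to end; finetune.py uses the real model/data.
model = torch.nn.Linear(128, 32000).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(4, 128, device="cuda"),
           torch.randint(0, 32000, (4,), device="cuda")) for _ in range(8)]
accum_steps = 4        # hypothetical gradient-accumulation factor

scaler = GradScaler()  # scales the loss so fp16 gradients do not underflow
optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(loader):
    with autocast(dtype=torch.float16):  # forward pass in mixed precision
        loss = torch.nn.functional.cross_entropy(model(inputs), targets) / accum_steps
    scaler.scale(loss).backward()        # backward on the scaled loss
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)           # unscales grads; skips the step on inf/NaN
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```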
For `n_layers` = 8 (the number of transformer blocks, default = 32) and `epochs` = 5:
Configuration | Trainable Parameters | GPU Memory Usage (MiB) | Training Time (seconds) |
---|---|---|---|
Original | 1,881,214,976 | 38,401 | / |
+ Low-rank Adaptation | 2,097,152 | 10,377 | 70.31 |
+ Automatic Mixed-Precision Training & Gradient Scaling | 2,097,152 | 13,079 | 25.96 |
+ Gradient Accumulation | 2,097,152 | 13,089 | 25.12 |
+ Gradient Checkpointing | 2,097,152 | 9,409 | 45.98 |
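The last row reflects the usual gradient-checkpointing trade-off: memory drops from 13,089 to 9,409 MiB while training time nearly doubles (25.12 → 45.98 s), because activations inside each checkpointed transformer block are recomputed during the backward pass. A minimal sketch of how blocks can be wrapped (the `CheckpointedBlocks` class is illustrative, not necessarily this repo's implementation):

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlocks(torch.nn.Module):
    """Illustrative sketch: run each transformer block under activation checkpointing."""
    def __init__(self, blocks: torch.nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Intermediate activations inside `block` are freed after the forward
            # pass and recomputed during backward, trading compute for memory.
            h = checkpoint(block, h, use_reentrant=False)
        return h
```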