Universal LLM Trainer

Recent updates 📣

  • August 2025 (v1.6.1): Added more GPU test results.
  • April 2025 (v1.6.0): Universal LLM trainer now supports EleutherAI's harness evaluation.
  • April 2025 (v1.5.8): Updated multi-turn data training code for Phi-3, LLaMA 3, LLaMA 3.1, Gemma, and Gemma 2.
  • April 2025 (v1.5.7): Updated validation logic to skip answer generation when it is not needed, for more efficient training.
  • April 2025 (v1.5.6): Added code to convert trainer checkpoints to the Hugging Face model format.
  • April 2025 (v1.5.5): Logging has been simplified. Universal LLM trainer saves optimizer states and model checkpoints, and supports two LoRA adapter saving methods: merged LoRA model and LoRA adapter only.
  • March 2025 (v1.5.4): Universal LLM trainer supports Llama 3.1 70B LoRA training, and GPU memory usage during FSDP model training has been reduced.
  • March 2025 (v1.5.3): QLoRA test results have been added.
  • March 2025 (v1.5.2): Universal LLM trainer no longer supports KoPolyglot and KoGemma, and now supports Llama 2 and Gemma 1. GPU memory usage during model training has also been reduced.
  • March 2025 (v1.5.1): Universal LLM trainer no longer supports unnecessary functions (e.g., NMT, translator).
  • February 2025 (v1.5.0): Universal LLM trainer now supports the LLaMA 2 template. Please refer to the templates folder.
  • November 2024 (v1.4.4): Universal LLM trainer now fully supports FSDP training.
  • October 2024 (v1.4.3): Changed the QLoRA configuration.
  • September 2024 (v1.4.2): Fixed a rank-zero bug.
  • September 2024 (v1.4.1): Initial FSDP support and updated DDP training code.
  • July 2024 (v1.4.0): Universal LLM trainer now supports QLoRA training.
  • June 2024 (v1.3.0): Updated validation code; added the LLaMA 3 template and the Gemma model.
  • May 2024 (v1.2.0): Universal LLM trainer now supports the Phi-3 model. Model freezing and resuming training are also supported.
  • April 2024 (v1.1.0): Universal LLM trainer now supports the LLaMA 3 model family.
  • March 2024 (v1.0.0): Universal LLM trainer now supports DDP training and TensorBoard logging.

 

 

Overview 📚

This repository is designed to make it easy to tune models available on Hugging Face. When a new model is released, you can easily implement a model wrapper to perform instruction tuning and fine-tuning. For detailed usage instructions, please refer to the description below.

  • Universal LLM trainer supports full-training.
  • Universal LLM trainer supports LoRA fine-tuning.
  • Universal LLM trainer supports QLoRA fine-tuning.
  • Universal LLM trainer supports DDP and FSDP training strategies.
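To give a rough idea of what "implementing a model wrapper" means, here is a minimal, hypothetical sketch. The class and method names are purely illustrative and are not this repository's actual wrapper interface; conceptually, a wrapper just exposes a Hugging Face causal LM (and its tokenizer) behind a uniform forward call that a trainer can drive.

# Hypothetical wrapper sketch, for illustration only; this is NOT the
# repository's actual interface.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


class CausalLMWrapper(torch.nn.Module):
    """Wraps any Hugging Face causal LM behind a uniform interface."""

    def __init__(self, model_name: str, dtype=torch.bfloat16):
        super().__init__()
        self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask=None, labels=None):
        # The Hugging Face model computes the LM loss itself when labels are given.
        return self.model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)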

 

GPU memory and training speed

Below is an example of the memory requirements and training speed for different models. For the results of tests with more GPUs, please refer to this page.

Note

Training conditions:

  • Environments: Ubuntu 22.04.4 LTS, torch==2.5.1, transformers==4.49.0
  • Batch size: 2
  • Sequence length: 8,192 (without padding; tokens fully filled)
  • Model dtype: torch.bfloat16
  • Optimizer: torch.optim.AdamW
  • Gradient checkpointing enabled (tests were done with both the "torch" and "Hugging Face" gradient checkpointing methods)
  • Gradient accumulation:
    • Full fine-tuning: 1
    • LoRA: 32
| Model | Tuning Method | GPU | Peak Mem. (Model Mem.) | Sec/step |
|---|---|---|---|---|
| Llama 3.1 8B | Full | H100 x 1 | 78 GiB (16 GiB) | 4.7 |
| Llama 3.1 8B | LoRA | H100 x 1 | 36 GiB (16 GiB) | 6.4 |
| Llama 3.1 8B ** | QLoRA | H100 x 1 | 48 GiB (8.0 GiB) | 26.1 |
| Llama 3.1 70B * | LoRA | H100 x 2 | 66 GiB (CPU Offload) | 40.2 |
| Llama 3 8B | Full | H100 x 1 | 78 GiB (16 GiB) | 4.7 |
| Llama 3 8B | LoRA | H100 x 1 | 36 GiB (16 GiB) | 6.4 |
| Llama 3 8B ** | QLoRA | H100 x 1 | 48 GiB (8.0 GiB) | 26.1 |
| Llama 2 13B * | Full | H100 x 2 | 31 GiB (CPU Offload) | 9.5 |
| Llama 2 13B | LoRA | H100 x 1 | 43 GiB (25.5 GiB) | 9.8 |
| Llama 2 13B ** | QLoRA | H100 x 1 | 38 GiB (8.3 GiB) | 43.0 |
| Gemma 2 9B * | Full | H100 x 2 | 59 GiB (CPU Offload) | 12.6 |
| Gemma 2 9B | LoRA | H100 x 1 | 60 GiB (18 GiB) | 12.9 |
| Gemma 2 9B ** | QLoRA | H100 x 1 | OOM (8.4 GiB) | OOM |
| Gemma 7B * | Full | H100 x 2 | 48 GiB (CPU Offload) | 8.7 |
| Gemma 7B | LoRA | H100 x 1 | 51 GiB (17 GiB) | 9.5 |
| Gemma 7B ** | QLoRA | H100 x 1 | 70 GiB (7.5 GiB) | 27.4 |
| Phi3-mini (3.8B) | Full | H100 x 1 | 40 GiB (8 GiB) | 4.0 |
| Phi3-mini (3.8B) | LoRA | H100 x 1 | 17 GiB (8 GiB) | 5.0 |
| Phi3-mini (3.8B) ** | QLoRA | H100 x 1 | 21 GiB (3.2 GiB) | 17.6 |

*: FSDP training with CPU offloading + 32 gradient accumulation steps.
**: 4-bit QLoRA training. QLoRA does not always use less GPU memory than LoRA; it depends on sequence length and model size. Experimentally, QLoRA uses less GPU memory when the sequence length is below roughly 1,500. Please refer to the Google document.
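For reference, a 4-bit QLoRA setup in the Hugging Face ecosystem generally looks like the minimal sketch below. This is not the trainer's actual code: the model name, NF4 quantization settings, and LoRA rank are illustrative assumptions, and the trainer's real settings live in its YAML configs (e.g., config/example_llama3.1_qlora.yaml).

# Illustrative 4-bit QLoRA setup with bitsandbytes + PEFT; the values shown
# (NF4 quantization, bf16 compute, rank-16 LoRA) are assumptions, not the
# trainer's actual defaults.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # placeholder base model
    quantization_config=bnb_config,
)
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA parameters remain trainable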

 

 

Quick Starts 🚀

Environment setup

Install PyTorch and the other requirements. For a more detailed setup, including Docker, please refer to the setup guide.

# PyTorch Install
pip3 install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# Requirements Install
pip3 install -r docker/requirements.txt
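
After installation, a quick sanity check (a minimal sketch, assuming the cu124 build above) confirms the PyTorch version and that a CUDA GPU is visible:

# Sanity check: confirm the installed PyTorch build and CUDA availability.
import torch

print(torch.__version__)          # expected: 2.5.1+cu124 with the install command above
print(torch.cuda.is_available())  # should print True on a GPU machine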

 

Data preparation

python3 src/run/dataset_download.py --dataset allenai/ai2_arc --download_path data_examples
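
The script above saves the dataset under data_examples. If you only want to inspect the data, a minimal sketch using the Hugging Face datasets library directly (the "ARC-Challenge" subset name is an assumption about which ai2_arc config you want) is:

# Illustrative only: load the same dataset directly with the `datasets` library.
from datasets import load_dataset

ds = load_dataset("allenai/ai2_arc", "ARC-Challenge")
print(ds)              # available splits: train / validation / test
print(ds["train"][0])  # one example question with its answer choices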

 

LLM training

# Llama 3.1 8B LoRA fine-tuning
python3 src/run/train.py --config config/example_llama3.1_lora.yaml --mode train

# Llama 3.1 8B QLoRA fine-tuning
python3 src/run/train.py --config config/example_llama3.1_qlora.yaml --mode train

# Llama 3.1 8B full fine-tuning
python3 src/run/train.py --config config/example_llama3.1_full.yaml --mode train
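
After LoRA training finishes, and assuming the checkpoint has been converted to the Hugging Face format (see v1.5.6 in the updates above), loading the adapter for inference with PEFT might look like the sketch below; the base model name and adapter path are placeholders, not paths produced by this trainer.

# Illustrative inference sketch: load a LoRA adapter on top of its base model.
# The model name and adapter path are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "outputs/llama3.1_lora_adapter")  # placeholder adapter path
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")

inputs = tokenizer("Question: Which planet is known as the Red Planet?\nAnswer:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))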

 

 

Tutorials & Documentations

  1. Getting Started
  2. Data Preparation
  3. Training

 

 

Bug Reports

If an error occurs while executing the code, check if any of the cases below apply.
