Universal LLM Trainer

Recent updates 📣

August 2025 (v1.6.1): We updated more GPU test results.
April 2025 (v1.6.0): Universal LLM trainer now has supported Eleuther AI's harness evaluation.
April 2025 (v1.5.8): Update multi-turn data training codes of Phi3, LLaMA 3, LLaMA 3.1, Gemma, and Gemma 2.
April 2025 (v1.5.7): Update logic to not generate answers during the model validation step if they are not needed for efficient training.
April 2025 (v1.5.6): Update code to convert our model checkpoints to Hugging Face model format.
April 2025 (v1.5.5): Logging methods have been simplified. Universal LLM trainer saves optimizer states and model checkpoints, and supports two LoRA adapter saving methods: LoRA merged model and LoRA adapter only.
March 2025 (v1.5.4): Universal LLM trainer supports Llama 3.1 70B LoRA training and GPU memory usage during FSDP model training has been improved.
March 2025 (v1.5.3): QLoRA test results have been added.
March 2025 (v1.5.2): Universal LLM trainer does not support KoPolyglot and KoGemma, and support Llama 2 and Gemma 1. Also, GPU memory usage during model training has been improved.
March 2025 (v1.5.1): Universal LLM trainer does not support unnecessary fucntions (e.g. NMT, translator).
February 2025 (v1.5.0): Universal LLM trainer has supported LLaMA 2 template. Please refer to the tempaltes folder.
November 2024 (v1.4.4): Universal LLM trainer has completely supported FSDP training.
October 2024 (v1.4.3): Change QLoRA configuration.
September 2024 (v1.4.2): Fix Rank-zero bug.
September 2024 (v1.4.1): Initialize FSDP and updated DDP training codes.
July 2024 (v1.4.0): Universal LLM trainer has supported QLoRA training.
June 2024 (v1.3.0): Validation codes are updated. LLaMA 3 template and Gemma model are added.
May 2024 (v1.2.0): Universal LLM trainer has supported Phi 3 model. Also, model freezing and resumimg training are possible.
April 2024 (v1.1.0): Universal LLM trainer has supported herd of LLaMA 3 model series.
March 2024 (v1.0.0): Universal LLM trainer has supported DDP training and tensorboard logging.

Overview 📚

This repository is designed to make it easy for anyone to tune models available on Hugging Face. When a new model is released, anyone can easily implement a model wrapper to perform instruction-tuning and fine-tuning. For detailed usage instructions, please refer to the description below.

Universal LLM trainer supports full-training.
Universal LLM trainer supports LoRA fine-tuning.
Universal LLM trainer supports QLoRA fine-tuning.
Universal LLM trainer supports DDP and FSDP training strategies.

GPU memory and training speed

Below is an example of the memory requirements and training speed for different models. For the results of tests with more GPUs, please refer to this page.

Note

Training conditions:

Environments: Ubuntu 22.04.4 LTS, torch==2.5.1, transformers==4.49.0
Batch size: 2
Sequnece length: 8,192 (Without padding, fully filled tokens)
Model type: torch.bfloat16
Optimizer: torch.optim.AdamW
w/ Gradient checkpointing (Tests were done with both "torch" and "Hugging Face" gradient checkpointing methods)
Gradient accumulation:
- Full fine-tuning: 1
- LoRA: 32

Model	Tuning Method	GPU	Peak Mem. (Model Mem.)	Sec/step
Llama 3.1 8B	Full	H100 x 1	78 GiB (16 GiB)	4.7
Llama 3.1 8B	LoRA	H100 x 1	36 GiB (16 GiB)	6.4
Llama 3.1 8B **	QLoRA	H100 x 1	48 GiB (8.0 GiB)	26.1
Llama 3.1 70B *	LoRA	H100 x 2	66 GiB (CPU Offload)	40.2
Llama 3 8B	Full	H100 x 1	78 GiB (16 GiB)	4.7
Llama 3 8B	LoRA	H100 x 1	36 GiB (16 GiB)	6.4
Llama 3 8B **	QLoRA	H100 x 1	48 GiB (8.0 GiB)	26.1
Llama 2 13B *	Full	H100 x 2	31 GiB (CPU Offload)	9.5
Llama 2 13B	LoRA	H100 x 1	43 GiB (25.5 GiB)	9.8
Llama 2 13B **	QLoRA	H100 x 1	38 GiB (8.3 GiB)	43.0
Gemma 2 9B *	Full	H100 x 2	59 GiB (CPU Offload)	12.6
Gemma 2 9B	LoRA	H100 x 1	60 GiB (18 GiB)	12.9
Gemma 2 9B **	QLoRA	H100 x 1	OOM (8.4 GiB)	OOM
Gemma 7B *	Full	H100 x 2	48 GiB (CPU Offload)	8.7
Gemma 7B	LoRA	H100 x 1	51 GiB (17 GiB)	9.5
Gemma 7B **	QLoRA	H100 x 1	70 GiB (7.5 GiB)	27.4
Phi3-mini (3.8B)	Full	H100 x 1	40 GiB (8 GiB)	4.0
Phi3-mini (3.8B)	LoRA	H100 x 1	17 GiB (8 GiB)	5.0
Phi3-mini (3.8B) **	QLoRA	H100 x 1	21 GiB (3.2 GiB)	17.6

*: FSDP training with CPU offloading + 32 gradient accumuation.
**: 4-bit QLoRA training. QLoRA does not always use less GPU than LoRA, but it varies depending on sequence length and model size. Experimentally, QLoRA use less GPU when less than 1,500 sequence length. Please refer to Google document.

Quick Starts 🚀

Environment setup

We have to install PyTorch and other requirements. Please refer to more detailed setup including Docker.

# PyTorch Install
pip3 install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124

# Requirements Install
pip3 install -r docker/requirements.txt

Data preparation

python3 src/run/dataset_download.py --dataset allenai/ai2_arc --download_path data_examples

LLM training

# Llama 3.1 8B LoRA fine-tuning
python3 src/run/train.py --config config/example_llama3.1_lora.yaml --mode train

# Llama 3.1 8B QLoRA fine-tuning
python3 src/run/train.py --config config/example_llama3.1_qlora.yaml --mode train

# Llama 3.1 8B full fine-tuning
python3 src/run/train.py --config config/example_llama3.1_full.yaml --mode train

Tutorials & Documentations

Bug Reports

If an error occurs while executing the code, check if any of the cases below apply.

Bug Cases

Name		Name	Last commit message	Last commit date
Latest commit History 335 Commits
.vscode		.vscode
config		config
config_lora		config_lora
config_quant		config_quant
demo		demo
docker		docker
docs		docs
src		src
templates		templates
.gitignore		.gitignore
README.md		README.md
version.txt		version.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Universal LLM Trainer

Recent updates 📣

Overview 📚

GPU memory and training speed

Quick Starts 🚀

Environment setup

Data preparation

LLM training

Tutorials & Documentations

Bug Reports

About

Uh oh!

Releases

Packages

Uh oh!

Languages

ljm565/universal-llm-trainer

Folders and files

Latest commit

History

Repository files navigation

Universal LLM Trainer

Recent updates 📣

Overview 📚

GPU memory and training speed

Quick Starts 🚀

Environment setup

Data preparation

LLM training

Tutorials & Documentations

Bug Reports

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages