🔥 LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning

🚧 Please note that this repository is still under construction! 🚧

✨ Official repository for the paper "LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning"


Overview Figure

📢 News

[May 09, 2025] LEAD is publicly released!
[May 09, 2025] We released our data pool!

📋 Overview

Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as they rely on repeatedly performing full-dataset model inference to estimate sample utility for subsequent training iterations, creating a fundamental efficiency bottleneck.

In this paper, we propose LEAD, an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop, eliminating the need for costly additional model inference. At its core, LEAD introduces Instance-Level Dynamic Uncertainty (IDU), a theoretically grounded utility function combining instantaneous training loss, gradient-based approximation of loss changes, and exponential smoothing of historical loss signals. To further scale efficiently to large datasets, LEAD employs a two-stage, coarse-to-fine selection strategy, adaptively prioritizing informative clusters through a multi-armed bandit mechanism, followed by precise fine-grained selection of high-utility samples using IDU. Extensive experiments across four diverse benchmarks show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10×.

Overview Figure
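
To make the IDU idea more concrete, below is a minimal, illustrative Python sketch of an exponentially smoothed, gradient-informed utility update. The function name idu_update, the smoothing factor gamma, and the exact way the terms are combined are assumptions for exposition only; see the paper and released code for the actual formulation.

    # Illustrative sketch of an Instance-Level Dynamic Uncertainty (IDU) update.
    # The exact combination of terms here is an assumption for exposition, not
    # the released implementation.
    def idu_update(idu_prev: float, loss_t: float, loss_prev: float, gamma: float = 0.9) -> float:
        """Update one sample's IDU score after it is trained on at step t.

        Combines the instantaneous loss, a first-order approximation of the loss
        change since the sample was last seen, and exponential smoothing of the
        historical signal.
        """
        delta = loss_t - loss_prev      # approximate loss change for this sample
        signal = loss_t + delta         # current loss plus its estimated trend
        return gamma * idu_prev + (1.0 - gamma) * signal

    # Toy usage: the score tracks the loss but reacts smoothly to changes.
    score, prev_loss = 1.0, 1.0
    for loss in [0.9, 0.7, 0.65, 0.64]:
        score = idu_update(score, loss, prev_loss)
        prev_loss = loss
        print(f"loss={loss:.2f}  IDU={score:.3f}")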

🔗 Quick Links

⚙️ Environment Setup

  1. Clone the repository

    git clone https://github.com/HKUSTDial/LEAD.git
    cd LEAD
  2. Create and activate a conda environment:

    conda create --name lead python=3.10.15
    conda activate lead
    pip install -r requirements.txt

📥 Data Preparation

We follow the open-instruct repo to prepare the instruction tuning data. Our project uses a combination of eight training datasets: WizardLM (ShareGPT), WizardLM (Alpaca), UltraChat, Unnatural Instructions, Code Alpaca, Stanford Alpaca, MATH, and GSM8K.

A processed version of these files is available here.
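
For reference, data processed with open-instruct conventions is usually stored as one JSON object per line in a "messages" format. The record below is a hypothetical example only; the field names are assumptions, so check the released files for their actual schema.

    # Hypothetical example of one processed record in an open-instruct style
    # "messages" format; field names are assumptions, not the released schema.
    import json

    record = {
        "dataset": "gsm8k",
        "id": "gsm8k_0",
        "messages": [
            {"role": "user", "content": "Natalia sold clips to 48 of her friends ..."},
            {"role": "assistant", "content": "Natalia sold 48 / 2 = 24 clips in May ..."},
        ],
    }

    with open("example.jsonl", "w") as f:
        f.write(json.dumps(record) + "\n")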

🌠 Running LEAD

Step 1: Warmup Training

  1. Edit run_warmup_training.sh to set your own BASE_DIR, TRAIN_FILE, MODEL_NAME_OR_PATH and OUTPUT_DIR

    export CUDA_VISIBLE_DEVICES=0,1,2,3                 # Set your CUDA devices
    BASE_DIR="/path/to/LEAD"                            # Root directory of the project
    TRAIN_FILE="/path/to/random_6k_data.jsonl"          # Path to the 6k training samples randomly selected from the data pool
    MODEL_NAME_OR_PATH="/path/to/pretrained_model"      # Dir of the pretrained model used to train the warmup model
    OUTPUT_DIR="/path/to/save/warmup_model"             # Dir to save the warmup model
    
  2. Run the following script to train a warmup model for difficulty clustering:

    bash scripts/run_warmup_training.sh

Note: We use only 6k samples from the data pool to train this model.
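
If you need to build the random 6k warmup subset yourself, the sketch below shows one way to do it, assuming the data pool is a single JSONL file; all paths, the seed, and the sample size are placeholders.

    # Sketch: randomly sample 6k examples from the data pool for warmup training.
    # Paths, seed, and sample size are placeholders; adjust them to your setup.
    import json
    import random

    random.seed(42)

    with open("/path/to/data_pool.jsonl") as f:
        pool = [json.loads(line) for line in f]

    subset = random.sample(pool, k=6000)

    with open("/path/to/random_6k_data.jsonl", "w") as f:
        for example in subset:
            f.write(json.dumps(example) + "\n")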

Step 2: Data Processing Pipeline (Offline)

  1. Edit run_scoring.sh to set your own BASE_DIR, WARMUP_MODEL_PATH, BASE_MODEL_PATH and OUTPUT_DIR

    export CUDA_VISIBLE_DEVICES=0,1,2,3              # Set your CUDA devices
    BASE_DIR="/path/to/LEAD"                         # Root directory of the project
    WARMUP_MODEL_PATH="/path/to/warmup_model"        # Dir of the warmup model trained in Step 1
    BASE_MODEL_PATH="/path/to/pretrained_model"      # Dir of the pretrained model used to compute the initial IU scores
    OUTPUT_DIR="/path/to/save/data_pool"             # Dir to save the processed data pool used to train LEAD
    
  2. Run the following script to generate the initial clustered training data (an illustrative sketch of the clustering step follows this block). The script performs the following steps:

    • Calculating Difficulty Scores
    • Calculating Initial IU Scores
    • Performing Clustering

    bash scripts/run_scoring.sh
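
As a rough illustration of the clustering step only (not the released implementation), one could group samples by their difficulty scores with k-means; how the scores are computed and the number of clusters are assumptions here.

    # Illustrative sketch: group samples into difficulty clusters with k-means.
    # The real pipeline in scripts/run_scoring.sh may compute difficulty scores
    # and clusters differently; this only conveys the general idea.
    import numpy as np
    from sklearn.cluster import KMeans

    # One difficulty score per sample, e.g. the warmup model's loss on it.
    difficulty_scores = np.random.rand(1000, 1)           # placeholder values

    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(difficulty_scores)
    cluster_ids = kmeans.labels_                           # cluster assignment per sample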

Step 3: Two-Stage Coarse-to-Fine Data Selection and Training (Online)

  1. Edit run_lead.sh to set your own BASE_DIR, MODEL_NAME_OR_PATH, NUM_GPUS and OUTPUT_DIR.

    export CUDA_VISIBLE_DEVICES=0,1,2,3        # Set your CUDA devices
    BASE_DIR="/path/to/LEAD"                   # Root directory of the project
    MODEL_NAME_OR_PATH="/path/to/model"        # Dir of the pretrained model to fine-tune with LEAD
    NUM_GPUS=4                                 # Number of GPUs
    OUTPUT_DIR="/path/to/save/sft_model"       # Dir to save the SFT model
    
  2. Run the following script

    bash scripts/run_lead.sh
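
To illustrate the two-stage idea behind this step (a coarse, bandit-style choice of cluster followed by fine-grained selection of high-IDU samples within it), here is a simplified sketch. The UCB form, the reward bookkeeping, and the batch size are assumptions and not the exact procedure in run_lead.sh.

    # Simplified sketch of two-stage coarse-to-fine selection.
    # Stage 1: pick a cluster with a UCB-style multi-armed bandit.
    # Stage 2: take the highest-IDU samples inside the chosen cluster.
    # Constants and the reward definition are illustrative assumptions.
    import math

    def select_batch(clusters, idu_scores, counts, rewards, step, batch_size=64):
        """clusters: {cluster_id: [sample_id, ...]}; idu_scores: {sample_id: float};
        counts / rewards: per-cluster bandit statistics."""

        def ucb(c):
            if counts[c] == 0:
                return float("inf")                   # explore unseen clusters first
            mean_reward = rewards[c] / counts[c]
            return mean_reward + math.sqrt(2 * math.log(step + 1) / counts[c])

        chosen = max(clusters, key=ucb)               # Stage 1 (coarse)

        ranked = sorted(clusters[chosen], key=lambda s: idu_scores[s], reverse=True)
        return chosen, ranked[:batch_size]            # Stage 2 (fine)

After training on the returned batch, one would update the chosen cluster's counts and rewards (for example, with the observed loss reduction) and refresh the IDU scores of the samples that were just trained on.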

Step 4: Evaluation

We follow the instructions in the open-instruct folder to evaluate the performance of the model trained on the selected data.

  1. Merge the LoRA adapter into the base model (see the sketch after this list)

    bash scripts/lora_merge.sh
  2. Evaluate on diverse benchmarks

    bash scripts/eval.sh
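
For reference, merging a LoRA adapter into its base model is typically done with PEFT's merge_and_unload; the paths below are placeholders and scripts/lora_merge.sh may differ in detail.

    # Sketch of a typical LoRA merge with PEFT; all paths are placeholders and
    # the repository's lora_merge.sh may handle this differently.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained("/path/to/pretrained_model")
    model = PeftModel.from_pretrained(base, "/path/to/save/sft_model")    # LoRA adapter dir

    merged = model.merge_and_unload()            # fold the LoRA weights into the base model
    merged.save_pretrained("/path/to/merged_model")

    tokenizer = AutoTokenizer.from_pretrained("/path/to/pretrained_model")
    tokenizer.save_pretrained("/path/to/merged_model")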

✉️ Contact

If you have any questions related to the code or the paper, feel free to contact xlin420@connect.hkust-gz.edu.cn. If you encounter any problems when using the code or want to report a bug, you can open an issue. Please describe the problem in detail so that we can help you better and more quickly!

📝 Citation

If you find our work useful or inspiring, please kindly cite:

@misc{lin2025leaditerativedataselection,
      title={LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning}, 
      author={Xiaotian Lin and Yanlin Qi and Yizhang Zhu and Themis Palpanas and Chengliang Chai and Nan Tang and Yuyu Luo},
      year={2025},
      eprint={2505.07437},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.07437}, 
}
