✨ Official repository for the paper "LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning"
[May 09, 2025] LEAD is publicly released!
[May 09, 2025] We released our data pool!
Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as they rely on repeatedly performing full-dataset model inference to estimate sample utility for subsequent training iterations, creating a fundamental efficiency bottleneck.
In this paper, we propose LEAD, an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop, eliminating the need for costly additional model inference. At its core, LEAD introduces Instance-Level Dynamic Uncertainty (IDU), a theoretically grounded utility function combining instantaneous training loss, gradient-based approximation of loss changes, and exponential smoothing of historical loss signals. To further scale efficiently to large datasets, LEAD employs a two-stage, coarse-to-fine selection strategy, adaptively prioritizing informative clusters through a multi-armed bandit mechanism, followed by precise fine-grained selection of high-utility samples using IDU. Extensive experiments across four diverse benchmarks show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10×.
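For intuition only, the sketch below shows one way the exponential-smoothing component of an IDU-style utility signal could be maintained inside the standard training loop, using the per-sample losses that the forward pass already produces. The class name, the smoothing factor `beta`, and the dict-based bookkeeping are illustrative assumptions, not the exact formulation from the paper.

```python
# Illustrative sketch: an exponentially smoothed per-sample loss signal,
# updated from losses already computed during training (no extra inference).
# `beta` and all names are assumptions for illustration; see the paper for IDU.

class SmoothedLossTracker:
    def __init__(self, beta: float = 0.9):
        self.beta = beta
        self.smoothed = {}  # sample_id -> exponentially smoothed loss

    def update(self, sample_id: str, loss: float) -> float:
        prev = self.smoothed.get(sample_id, loss)
        new = self.beta * prev + (1.0 - self.beta) * loss
        self.smoothed[sample_id] = new
        return new

# Usage inside a training step:
# tracker = SmoothedLossTracker(beta=0.9)
# utility_signal = tracker.update(sample_id, per_sample_loss.item())
```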
- Clone the repository:

```bash
git clone https://github.com/HKUSTDial/LEAD.git
cd LEAD
```
- Create and activate a conda environment:

```bash
conda create --name lead python=3.10.15
conda activate lead
pip install -r requirements.txt
```
We follow the open-instruct repo to prepare the instruction tuning data. In our project, we use a combination of eight training datasets: WizardLM (ShareGPT), WizardLM (Alpaca), UltraChat, Unnatural Instructions, Code Alpaca, Stanford Alpaca, MATH, and GSM8K.
A processed version of these files is available here.
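The processed files follow the open-instruct messages-style schema. The record below is only an illustrative example of that schema with placeholder values, not an actual sample from the released data pool.

```python
# Illustrative example of a messages-style record (placeholder values only).
example_record = {
    "dataset": "gsm8k",
    "id": "gsm8k_0",
    "messages": [
        {"role": "user", "content": "<instruction text>"},
        {"role": "assistant", "content": "<response text>"},
    ],
}
```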
- Edit `run_warmup_training.sh` to set your own `BASE_DIR`, `TRAIN_FILE`, `MODEL_NAME_OR_PATH`, and `OUTPUT_DIR`:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3            # Set your CUDA devices
BASE_DIR="/path/to/LEAD"                       # Root of the project
TRAIN_FILE="/path/to/random_6k_data.jsonl"     # Path to the 6k training samples randomly selected from the data pool
MODEL_NAME_OR_PATH="/path/to/pretrained_model" # Path to the pretrained model used to fine-tune the warmup model
OUTPUT_DIR="/path/to/save/warmup_model"        # Directory to save the warmup model
```
- Run the following script to train a warmup model for difficulty clustering:

```bash
bash scripts/run_warmup_training.sh
```

Note: we only use 6k samples from the data pool to train this model.
- Edit `run_scoring.sh` to set your own `BASE_DIR`, `WARMUP_MODEL_PATH`, `BASE_MODEL_PATH`, and `OUTPUT_DIR`:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3            # Set your CUDA devices
BASE_DIR="/path/to/LEAD"                       # Root of the project
WARMUP_MODEL_PATH="/path/to/warmup_model"      # Path to the warmup model trained in Step 1
BASE_MODEL_PATH="/path/to/pretrained_model"    # Path to the pretrained model used to compute the initial IU scores
OUTPUT_DIR="/path/to/save/data_pool"           # Directory to save the processed data pool used to train LEAD
```
- Run the following script to generate the initial cluster training data. It performs the following steps (a rough illustration of the clustering step is sketched after this block):
  - Calculating Difficulty Scores
  - Calculating Initial IU Scores
  - Performing Clustering

```bash
bash scripts/run_scoring.sh
```
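As a rough illustration of the clustering step, the sketch below groups samples by their scalar difficulty scores with k-means. The feature choice, the number of clusters, and the function name are assumptions for illustration and do not mirror the repository's implementation.

```python
# Illustrative sketch: cluster samples by difficulty score with k-means.
# Feature choice, cluster count, and names are assumptions, not the repo's code.
import numpy as np
from sklearn.cluster import KMeans

def cluster_by_difficulty(difficulty_scores: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Return a cluster id for each sample given its scalar difficulty score."""
    features = difficulty_scores.reshape(-1, 1)
    kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init=10)
    return kmeans.fit_predict(features)

# cluster_ids = cluster_by_difficulty(np.array(scores))
```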
- Edit `run_lead.sh` to set your own `BASE_DIR`, `MODEL_NAME_OR_PATH`, `NUM_GPUS`, and `OUTPUT_DIR`:

```bash
export CUDA_VISIBLE_DEVICES=0,1,2,3            # Set your CUDA devices
BASE_DIR="/path/to/LEAD"                       # Root of the project
MODEL_NAME_OR_PATH="/path/to/model"            # Path to the pretrained model used to train LEAD
NUM_GPUS=4                                     # Number of GPUs
OUTPUT_DIR="/path/to/save/sft_model"           # Directory to save the SFT model
```
- Run the following script:

```bash
bash scripts/run_lead.sh
```
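For intuition on the coarse-to-fine selection described in the paper, here is a minimal UCB-style bandit over clusters. The reward definition (e.g., the average IDU of the samples drawn from a cluster), the exploration constant, and all names are illustrative assumptions, not the exact mechanism implemented in `run_lead.sh`.

```python
# Illustrative sketch: a UCB-style bandit that adaptively prioritizes clusters.
# Reward signal and hyperparameters are assumptions, not the repo's mechanism.
import math

class ClusterBandit:
    def __init__(self, n_clusters: int, c: float = 1.0):
        self.c = c
        self.counts = [0] * n_clusters
        self.values = [0.0] * n_clusters  # running mean reward per cluster
        self.total = 0

    def select(self) -> int:
        # Play each cluster once before applying the UCB rule.
        for k, n in enumerate(self.counts):
            if n == 0:
                return k
        ucb = [
            self.values[k] + self.c * math.sqrt(math.log(self.total) / self.counts[k])
            for k in range(len(self.counts))
        ]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, k: int, reward: float) -> None:
        self.counts[k] += 1
        self.total += 1
        self.values[k] += (reward - self.values[k]) / self.counts[k]
```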
We follow the instructions in the open-instruct folder to evaluate the performance of the model trained on the selected data.
- Merge the LoRA model:

```bash
bash scripts/lora_merge.sh
```
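`scripts/lora_merge.sh` wraps this step. A rough Python equivalent using Hugging Face `transformers` and `peft` might look like the following; the paths and dtype are placeholders, and the script's internals may differ.

```python
# Rough sketch of merging a LoRA adapter into its base model with peft.
# Paths are placeholders; this may differ from what lora_merge.sh does internally.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("/path/to/pretrained_model", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "/path/to/save/sft_model")  # LoRA adapter directory
merged = model.merge_and_unload()                                   # fold LoRA weights into the base model
merged.save_pretrained("/path/to/save/merged_model")

tokenizer = AutoTokenizer.from_pretrained("/path/to/pretrained_model")
tokenizer.save_pretrained("/path/to/save/merged_model")
```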
- Evaluate on diverse benchmarks:

```bash
bash scripts/eval.sh
```
If you have any questions about the code or the paper, feel free to contact: xlin420@connect.hkust-gz.edu.cn. If you encounter any problems when using the code or want to report a bug, you can open an issue. Please describe the problem in detail so that we can help you better and faster!
If you find our work useful or inspiring, please kindly cite:
@misc{lin2025leaditerativedataselection,
title={LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning},
author={Xiaotian Lin and Yanlin Qi and Yizhang Zhu and Themis Palpanas and Chengliang Chai and Nan Tang and Yuyu Luo},
year={2025},
eprint={2505.07437},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.07437},
}