🔥 LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning

🚧 Please note that this repository is still under construction! 🚧

✨ Official repository for the paper "LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning"


Overview Figure

📢 News

[May 09, 2025] LEAD is publicly released!
[May 09, 2025] We released our data pool!

📋 Overview

Instruction tuning has emerged as a critical paradigm for improving the capabilities and alignment of large language models (LLMs). However, existing iterative model-aware data selection methods incur significant computational overhead, as they rely on repeatedly performing full-dataset model inference to estimate sample utility for subsequent training iterations, creating a fundamental efficiency bottleneck.

In this paper, we propose LEAD, an efficient iterative data selection framework that accurately estimates sample utility entirely within the standard training loop, eliminating the need for costly additional model inference. At its core, LEAD introduces Instance-Level Dynamic Uncertainty (IDU), a theoretically grounded utility function combining instantaneous training loss, gradient-based approximation of loss changes, and exponential smoothing of historical loss signals. To further scale efficiently to large datasets, LEAD employs a two-stage, coarse-to-fine selection strategy, adaptively prioritizing informative clusters through a multi-armed bandit mechanism, followed by precise fine-grained selection of high-utility samples using IDU. Extensive experiments across four diverse benchmarks show that LEAD significantly outperforms state-of-the-art methods, improving average model performance by 6.1%-10.8% while using only 2.5% of the training data and reducing overall training time by 5-10×.

Overview Figure
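
To make the IDU idea more concrete, below is a minimal, illustrative Python sketch of an exponentially smoothed, gradient-informed utility update. The function name idu_update, the smoothing factor gamma, and the exact way the terms are combined are assumptions for exposition only; see the paper and released code for the actual formulation.

    # Illustrative sketch of an Instance-Level Dynamic Uncertainty (IDU) update.
    # The exact combination of terms here is an assumption for exposition, not
    # the released implementation.
    def idu_update(idu_prev: float, loss_t: float, loss_prev: float, gamma: float = 0.9) -> float:
        """Update one sample's IDU score after it is trained on at step t.

        Combines the instantaneous loss, a first-order approximation of the loss
        change since the sample was last seen, and exponential smoothing of the
        historical signal.
        """
        delta = loss_t - loss_prev      # approximate loss change for this sample
        signal = loss_t + delta         # current loss plus its estimated trend
        return gamma * idu_prev + (1.0 - gamma) * signal

    # Toy usage: the score tracks the loss but reacts smoothly to changes.
    score, prev_loss = 1.0, 1.0
    for loss in [0.9, 0.7, 0.65, 0.64]:
        score = idu_update(score, loss, prev_loss)
        prev_loss = loss
        print(f"loss={loss:.2f}  IDU={score:.3f}")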

🔗 Quick Links

⚙️ Environment Setup

  1. Clone the repository

    git clone https://github.com/HKUSTDial/LEAD.git
    cd LEAD
  2. Create and activate a conda environment:

    conda create --name lead python=3.10.15
    conda activate lead
    pip install -r requirements.txt

📥 Data Preparation

We follow the open-instruct repo to prepare the instruction tuning data. Our project uses a combination of eight training datasets: WizardLM (ShareGPT), WizardLM (Alpaca), UltraChat, Unnatural Instructions, Code Alpaca, Stanford Alpaca, MATH, and GSM8K.

A processed version of these files is available here.
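
For reference, data processed with open-instruct conventions is usually stored as one JSON object per line in a "messages" format. The record below is a hypothetical example only; the field names are assumptions, so check the released files for their actual schema.

    # Hypothetical example of one processed record in an open-instruct style
    # "messages" format; field names are assumptions, not the released schema.
    import json

    record = {
        "dataset": "gsm8k",
        "id": "gsm8k_0",
        "messages": [
            {"role": "user", "content": "Natalia sold clips to 48 of her friends ..."},
            {"role": "assistant", "content": "Natalia sold 48 / 2 = 24 clips in May ..."},
        ],
    }

    with open("example.jsonl", "w") as f:
        f.write(json.dumps(record) + "\n")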

🌠 Running LEAD

Step 1: Warmup Training

  1. Edit run_warmup_training.sh to set your own BASE_DIR, TRAIN_FILE, MODEL_NAME_OR_PATH and OUTPUT_DIR

    export CUDA_VISIBLE_DEVICES=0,1,2,3                 # Set your CUDA devices
    BASE_DIR="/path/to/LEAD"                            # Root directory of the project
    TRAIN_FILE="/path/to/random_6k_data.jsonl"          # Path to the 6k training samples randomly selected from the data pool
    MODEL_NAME_OR_PATH="/path/to/pretrained_model"      # Dir of the pretrained model used to train the warmup model
    OUTPUT_DIR="/path/to/save/warmup_model"             # Dir to save the warmup model
    
  2. Run the following script to train a warmup model for difficulty clustering:

    bash scripts/run_warmup_training.sh

Note: We use only 6k samples from the data pool to train this model.
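
If you need to build the random 6k warmup subset yourself, the sketch below shows one way to do it, assuming the data pool is a single JSONL file; all paths, the seed, and the sample size are placeholders.

    # Sketch: randomly sample 6k examples from the data pool for warmup training.
    # Paths, seed, and sample size are placeholders; adjust them to your setup.
    import json
    import random

    random.seed(42)

    with open("/path/to/data_pool.jsonl") as f:
        pool = [json.loads(line) for line in f]

    subset = random.sample(pool, k=6000)

    with open("/path/to/random_6k_data.jsonl", "w") as f:
        for example in subset:
            f.write(json.dumps(example) + "\n")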

Step 2: Data Processing Pipeline (Offline)

  1. Edit run_scoring.sh to set your own BASE_DIR, WARMUP_MODEL_PATH, BASE_MODEL_PATH and OUTPUT_DIR

    export CUDA_VISIBLE_DEVICES=0,1,2,3              # Set your CUDA devices
    BASE_DIR="/path/to/LEAD"                         # Root directory of the project
    WARMUP_MODEL_PATH="/path/to/warmup_model"        # Dir of the warmup model trained in Step 1
    BASE_MODEL_PATH="/path/to/pretrained_model"      # Dir of the pretrained model used to compute the initial IU scores
    OUTPUT_DIR="/path/to/save/data_pool"             # Dir to save the processed data pool used to train LEAD
    
  2. Run the following script to generate the initial clustered training data (an illustrative sketch of the clustering step follows this block). The script performs the following steps:

    • Calculating Difficulty Scores
    • Calculating Initial IU Scores
    • Performing Clustering

    bash scripts/run_scoring.sh
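
As a rough illustration of the clustering step only (not the released implementation), one could group samples by their difficulty scores with k-means; how the scores are computed and the number of clusters are assumptions here.

    # Illustrative sketch: group samples into difficulty clusters with k-means.
    # The real pipeline in scripts/run_scoring.sh may compute difficulty scores
    # and clusters differently; this only conveys the general idea.
    import numpy as np
    from sklearn.cluster import KMeans

    # One difficulty score per sample, e.g. the warmup model's loss on it.
    difficulty_scores = np.random.rand(1000, 1)           # placeholder values

    kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(difficulty_scores)
    cluster_ids = kmeans.labels_                           # cluster assignment per sample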

Step 3: Two-Stage Coarse-to-Fine Data Selection and Training (Online)

  1. Edit run_lead.sh to set your own BASE_DIR, MODEL_NAME_OR_PATH, NUM_GPUS and OUTPUT_DIR.

    export CUDA_VISIBLE_DEVICES=0,1,2,3        # Set your CUDA devices
    BASE_DIR="/path/to/LEAD"                   # Root directory of the project
    MODEL_NAME_OR_PATH="/path/to/model"        # Dir of the pretrained model to fine-tune with LEAD
    NUM_GPUS=4                                 # Number of GPUs
    OUTPUT_DIR="/path/to/save/sft_model"       # Dir to save the SFT model
    
  2. Run the following script

    bash scripts/run_lead.sh
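
To illustrate the two-stage idea behind this step (a coarse, bandit-style choice of cluster followed by fine-grained selection of high-IDU samples within it), here is a simplified sketch. The UCB form, the reward bookkeeping, and the batch size are assumptions and not the exact procedure in run_lead.sh.

    # Simplified sketch of two-stage coarse-to-fine selection.
    # Stage 1: pick a cluster with a UCB-style multi-armed bandit.
    # Stage 2: take the highest-IDU samples inside the chosen cluster.
    # Constants and the reward definition are illustrative assumptions.
    import math

    def select_batch(clusters, idu_scores, counts, rewards, step, batch_size=64):
        """clusters: {cluster_id: [sample_id, ...]}; idu_scores: {sample_id: float};
        counts / rewards: per-cluster bandit statistics."""

        def ucb(c):
            if counts[c] == 0:
                return float("inf")                   # explore unseen clusters first
            mean_reward = rewards[c] / counts[c]
            return mean_reward + math.sqrt(2 * math.log(step + 1) / counts[c])

        chosen = max(clusters, key=ucb)               # Stage 1 (coarse)

        ranked = sorted(clusters[chosen], key=lambda s: idu_scores[s], reverse=True)
        return chosen, ranked[:batch_size]            # Stage 2 (fine)

After training on the returned batch, one would update the chosen cluster's counts and rewards (for example, with the observed loss reduction) and refresh the IDU scores of the samples that were just trained on.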

Step 4: Evaluation

We follow the instructions in the open-instruct folder to evaluate the performance of the model trained on the selected data.

  1. Merge the LoRA adapter into the base model (see the sketch after this list)

    bash scripts/lora_merge.sh
  2. Evaluate on diverse benchmarks

    bash scripts/eval.sh
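
For reference, merging a LoRA adapter into its base model is typically done with PEFT's merge_and_unload; the paths below are placeholders and scripts/lora_merge.sh may differ in detail.

    # Sketch of a typical LoRA merge with PEFT; all paths are placeholders and
    # the repository's lora_merge.sh may handle this differently.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained("/path/to/pretrained_model")
    model = PeftModel.from_pretrained(base, "/path/to/save/sft_model")    # LoRA adapter dir

    merged = model.merge_and_unload()            # fold the LoRA weights into the base model
    merged.save_pretrained("/path/to/merged_model")

    tokenizer = AutoTokenizer.from_pretrained("/path/to/pretrained_model")
    tokenizer.save_pretrained("/path/to/merged_model")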

✉️ Contact

If you have any questions related to the code or the paper, feel free to contact xlin420@connect.hkust-gz.edu.cn. If you encounter any problems when using the code or want to report a bug, you can open an issue. Please describe the problem in detail so that we can help you better and more quickly!

📝 Citation

If you find our work useful or inspiring, please kindly cite:

@misc{lin2025leaditerativedataselection,
      title={LEAD: Iterative Data Selection for Efficient LLM Instruction Tuning}, 
      author={Xiaotian Lin and Yanlin Qi and Yizhang Zhu and Themis Palpanas and Chengliang Chai and Nan Tang and Yuyu Luo},
      year={2025},
      eprint={2505.07437},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2505.07437}, 
}
