Self-Play Fine-Tuning for LLM Agent

🌀 About SPIN

SPIN uses a self-play mechanism that lets an LLM improve by playing against its previous iterations, without requiring any human-annotated preference data beyond the SFT dataset itself. More specifically, the LLM generates its own training data from its previous iterations and refines its policy by learning to distinguish these self-generated responses from the original SFT data.
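The discrimination step can be illustrated numerically. The following is a minimal sketch, assuming the logistic form of the objective described in the SPIN paper; the function name `spin_loss`, its argument names, and the default value of `lam` are hypothetical and are not taken from this repository's code.

```python
import math


def spin_loss(logp_real, logp_real_ref, logp_gen, logp_gen_ref, lam=0.1):
    """Logistic loss on the gap between real and self-generated log-ratios.

    logp_real, logp_gen         : summed log-probability of the SFT response and of
                                  the self-generated response under the model being trained
    logp_real_ref, logp_gen_ref : the same quantities under the frozen
                                  previous-iteration model
    lam                         : scaling factor (hypothetical default)
    """
    real_ratio = logp_real - logp_real_ref
    gen_ratio = logp_gen - logp_gen_ref
    margin = lam * (real_ratio - gen_ratio)
    # log(1 + exp(-margin)), written so it stays stable for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Identical models give a zero margin, so the loss is log(2).
print(spin_loss(-2.0, -2.0, -2.0, -2.0))
# Raising the real response's likelihood and lowering the generated one's reduces the loss.
print(spin_loss(-1.0, -2.0, -3.0, -2.0))
```

The loss falls as the trained model assigns relatively more probability to the SFT response than to its own earlier output, which is the sense in which the model "plays against" its previous iteration.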

Average score of SPIN at different iterations on the HuggingFace Open LLM leaderboard.

SPIN can significantly enhance the performance of an LLM after SFT across various benchmarks, outperforming the model trained with direct preference optimization (DPO) on labelled preference datasets. The approach is theoretically grounded, ensuring that the LLM aligns with the target data distribution, and empirically validated through extensive evaluations on multiple datasets.

Performance comparison with DPO training across the six benchmark datasets. SPIN at iteration 0 achieves comparable performance to DPO training with 62k new data. At iteration 1, SPIN has already surpassed DPO training on the majority of datasets.

For more details, you can check our paper here.

Setup

The following steps set up the environment needed to run our code.

Create a Python virtual environment with Conda:

conda create -n spina python=3.10
conda activate spina

Install the Python dependencies:

python -m pip install .
python -m pip install flash-attn --no-build-isolation

Log in to your Hugging Face account to download models:

huggingface-cli login --token "${your_access_token}"

Data

Define tools in spin/tools/
Define tasks in spin/tasks/
Collect data & run experiments via spin/generation_fireact.py
Results will be saved in trajs/

Usage

For SPIN, we generate all synthetic data at once for an iteration, and fine-tune the LLM based on the real and synthetic data pairs.
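The pairing of real and synthetic data can be sketched as below. This is an illustration only: the helper `build_pairs` and the field names `prompt`, `response`, `real` and `generated` are assumptions made for the example, not the schema used by the scripts in this repository.

```python
def build_pairs(sft_records, generated_responses):
    """Pair each SFT response with the model's own response to the same prompt."""
    if len(sft_records) != len(generated_responses):
        raise ValueError("need exactly one generated response per SFT record")
    pairs = []
    for record, generated in zip(sft_records, generated_responses):
        pairs.append({
            "prompt": record["prompt"],
            "real": record["response"],  # ground-truth response from the SFT data
            "generated": generated,      # synthetic response from the previous iteration
        })
    return pairs


sft = [{"prompt": "What is 2 + 2?", "response": "4"}]
print(build_pairs(sft, ["5"]))
```

Because all synthetic responses for an iteration are produced before training starts, the pairing is a single offline pass over the dataset.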

Step 0 (optional): Reformatting SFT dataset

python spin/reformat.py [options]

Options

--data: path to the SFT dataset (local directory or Hugging Face dataset name)
- default: HuggingFaceH4/ultrachat_200k
--output_dir: local directory for the reformatted data files
- default: UCLA-AGI/SPIN_iter0

🔍 Note: To run SPIN on the entire HuggingFaceH4/ultrachat_200k dataset instead of our 50k subset, reformat the original data with spin/reformat.py. To use another dataset, convert it into the same format and continue with the following steps.

Step 1: Fine-tuning

Setup

Set up a SerpAPI key and store it in an environment variable (see here):

export SERPAPI_API_KEY=<YOUR_KEY>
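Before launching a long run, it can help to confirm that the key is actually visible to Python. The check below is a small sketch; the helper name `require_env` is hypothetical and is not part of this repository.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the scripts")
    return value
```

Calling `require_env("SERPAPI_API_KEY")` at the top of a script fails immediately if the variable is missing, rather than partway through a run.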

Example.

bash scripts/finetune.sh

Step 2: Generation

Generate the data:

mkdir -p trajs  # create the directory for saving generated CoT data
bash scripts/generate.sh

Options

--modelpath: load the base model checkpoint for generation.
- default: alignment-handbook/zephyr-7b-sft-full
--peftpath: the peft adapter model path
- default: outputs/iter0-ckpt

The generated data is in JSON format, keyed by task ID. Each entry contains the following attributes:

{
  "3687": {
    "reward": false,
    "em": false,
    "f1": 0,
    "gt": "beyond clouds",
    "pred": "fugitive",
    "prompt": "",
    "traj": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat movie did actress Irene Jacob complete before the American action crime thriller film directed by Stuart Bird?\n\n### Response:\nThought: I need to search for the movie Irene Jacob completed before the American action crime thriller film directed by Stuart Bird.\nAction: search[Irene Jacob movie before American action crime thriller film directed by Stuart Bird]\nObservation: The fugitive is only helped by his sweetheart ( Irene Jacob ). The picture is the following to \u00a8The fugitive\u00a8 ( by Andrew Davis ) that's an adaptation based on ...\nThought: The movie Irene Jacob completed before the American action crime thriller film directed by Stuart Bird is The Fugitive.\nAction: finish[The Fugitive]\nObservation: Episode finished, reward = False\n",
    "traj_by_line": [
      "Below is an instruction that describes a task. Write a response that appropriately completes the request.",
      "",
      "### Instruction:",
      "What movie did actress Irene Jacob complete before the American action crime thriller film directed by Stuart Bird?",
      "",
      "### Response:",
      "Thought: I need to search for the movie Irene Jacob completed before the American action crime thriller film directed by Stuart Bird.",
      "Action: search[Irene Jacob movie before American action crime thriller film directed by Stuart Bird]",
      "Observation: The fugitive is only helped by his sweetheart ( Irene Jacob ). The picture is the following to \u00a8The fugitive\u00a8 ( by Andrew Davis ) that's an adaptation based on ...",
      "Thought: The movie Irene Jacob completed before the American action crime thriller film directed by Stuart Bird is The Fugitive.",
      "Action: finish[The Fugitive]",
      "Observation: Episode finished, reward = False",
      ""
    ]
  }
}
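A generated file in this format can be summarized with a few lines of Python. The sketch below assumes that `em` marks an exact match against `gt` and that `f1` is a numeric overlap score, as the field names suggest; the helper names `load_trajs` and `summarize` are hypothetical, and the actual file name inside trajs/ depends on the run.

```python
import json


def load_trajs(path):
    """Read one generated trajectory file (a JSON object keyed by task ID)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize(trajs):
    """Return episode count, exact-match rate and mean F1 for a dict of trajectories."""
    n = len(trajs)
    if n == 0:
        return {"episodes": 0, "em_rate": 0.0, "mean_f1": 0.0}
    em_hits = sum(1 for t in trajs.values() if t["em"])
    f1_total = sum(float(t["f1"]) for t in trajs.values())
    return {"episodes": n, "em_rate": em_hits / n, "mean_f1": f1_total / n}


# A trimmed copy of the entry shown above.
sample = {
    "3687": {"reward": False, "em": False, "f1": 0, "gt": "beyond clouds", "pred": "fugitive"},
}
print(summarize(sample))
```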

Convert the data into alpaca format

Convert the generated file into Alpaca format for subsequent fine-tuning:

python spin/convert_alpaca.py
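For orientation, the conversion can be pictured as splitting each `traj` string at its `### Instruction:` and `### Response:` markers and storing the parts under the usual Alpaca keys (`instruction`, `input`, `output`). The sketch below is an assumption about the general shape of that step, not a copy of spin/convert_alpaca.py, which may filter or post-process trajectories differently; `traj_to_alpaca` is a hypothetical name.

```python
INSTRUCTION_TAG = "### Instruction:\n"
RESPONSE_TAG = "### Response:\n"


def traj_to_alpaca(traj):
    """Split one trajectory string into an Alpaca-style record."""
    _, found_instruction, rest = traj.partition(INSTRUCTION_TAG)
    if not found_instruction:
        raise ValueError("trajectory lacks the '### Instruction:' marker")
    instruction, found_response, response = rest.partition(RESPONSE_TAG)
    if not found_response:
        raise ValueError("trajectory lacks the '### Response:' marker")
    return {
        "instruction": instruction.strip(),
        "input": "",  # the question is already part of the instruction
        "output": response.strip(),
    }


example_traj = (
    "Below is an instruction that describes a task.\n\n"
    "### Instruction:\nWhat is X?\n\n"
    "### Response:\nThought: I need to search for X.\nAction: finish[Y]\n"
)
print(traj_to_alpaca(example_traj))
```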

🚀 Faster generation with vLLM

Alternatively, use the following example script to speed up generation with vLLM. A larger frac_len can be used with vLLM.

bash scripts/generate_vllm.sh

Thanks to @sumo43 for implementing vLLM for generation.

To be updated...

Reproducing Our Results

To help reproduce our results, we provide the scripts for all four iterations of our study. These scripts are pre-configured with the exact parameters and model versions used in our paper. For each iteration, the base model is initialized with the version released on 🤗 HuggingFace, which can be found at the following links:

Dataset	Download
SPIN_iter0	🤗 HuggingFace
SPIN_iter1	🤗 HuggingFace
SPIN_iter2	🤗 HuggingFace
SPIN_iter3	🤗 HuggingFace

To execute the full pipeline using your locally trained models as the base, modify the model_name_or_path parameter in the configuration files to point to your model's path.

To start the full fine-tuning process, run the corresponding script from your terminal:

bash scripts/finetune.sh
bash scripts/finetune_iter1.sh
bash scripts/finetune_iter2.sh
bash scripts/finetune_iter3.sh

By following these steps, you should be able to reproduce our results.

Evaluation

For our evaluation on the Open LLM Leaderboard, please use the lm-evaluation-harness repository at v0.4.0. Note that we set the number of few-shot examples to match the Leaderboard's instructions. Different versions of the harness produce different scores, but the trend remains the same.
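As a convenience, the per-benchmark commands can be generated programmatically. The sketch below assumes the `lm_eval` command-line entry point of lm-evaluation-harness v0.4.0 and the few-shot counts published for the six Open LLM Leaderboard benchmarks; the task names should be checked against the installed harness version, and the model path and batch size are placeholders.

```python
import shlex

# Few-shot counts published for the six Open LLM Leaderboard benchmarks.
FEWSHOT = {
    "arc_challenge": 25,
    "hellaswag": 10,
    "mmlu": 5,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "gsm8k": 5,
}


def eval_commands(model_path, batch_size=8):
    """Build one lm_eval command line per benchmark."""
    commands = []
    for task, shots in FEWSHOT.items():
        args = [
            "lm_eval",
            "--model", "hf",
            "--model_args", f"pretrained={model_path}",
            "--tasks", task,
            "--num_fewshot", str(shots),
            "--batch_size", str(batch_size),
        ]
        commands.append(shlex.join(args))
    return commands


for command in eval_commands("path/to/your/model"):  # placeholder path
    print(command)
```

Running each benchmark as a separate command keeps the few-shot count explicit per task, since the counts differ across benchmarks.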

Acknowledgement

This repo is built upon Self-Play Fine-Tuning (SPIN) and FireAct. We thank the authors for their great work.
