This repository contains the code for our work "References Improve LLM Alignment in Non-Verifiable Domains."
Please run `pip install -r requirements.txt` to install the required packages.
For training, you will need at least 8 GPUs with 48 GB of memory each. The code has been tested on a machine with 8 NVIDIA A6000 Ada GPUs.
To run the self-improvement training experiment, please use the following command: `bash self_improve.sh`.
The `self_improve.sh` script performs preference optimization in which the model improves itself using an LLM judge. It runs the following steps:
- Sampling candidate outputs from the LLM.
- Scoring the candidate outputs using the model itself as a judge.
- Data processing and precomputing the log probabilities of the output pairs.
- Training: fine-tuning the LLM with DPO on the resulting preference pairs (a minimal loss sketch is shown after this list).
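For orientation, the sketch below shows the standard DPO objective that the training step optimizes, given precomputed log probabilities of the chosen and rejected outputs under the policy and a frozen reference model. The function and tensor names are illustrative and are not taken from the repository's code.

```python
# Minimal sketch of the DPO objective; tensor names are illustrative,
# and the actual dpo.py / losses.py implementation may differ.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Each input is a 1-D tensor of summed token log probabilities of the
    chosen / rejected completion under the policy or the frozen reference.
    """
    # Log-ratio of policy vs. reference for each completion.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```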
Files in this repository:
- `self_improve.sh`: Script for running the self-improvement training experiment.
- `data_processing.py`: Code for post-processing the preference-model annotations into training data.
- `data_utils.py`: Utility functions for training data loading.
- `get_logprobs.py`: Script for extracting log probabilities from an LLM/policy.
- `losses.py`: Loss functions.
- `dpo.py`: DPO training.
- `mle.py`: MLE training.
- `sampling.py`: Sampling candidate outputs from an LLM.
- `scoring.py`: Scoring output pairs using a preference model.
- `utils.py`: Utility functions.
- `vllm_model.py`: vLLM model definition.
- `deepspeed.conf`: DeepSpeed configuration file.
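As an illustration of the candidate-sampling step handled by `sampling.py` and `vllm_model.py`, the snippet below shows how candidates could be drawn with vLLM. The model name, prompt, and sampling parameters are placeholders, not the repository's actual settings.

```python
# Minimal sketch of candidate sampling with vLLM; model name, prompt, and
# sampling parameters are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # assumed example policy
sampling_params = SamplingParams(n=4, temperature=0.8, top_p=0.95, max_tokens=512)

prompts = ["Summarize the key findings of the attached report."]
outputs = llm.generate(prompts, sampling_params)

# Each prompt yields n candidate completions, which are later scored by the LLM judge.
for request_output in outputs:
    candidates = [completion.text for completion in request_output.outputs]
    print(f"Sampled {len(candidates)} candidates for prompt: {request_output.prompt!r}")
```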