Policy Optimization in RLHF: The Impact of Out-of-preference Data

This repository contains the code to reproduce the experiments in the paper Policy Optimization in RLHF: The Impact of Out-of-preference Data.

The experiments show that policy optimization with out-of-preference data is key to unlocking the reward model's generalization power.
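For intuition, here is a minimal, self-contained sketch of the kind of bandit setup the paper studies; it is not the repo's code, and all names (theta_star, sample_states, the dimensions) are illustrative assumptions. A linear reward model is fit on pairwise comparisons via a Bradley-Terry loss, and the policy is then optimized against that learned reward on fresh states that carry no preference labels, i.e., out-of-preference data.

import numpy as np

rng = np.random.default_rng(0)
d, n_actions, n_pref, n_fresh = 8, 10, 300, 2000

# Hypothetical ground-truth linear reward r*(a) = <theta_star, phi(a)>.
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

def sample_states(n):
    """Each 'state' offers n_actions candidate actions with random features."""
    return rng.normal(size=(n, n_actions, d))

# 1. Preference data: two random actions per state, labelled by Bradley-Terry.
pref_states = sample_states(n_pref)
idx = np.arange(n_pref)
i, j = rng.integers(n_actions, size=(2, n_pref))
diff = pref_states[idx, i] - pref_states[idx, j]
labels = (rng.random(n_pref) < 1 / (1 + np.exp(-diff @ theta_star))).astype(float)

# 2. Reward learning: logistic regression on the pairs, by gradient descent.
theta = np.zeros(d)
for _ in range(5000):
    p = 1 / (1 + np.exp(-diff @ theta))
    theta -= 0.1 * diff.T @ (p - labels) / n_pref

# 3. Policy optimization on out-of-preference states: the learned reward model,
#    not the preference labels, scores actions at states never seen in step 1.
fresh = sample_states(n_fresh)
greedy = (fresh @ theta).argmax(axis=1)
true_r = fresh @ theta_star
print("greedy w.r.t. learned reward:", true_r[idx := np.arange(n_fresh), greedy].mean())
print("random policy baseline      :", true_r.mean())
print("oracle (true reward)        :", true_r.max(axis=1).mean())

Under these assumptions the greedy policy lands close to the oracle even on states the preference data never covered, which is the generalization effect the experiments quantify at scale.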

How to use

Prepare

The Python environment can be set up using Anaconda with the provided environment.yml file.

conda env create -f environment.yml
conda activate bandit

Linear Bandit

bash scripts/run_linear_bandit.sh

Neural Bandit

bash scripts/run_neural_bandit.sh

Bibtex

If you find this code helpful, please cite our paper as follows.

@article{li2023policy,
  title     = {Policy Optimization in RLHF: The Impact of Out-of-preference Data},
  author    = {Li, Ziniu and Xu, Tian and Yu, Yang},
  journal   = {arXiv preprint arXiv:2312.10584},
  year      = {2023},
}
