
RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

Official code for the paper "RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation".

Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Tom Griffiths, Jaime Fernández Fisac.

Abstract

While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI's output, inducing Goodhart’s law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions---crucially, the result holds even if the observed outcomes are sampled from the AI's own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings---marketplace interactions, restaurant recommendations, and online course advising---using both online (PPO) and offline (DPO) fine-tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post-hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single-task fine-tuning, RLHF misalignment persists, while RLHS consistently outperforms baselines and demonstrates robust alignment generalization.
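
In a nutshell: RLHF elicits feedback on the AI's output alone, so the evaluator must predict downstream consequences (foresight), while RLHS first simulates a plausible outcome and conditions the feedback on it (hindsight). The sketch below is purely conceptual pseudocode, not code from this repository; every name in it is an illustrative placeholder.

# Conceptual sketch of foresight (RLHF) vs. hindsight (RLHS) feedback.
# NOT repository code; all names are illustrative placeholders.
from typing import Callable

def rlhf_feedback(prompt: str, ai_output: str,
                  rate: Callable[[str, str], float]) -> float:
    # Standard RLHF: the evaluator rates the output right away, implicitly
    # predicting its downstream consequences (foresight), which the AI's
    # own output can bias.
    return rate(prompt, ai_output)

def rlhs_feedback(prompt: str, ai_output: str,
                  simulate: Callable[[str, str], str],
                  rate_with_outcome: Callable[[str, str, str], float]) -> float:
    # RLHS: first simulate a plausible downstream outcome of acting on the
    # AI's output (possibly using the AI's own world model), then elicit
    # feedback conditioned on that observed outcome (hindsight).
    outcome = simulate(prompt, ai_output)
    return rate_with_outcome(prompt, ai_output, outcome)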

Setup

pip install -r requirements.txt
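
Installing inside a virtual environment is optional and not required by the repository, but it keeps the dependencies isolated; a standard setup would be:

python -m venv .venv            # optional: create an isolated environment
source .venv/bin/activate
pip install -r requirements.txt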

Preference data generation

python main.py --ai_model llama-2-7b llama-3-8b --human_model llama-3.1-70b --index 1
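
To produce several data shards, the same command can be looped over --index values. This assumes --index selects which data-generation run or shard to produce; check the argument parser in main.py for its exact meaning:

# Assumption: --index selects the data-generation run/shard.
for i in 1 2 3; do
    python main.py --ai_model llama-2-7b llama-3-8b --human_model llama-3.1-70b --index "$i"
done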

Combine and convert the data (Optional)

python comb_data.py
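
A quick sanity check of the combined preference data can be done from Python. The file path below is a placeholder, and we assume the output is a JSON list of pairwise preference records (check comb_data.py for the actual path and schema):

import json

# Placeholder path: replace with the file comb_data.py actually writes.
with open("data/preference_data.json") as f:
    data = json.load(f)

print(f"{len(data)} preference pairs loaded")
print(json.dumps(data[0], indent=2))  # inspect the first example's fields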

Finetune (RLHF, PHS, and FHS)

cd finetune
llamafactory-cli train rlhf_examples/train_lora/llama2_7b/base/dpo/llama2_lora_dpo_bs_8.yaml
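
The config above trains the DPO baseline (base) for Llama-2-7B with batch size 8. Configs for the other variants named in this section (PHS, FHS) and for other models are expected to live alongside it; listing the config tree shows what is available (an assumption about the repository layout, run from inside finetune/):

# List all available fine-tuning configs (per variant, model, and batch size).
find rlhf_examples -name "*.yaml" | sort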

Merge

llamafactory-cli export rlhf_examples/merge_lora/llama-2-7b/base/dpo/checkpoint-3000.yaml
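
The merged weights are written to the export_dir set inside the merge YAML. The inference commands below reference paths such as finetune/models/llama2_7b/base/dpo_bs_8 and test_data/, so they are presumably run from the repository root; if you are still inside finetune/, return to the root first (this, and the exact export_dir, are assumptions worth checking against the config):

cd ..
# The merged checkpoint should be visible here if export_dir matches the
# directory used by the inference commands below (an assumption).
ls finetune/models/llama2_7b/base/dpo_bs_8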

Inference

mkdir -p test_data/llama2_7b/base/dpo_bs_8

python main.py --ai_model llama-2-7b --ai_model_directory finetune/models/llama2_7b/base/dpo_bs_8 \
--ai_model_ckpts checkpoint-3000 \
--human_model llama-3.1-70b --test_data test_data/test_marketplace.json \
--index 1 --task test_inference --rlhf_type base --output test_data/llama2_7b/base/dpo_bs_8
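
To evaluate several checkpoints from the same run, loop over --ai_model_ckpts; every checkpoint name below except checkpoint-3000 is hypothetical and should be replaced with the ones your training run actually saved:

# Checkpoint names other than checkpoint-3000 are hypothetical examples.
for ckpt in checkpoint-1000 checkpoint-2000 checkpoint-3000; do
    python main.py --ai_model llama-2-7b --ai_model_directory finetune/models/llama2_7b/base/dpo_bs_8 \
    --ai_model_ckpts "$ckpt" \
    --human_model llama-3.1-70b --test_data test_data/test_marketplace.json \
    --index 1 --task test_inference --rlhf_type base --output test_data/llama2_7b/base/dpo_bs_8
done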

Citation

If you find this code useful for your research, please consider citing our paper:

@article{liang2025rlhs,
  title={{RLHS}: Mitigating Misalignment in {RLHF} with Hindsight Simulation},
  author={Liang, Kaiqu and Hu, Haimin and Liu, Ryan and Griffiths, Thomas L and Fisac, Jaime Fern{\'a}ndez},
  journal={arXiv preprint arXiv:2501.08617},
  year={2025}
}

Acknowledgements

We fine-tuned our models using LLaMA-Factory, and we thank its authors for making their work open source. The fine-tuning step can also be readily reproduced with other training frameworks.
