Official code for the paper "RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation".
Authors: Kaiqu Liang, Haimin Hu, Ryan Liu, Tom Griffiths, Jaime Fernández Fisac.
While Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning generative AI, we present empirical evidence that it can also cause severe, systematic misalignment. We hypothesize that this stems from evaluator feedback depending on downstream outcome predictions (foresight) that can be influenced by the AI's output, inducing Goodhart’s law dynamics. We present a theoretical analysis showing that conditioning evaluator feedback on downstream observations (hindsight) inhibits this effect by decoupling the alignment signal from potentially compromised predictions---crucially, the result holds even if the observed outcomes are sampled from the AI's own world model. Building on this insight, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which presents plausible simulated outcomes to evaluators before eliciting feedback. We validate RLHS across three consultancy settings---marketplace interactions, restaurant recommendations, and online course advising---using both online (PPO) and offline (DPO) fine-tuning methods, and show that it substantially improves alignment over RLHF in experiments and human evaluations. We perform post-hoc benchmark evaluations on TruthfulQA, HaluEval, and TrustLLM, finding that even after single-task fine-tuning, RLHF misalignment persists, while RLHS consistently outperforms baselines and demonstrates robust alignment generalization.
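At a high level, RLHS changes when the evaluator gives feedback: instead of rating a response immediately based on predicted consequences (foresight), the evaluator first sees a simulated downstream outcome of acting on the response (hindsight). The sketch below illustrates this loop in illustrative Python; every name in it (generate_response, simulate_outcome, elicit_rating, PreferenceExample) is a hypothetical placeholder and not part of this repository's code.

```python
# Illustrative sketch of hindsight-simulated feedback (RLHS) vs. immediate
# foresight feedback (standard RLHF). All names are hypothetical placeholders.
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PreferenceExample:
    prompt: str
    response: str
    rating: float


def generate_response(prompt: str) -> str:
    """Placeholder for the AI policy's answer to a user query."""
    return f"recommendation for: {prompt}"


def simulate_outcome(prompt: str, response: str) -> str:
    """Placeholder for rolling out a plausible downstream outcome of acting on
    the response (in the paper, sampled from a simulator or the model's own
    world model)."""
    return f"what happened after following '{response}'"


def elicit_rating(prompt: str, response: str, outcome: Optional[str]) -> float:
    """Placeholder for the (simulated) evaluator's rating.

    outcome is None  -> foresight feedback: the evaluator must predict the
                        consequences from the response alone (RLHF).
    outcome is given -> hindsight feedback: the evaluator rates the response
                        after observing the simulated outcome (RLHS).
    """
    return random.uniform(0.0, 1.0)


def collect_rlhs_feedback(prompts: List[str]) -> List[PreferenceExample]:
    """Collect hindsight-conditioned ratings that can later be turned into
    preference data for PPO or DPO fine-tuning."""
    examples = []
    for prompt in prompts:
        response = generate_response(prompt)
        outcome = simulate_outcome(prompt, response)       # hindsight simulation
        rating = elicit_rating(prompt, response, outcome)  # feedback after the outcome
        examples.append(PreferenceExample(prompt, response, rating))
    return examples


if __name__ == "__main__":
    for example in collect_rlhs_feedback(["Which laptop should I buy?"]):
        print(example)
```

The resulting hindsight-conditioned ratings can then be converted into preference data for PPO or DPO fine-tuning, as in the pipeline below.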
pip install -r requirements.txt
# Generate feedback data with the AI (policy) models and a simulated human evaluator
python main.py --ai_model llama-2-7b llama-3-8b --human_model llama-3.1-70b --index 1
# Combine the generated data into a single training dataset
python comb_data.py
cd finetune
# Fine-tune LLaMA-2-7B with LoRA-based DPO via LLaMA-Factory
llamafactory-cli train rlhf_examples/train_lora/llama2_7b/base/dpo/llama2_lora_dpo_bs_8.yaml
# Merge the trained LoRA adapter (checkpoint-3000) into the base model
llamafactory-cli export rlhf_examples/merge_lora/llama-2-7b/base/dpo/checkpoint-3000.yaml
mkdir -p test_data/llama2_7b/base/dpo_bs_8
# Run test-time inference with the fine-tuned checkpoint on the marketplace test set
python main.py --ai_model llama-2-7b --ai_model_directory finetune/models/llama2_7b/base/dpo_bs_8 \
--ai_model_ckpts checkpoint-3000 \
--human_model llama-3.1-70b --test_data test_data/test_marketplace.json \
--index 1 --task test_inference --rlhf_type base --output test_data/llama2_7b/base/dpo_bs_8
If you find this code useful for your research, please consider citing our paper:
@article{liang2025rlhs,
  title={{RLHS}: Mitigating Misalignment in {RLHF} with Hindsight Simulation},
  author={Liang, Kaiqu and Hu, Haimin and Liu, Ryan and Griffiths, Thomas L and Fisac, Jaime Fern{\'a}ndez},
  journal={arXiv preprint arXiv:2501.08617},
  year={2025}
}
We fine-tuned our models using LLaMA-Factory. We thank its authors for making their work open source. The fine-tuning stage can also be readily adapted to other training frameworks.
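As one example of such an adaptation, the sketch below trains on the same kind of pairwise preference data with TRL's DPOTrainer instead of LLaMA-Factory. It is an unofficial illustration: the data path and the "prompt"/"chosen"/"rejected" field names are assumptions about the combined dataset, the hyperparameters are arbitrary, and the exact DPOTrainer keyword arguments vary across TRL versions.

```python
# Unofficial sketch: DPO fine-tuning on the preference data with TRL.
# Paths, dataset fields, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base_model = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(base_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Assumes the combined data is JSON with "prompt", "chosen", and "rejected"
# fields, the pairwise format DPOTrainer expects.
train_dataset = load_dataset("json", data_files="data/dpo_train.json", split="train")

config = DPOConfig(
    output_dir="models/llama2_7b_rlhs_dpo",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=train_dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```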