Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs https://arxiv.org/pdf/2402.14740
RLHF Workflow: From Reward Modeling to Online RLHF https://arxiv.org/pdf/2405.07863
The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization https://arxiv.org/pdf/2403.17031
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs https://arxiv.org/pdf/2404.08555
From r to Q∗: Your Language Model is Secretly a Q-Function https://arxiv.org/pdf/2404.12358
Alignment Guidebook https://www.notion.so/Alignment-Guidebook-7e3b4d925bd5431baab9b5490b84269a