Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs https://arxiv.org/pdf/2402.14740
RLHF Workflow: From Reward Modeling to Online RLHF https://arxiv.org/pdf/2405.07863
The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization https://arxiv.org/pdf/2403.17031
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs https://arxiv.org/pdf/2404.08555
From r to Q∗: Your Language Model is Secretly a Q-Function https://arxiv.org/pdf/2404.12358
Alignment Guidebook https://www.notion.so/Alignment-Guidebook-7e3b4d925bd5431baab9b5490b84269a