We have witnessed the powerful capabilities of pure RL-based LLM reasoning. In this repository, we collect the newest papers, slides, and other interesting materials that enhance LLM reasoning with reinforcement learning, to help everyone learn quickly!
Star this repository to stay at the forefront of RL-based LLM reasoning.
在风口浪尖 (In the teeth of the storm)
- Why do we need reasoning?
- Why do we use reinforcement learning to get reasoning ability? (What are the advantages compared to reasoning methods that do not use reinforcement learning?)
- [2502] Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning (Shanghai AI Lab)
- [2502] Demystifying Long Chain-of-Thought Reasoning in LLMs (Introduced cosine length-scaling reward with repetition penalty for stable CoT length growth) (IN.AI)
- [2501] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (HKU, Berkeley)
- [2501] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek)
- [2501] Kimi k1.5: Scaling Reinforcement Learning with LLMs (Kimi)
- [2502] S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (Tencent)
- [2502] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (THU)
- [2502] QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (UCLA-Yizhou Sun)
- [2312] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (PKU & Deepseek)
- [2305] Let's verify step by step (OpenAI)
- [2211] Solving math word problems with process- and outcome-based feedback (DeepMind)
- [2503] SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (agent & reasoning)
- [2502] Reasoning with Reinforced Functional Token Tuning
- [2503] DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models (shortens thinking length via RL)
- [2503] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (CMU)
- [2502] Provably Optimal Distributional RL for LLM Post-Training (Cornell, Harvard)
- [2502] On the Emergence of Thinking in LLMs I: Searching for the Right Intuition (Reinforcement Learning via Self-Play) (MIT)
- [2502] STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving (the scarcity of correct proofs, i.e. sparse rewards, makes performance plateau quickly; to overcome this, the authors draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (often variants of known results) and attempting to solve them) (Stanford-Tengyu Ma)
- [2409] Training Language Models to Self-Correct via Reinforcement Learning (DeepMind)
- [2502] Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls (Tencent)
- [2408] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (DeepSeek)
- [2310] Solving olympiad geometry without human demonstrations (DeepMind)
- [2412] Formal Mathematical Reasoning: A New Frontier in AI
- [2503] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
- [2503] Interpreting the Repeated Token Phenomenon in Large Language Models (DeepMind)
- [2503] Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models (Emcie Co Ltd)
- [2501] Reasoning Language Models: A Blueprint
- [2502] From System 1 to System 2: A Survey of Reasoning Large Language Models
- [2502] When More is Less: Understanding Chain-of-Thought Length in LLMs (I think this is also about overthinking) (PKU, MIT)
- [2502] Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning (Meta-Yuandong Tian)
- [2502] CoT-Valve: Length-Compressible Chain-of-Thought Tuning (overthinking) (NUS)
- [2502] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks (I think overthinking is a practical problem, interesting!) (Berkeley)
- [2502] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (Princeton)
- [2502] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (current approaches to improving LM capabilities rely heavily on increasing model size or specialized prompting) (Max Planck)
- [2502] LIMO: Less is More for Reasoning (LIMO offers a more principled and direct path to complex reasoning ability through explicit trajectory design) (SJTU)
- [2502] Confidence Improves Self-Consistency in LLMs (the quality of LLM outputs) (Google Research)
- [2502] LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! (UC Berkeley)
- [2502] BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (Salesforce AI Research)
- [2502] LLMs Can Teach Themselves to Better Predict the Future (self-play to generate data) (LSE)
- [2501] s1: Simple test-time scaling (Stanford)
- [2412] Efficiently Serving LLM Reasoning Programs with Certaindex (UCSD) (overthinking, probe in the middle)
- [2412] Training Large Language Model to Reason in a Continuous Latent Space (Meta-Yuandong Tian)
- [2412] Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective
- [2408] Visual Agents as Fast and Slow Thinkers
- Self-improvement of LLM agents through Reinforcement Learning at Scale
- A Visual Guide to Reasoning LLMs
- Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models
- What is the difference between a large reasoning model and an LLM? (Zhihu)
- LLM Reasoning: Key Ideas and Limitations Denny Zhou-DeepMind (Video)
- Towards Reasoning in Large Language Models Jie Huang-UIUC
- Can LLMs Reason & Plan? Subbarao Kambhampati-ASU
- Inference-Time Techniques for LLM Reasoning Xinyun Chen-DeepMind
- Chain-of-Thought Reasoning In Language Models Zhuosheng Zhang-SJTU
- Learning to Self-Improve & Reason with LLMs Jason Weston-Meta & NYU
- Why did no one try abandoning fine-tuning-based alignment and training a chain-of-thought reasoning model purely via reinforcement learning before DeepSeek-R1-Zero appeared? (Zhihu)
- Kimi, by Flood Sung (Zhihu)
- An overview of the DeepSeek article series (Zhihu)
- ChatGPT and The Art of Post-Training Stanford-25/02/18
- [LLM+RL] R1 paper walkthrough: SFT vs. RL, RL fundamentals and GRPO details, plus a discussion of a series of reproduction works
- [LLM+RL] Understanding the GRPO formula and the TRL GRPOTrainer implementation (advantage and loss computation)
- LLM-Based Reasoning: Opportunities and Pitfalls (LAVA Workshop in ACCV 2024)
- Reinforcement Learning in DeepSeek r1 Visualized (Chinese)
- EZ撸paper: DeepSeek-R1 paper explained, part 3: the history of GPT | scaling law | training paradigms | emergent ability
- EZ撸paper: DeepSeek-R1 paper explained, part 2: what is AGI? | a quick introduction to reinforcement learning | an introduction to AlphaGo
- EZ撸paper: DeepSeek-R1 paper explained, part 1: how it matches OpenAI-o1
- [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 Explained to your grandma
TinyZero (4×4090 is enough for a 0.5B LLM, but the aha moment is not observed)
Open-r1
Logic-RL
Unsloth-GRPO (simplest r1 implementation)
OpenR (An Open Source Framework for Advanced Reasoning)
- DeepSeek-RL-Qwen-0.5B-GRPO-gsm8k
- deepseek_r1_train
At its core, reinforcement learning is about how an agent chooses its next action in an environment so as to maximize return; the environment's role is to provide states and rewards.
- Q-learning (value-based method): a random number is drawn and compared against the exploration threshold epsilon (epsilon-greedy); if it falls below the threshold, a random action is selected, otherwise the best action according to the Q-table is chosen. Regardless of which action is taken, the Q-table needs to be updated: after every action, the entry for the previous state is updated toward the maximum expected return (see the Q-learning sketch below).
- REINFORCE (policy-based method): it is like playing Mario, where every action in a given playthrough is sampled from a policy network. After the game ends, we have the reward at each state and can compute the cumulative return G for each step. Using these computed returns, we calculate the loss and update the parameters of the policy network (see the REINFORCE sketch below).
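As a concrete illustration of the value-based description above, here is a minimal tabular Q-learning sketch. The `env` interface (reset/step/actions), the hyperparameters, and all names are assumptions for a toy example, not code from any repository listed here.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, epsilon=0.1, alpha=0.1, gamma=0.99):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise act greedily
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # update the previous state's entry toward reward + discounted best future value
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```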
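And a minimal REINFORCE sketch in PyTorch matching the policy-based description above; again, the `env` and `policy` interfaces and the hyperparameters are illustrative assumptions.

```python
import torch

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        # every action in the playthrough is sampled from the policy network
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
    # compute the cumulative return G_t for every step, from the end of the episode backwards
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # policy-gradient loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```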
- [2501] (REINFORCE++) A Simple and Efficient Approach for Aligning Large Language Models (per the OpenRLHF report, REINFORCE++ is more stable in training than GRPO and faster than PPO) (citations: 6)
- [2407] A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more (citations: 10)
- [2405] (SimPO) Simple Preference Optimization with a Reference-Free Reward (citations: 227)
- [2402] (KTO) Model Alignment as Prospect Theoretic Optimization (citations: 326)
- [2402] (GRPO) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (group-relative advantage; see the sketch after this list) (citations: 250)
- [2305] (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model (citations: 2580)
- [2203] (InstructGPT/PPO+LLM) Training language models to follow instructions with human feedback (citations: 12443)
- [1707] (PPO) Proximal Policy Optimization Algorithms (citations: 23934)
- [1706] (RLHF) Deep Reinforcement Learning from Human Preferences (citations: 3571)
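Since GRPO appears throughout this list, here is a minimal sketch of its group-relative (outcome) advantage: each sampled completion's reward is normalized by the mean and standard deviation of its group. The function name and reward values are illustrative assumptions; the full GRPO objective (clipped probability ratio, KL penalty) is omitted.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar reward per sampled completion of a prompt."""
    # each completion's advantage is its reward normalized against the group statistics
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions for one prompt, rewarded 1 for a correct final answer, 0 otherwise
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # correct completions receive positive advantage
```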
- Compshare (after registration you get a 50-yuan credit, enough to run R1 with unsloth)
- awesome-llm-reasoning-long2short-papers
- Awesome-Long2short-on-LRMs
- Awesome-Efficient-CoT-Reasoning-Summary
- Awesome RL-based Reasoning MLLMs
- DecryptPrompt (very comprehensive)
- Feel free to contribute more papers or any other resources!