We have witnessed the powerful capabilities of pure RL-based LLM reasoning. In this repository, we collect the newest papers, slides, and other interesting materials that enhance LLM reasoning with reinforcement learning, to help everyone learn quickly!
Star this repository to stay at the forefront of RL-based LLM reasoning.
在风口浪尖 (In the teeth of the storm)
- Why do we need reasoning?
- Why do we use reinforcement learning to get reasoning ability? (What are the advantages compared to reasoning methods that do not use reinforcement learning?)
- [2502] Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning (Shanghai AI Lab)
- [2502] Demystifying Long Chain-of-Thought Reasoning in LLMs (Introduced cosine length-scaling reward with repetition penalty for stable CoT length growth) (IN.AI)
- [2501] SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training (HKU, Berkeley)
- [2501] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek)
- [2501] Kimi k1.5: Scaling Reinforcement Learning with LLMs (Kimi)
- [2502] S²R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning (Tencent)
- [2502] Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling (THU)
- [2502] QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (UCLA-Yizhou Sun)
- [2312] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations (PKU & Deepseek)
- [2305] Let's verify step by step (OpenAI)
- [2211] Solving math word problems with process- and outcome-based feedback (DeepMind)
- [2503] SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks (agent & reasoning)
- [2502] Reasoning with Reinforced Functional Token Tuning
- [2503] DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models (shortens thinking length via RL)
- [2503] L1: Controlling How Long A Reasoning Model Thinks With Reinforcement Learning (CMU)
- [2502] Provably Optimal Distributional RL for LLM Post-Training (Cornell, Harvard)
- [2502] On the Emergence of Thinking in LLMs I: Searching for the Right Intuition (Reinforcement Learning via Self-Play) (MIT)
- [2502] STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving (the scarcity of correct proofs, i.e. sparse rewards, makes performance plateau quickly; to overcome this, the authors draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (often variants of known results) and attempting to solve them) (Stanford-Tengyu Ma)
- [2409] Training Language Models to Self-Correct via Reinforcement Learning (DeepMind)
- [2502] Don’t Get Lost in the Trees: Streamlining LLM Reasoning by Overcoming Tree Search Exploration Pitfalls (Tencent)
- [2408] DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search (DeepSeek)
- [2310] Solving olympiad geometry without human demonstrations (DeepMind)
- [2412] Formal Mathematical Reasoning: A New Frontier in AI
- [2503] Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
- [2503] Interpreting the Repeated Token Phenomenon in Large Language Models (DeepMind)
- [2503] Attentive Reasoning Queries: A Systematic Method for Optimizing Instruction-Following in Large Language Models (Emcie Co Ltd)
- [2501] Reasoning Language Models: A Blueprint
- [2502] From System 1 to System 2: A Survey of Reasoning Large Language Models
- [2502] When More is Less: Understanding Chain-of-Thought Length in LLMs (I think this is also about overthinking) (PKU, MIT)
- [2502] Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning (Meta-Yuandong Tian)
- [2502] CoT-Valve: Length-Compressible Chain-of-Thought Tuning (overthinking) (NUS)
- [2502] The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks (I think overthinking is a practical problem, interesting!) (Berkeley)
- [2502] ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates (Princeton)
- [2502] Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach (current approaches to improving LM capabilities rely heavily on increasing model size or specialized prompting) (Max Planck)
- [2502] LIMO: Less is More for Reasoning (LIMO offers a more principled and direct path to complex reasoning ability through explicit trajectory design) (SJTU)
- [2502] Confidence Improves Self-Consistency in LLMs (the quality of LLM outputs) (Google Research)
- [2502] LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! (UC Berkeley)
- [2502] BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation (Salesforce AI Research)
- [2502] LLMs Can Teach Themselves to Better Predict the Future (self-play to generate data) (LSE)
- [2501] s1: Simple test-time scaling (Stanford)
- [2412] Efficiently Serving LLM Reasoning Programs with Certaindex (UCSD) (overthinking, probe in the middle)
- [2412] Training Large Language Model to Reason in a Continuous Latent Space (Meta-Yuandong Tian)
- [2412] Scaling of search and learning: A roadmap to reproduce o1 from reinforcement learning perspective
- [2408] Visual Agents as Fast and Slow Thinkers
- Self-improvement of LLM agents through Reinforcement Learning at Scale
- A Visual Guide to Reasoning LLMs
- Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models
- What is the difference between a large reasoning model and an LLM? (Zhihu)
- LLM Reasoning: Key Ideas and Limitations Denny Zhou-DeepMind (Video)
- Towards Reasoning in Large Language Models Jie Huang-UIUC
- Can LLMs Reason & Plan? Subbarao Kambhampati-ASU
- Inference-Time Techniques for LLM Reasoning Xinyun Chen-DeepMind
- Chain-of-Thought Reasoning In Language Models Zhuosheng Zhang-SJTU
- Learning to Self-Improve & Reason with LLMs Jason Weston-Meta & NYU
- Why did no one try abandoning fine-tuning-based alignment and training a chain-of-thought reasoning model purely via reinforcement learning before DeepSeek-R1-Zero appeared? (Zhihu)
- Kimi, by Flood Sung (Zhihu)
- An overview of the DeepSeek article series (Zhihu)
- ChatGPT and The Art of Post-Training Stanford-25/02/18
- [LLM+RL] R1 paper walkthrough: SFT vs. RL, RL fundamentals and GRPO details, plus a discussion of a series of reproduction works
- [LLM+RL] Understanding the GRPO formula and the TRL GRPOTrainer implementation (advantage and loss computation)
- LLM-Based Reasoning: Opportunities and Pitfalls (LAVA Workshop in ACCV 2024)
- Reinforcement Learning in DeepSeek r1 Visualized (Chinese)
- EZ撸paper: DeepSeek-R1 paper explained, part 3: the history of GPT | scaling law | training paradigms | emergent ability
- EZ撸paper: DeepSeek-R1 paper explained, part 2: what is AGI? | a quick introduction to reinforcement learning | an introduction to AlphaGo
- EZ撸paper: DeepSeek-R1 paper explained, part 1: how it matches OpenAI-o1
- [GRPO Explained] DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
- DeepSeek R1 Explained to your grandma
TinyZero (4×4090 is enough for a 0.5B LLM, but the aha moment is not observed)
Open-r1
Logic-RL
Unsloth-GRPO (simplest r1 implementation)
OpenR (An Open Source Framework for Advanced Reasoning)
- DeepSeek-RL-Qwen-0.5B-GRPO-gsm8k
- deepseek_r1_train
At its core, reinforcement learning is about how an agent chooses its next action in an environment so as to maximize return; the environment's role is to provide states and rewards.
- Q-learning (value-based method): a random number is drawn and compared against the exploration threshold epsilon (epsilon-greedy); if it falls below the threshold, a random action is selected, otherwise the best action according to the Q-table is chosen. Regardless of which action is taken, the Q-table needs to be updated: after every action, the entry for the previous state is updated toward the maximum expected return (see the Q-learning sketch below).
- REINFORCE (policy-based method): it is like playing Mario, where every action in a given playthrough is sampled from a policy network. After the game ends, we have the reward at each state and can compute the cumulative return G for each step. Using these computed returns, we calculate the loss and update the parameters of the policy network (see the REINFORCE sketch below).
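As a concrete illustration of the value-based description above, here is a minimal tabular Q-learning sketch. The `env` interface (reset/step/actions), the hyperparameters, and all names are assumptions for a toy example, not code from any repository listed here.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, epsilon=0.1, alpha=0.1, gamma=0.99):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise act greedily
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # update the previous state's entry toward reward + discounted best future value
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```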
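And a minimal REINFORCE sketch in PyTorch matching the policy-based description above; again, the `env` and `policy` interfaces and the hyperparameters are illustrative assumptions.

```python
import torch

def reinforce_episode(env, policy, optimizer, gamma=0.99):
    log_probs, rewards = [], []
    state, done = env.reset(), False
    while not done:
        # every action in the playthrough is sampled from the policy network
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, done = env.step(action.item())
        rewards.append(reward)
    # compute the cumulative return G_t for every step, from the end of the episode backwards
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # policy-gradient loss: -sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```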
- [2501] (REINFORCE++) A Simple and Efficient Approach for Aligning Large Language Models (per the OpenRLHF report, REINFORCE++ is more stable in training than GRPO and faster than PPO) (citations: 6)
- [2407] A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more (citations: 10)
- [2405] (SimPO) Simple Preference Optimization with a Reference-Free Reward (citations: 227)
- [2402] (KTO) Model Alignment as Prospect Theoretic Optimization (citations: 326)
- [2402] (GRPO) DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (group-relative advantage; see the sketch after this list) (citations: 250)
- [2305] (DPO) Direct Preference Optimization: Your Language Model is Secretly a Reward Model (citations: 2580)
- [2203] (InstructGPT/PPO+LLM) Training language models to follow instructions with human feedback (citations: 12443)
- [1707] (PPO) Proximal Policy Optimization Algorithms (citations: 23934)
- [1706] (RLHF) Deep Reinforcement Learning from Human Preferences (citations: 3571)
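Since GRPO appears throughout this list, here is a minimal sketch of its group-relative (outcome) advantage: each sampled completion's reward is normalized by the mean and standard deviation of its group. The function name and reward values are illustrative assumptions; the full GRPO objective (clipped probability ratio, KL penalty) is omitted.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar reward per sampled completion of a prompt."""
    # each completion's advantage is its reward normalized against the group statistics
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 completions for one prompt, rewarded 1 for a correct final answer, 0 otherwise
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # correct completions receive positive advantage
```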
- Compshare (after registration you get a 50-yuan credit, enough to run R1 with unsloth)
- awesome-llm-reasoning-long2short-papers
- Awesome-Long2short-on-LRMs
- Awesome-Efficient-CoT-Reasoning-Summary
- Awesome RL-based Reasoning MLLMs
- DecryptPrompt (very comprehensive)
- Feel free to contribute more papers or any other resources!