Logic-RL-Lite is a lightweight replication study of the DeepSeek-R1-Zero framework. This project investigates whether pure reinforcement learning (RL), without supervised fine-tuning (SFT), can post-train base models to acquire reasoning capabilities. It is a follow-up to the Logic-RL project.
It leverages the following key components:
- RL Framework: verl
- RL Algorithms: REINFORCE++ and GRPO
- RL Dataset: Knights and Knaves (K&K) Logic Puzzle Dataset
- Base Models: Qwen2.5 (3B), Llama3.2 (3B)
Knights and Knaves (K&K) Logic Puzzle: imagine there are two types of people, Knights and Knaves. Knights always tell the truth; Knaves always lie.
The K&K dataset is designed to test logical reasoning capabilities by presenting puzzles involving statements made by multiple "people," where the goal is to determine who is a knight and who is a knave based on the given clues.
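As a concrete illustration, a small K&K instance can be solved by brute force over all knight/knave assignments. The puzzle below (names and statements) is a hypothetical example, not drawn from the dataset; the sketch only shows the kind of logic the model must learn.

```python
from itertools import product

# Hypothetical two-person K&K instance (not from the actual dataset):
#   Alice says: "Bob is a knave."
#   Bob says:   "Alice and I are the same type."
# A knight's statement must be true; a knave's statement must be false.

def statement_alice(alice_is_knight, bob_is_knight):
    return not bob_is_knight                  # "Bob is a knave"

def statement_bob(alice_is_knight, bob_is_knight):
    return alice_is_knight == bob_is_knight   # "We are the same type"

solutions = []
for alice, bob in product([True, False], repeat=2):
    # Each speaker's statement truth value must match their type.
    if statement_alice(alice, bob) == alice and statement_bob(alice, bob) == bob:
        solutions.append((alice, bob))

for alice, bob in solutions:
    print(f"Alice is a {'knight' if alice else 'knave'}, "
          f"Bob is a {'knight' if bob else 'knave'}")
```

Running this prints the unique consistent assignment (Alice is a knight, Bob is a knave).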
- Format Reward: Yes
- Answer Reward: Yes
- Language Consistency or Other Rewards: No
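The reward code itself is not reproduced here; the following is a minimal sketch of a rule-based format-plus-answer reward, assuming responses wrap reasoning in `<think></think>` and the final assignment in `<answer></answer>`. The function name, reward values, and `gold_answer` format are illustrative, not the repository's actual implementation.

```python
import re

def compute_reward(response: str, gold_answer: dict) -> float:
    """Illustrative rule-based reward: format term + answer term.

    `gold_answer` is assumed to map each name to True (knight) / False (knave),
    e.g. {"Alice": True, "Bob": False}. Reward values are illustrative.
    """
    # Format reward: exactly one <think>...</think> block followed by one
    # <answer>...</answer> block at the end of the response.
    pattern = r"<think>.*?</think>\s*<answer>(.*?)</answer>\s*$"
    match = re.search(pattern, response, flags=re.DOTALL)
    if match is None:
        return -1.0  # malformed output

    answer_text = match.group(1).lower()
    # Answer reward: every person must be labeled with the correct role.
    for name, is_knight in gold_answer.items():
        role = "knight" if is_knight else "knave"
        if not re.search(rf"{name.lower()}\s+is\s+a\s+{role}", answer_text):
            return -0.5  # well-formatted but wrong or incomplete answer
    return 1.0  # correct format and correct assignment
```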
After configuring your WandB, GPUs, and other settings, launch training with:

```bash
bash run_rl_trainer_xxx.sh
```
For visualizations and further details, refer to my WandB report:
Logic-RL-Lite Training Report
Note: the findings may be specific to these experimental setups.
- The relationship between model scale and multi-step reasoning capability is discussed in this paper.
- 1.5B Models and Smaller
- Instruction-tuned or pretrained models cannot learn reasoning.
- 3B Models
- Instruction-tuned models: capable of learning reasoning.
- Pretrained models: mixed results — Llama3.2-3B struggles, while Qwen2.5-3B succeeds.
- 7B Models and Larger
- Consistently learn reasoning.
- Cognitive differences between Qwen2.5-3B and Llama3.2-3B are discussed in this paper.
- Qwen2.5-3B demonstrates stronger instruction-following behavior compared to Llama3.2-3B.
- Llama3.2-3B suffers from repetition.
- Self-reflection and rethinking behaviors appear at epoch 0 (or even step 0) in instruction-tuned base models.
- These behaviors likely stem from instruction tuning rather than being emergent properties of pure RL.
- See findings from OAT-ZERO and Logic-RL.
Table: Appearance of Self-Reflection, Verification and Summarization Keywords During Training (Base Model = Qwen2.5-3B-Instruct)
Word | First Occurrence (epoch, step) | Instances Found | Percentage (%) |
---|---|---|---|
rethink | N/A | 0 | 0.00 |
re-think | N/A | 0 | 0.00 |
think again | N/A | 0 | 0.00 |
retry | N/A | 0 | 0.00 |
re-try | N/A | 0 | 0.00 |
try again | N/A | 0 | 0.00 |
recheck | (0, 1) | 9 | 0.04 |
re-check | N/A | 0 | 0.00 |
check again | (0, 1) | 3 | 0.01 |
reevaluate | (0, 5) | 3 | 0.01 |
re-evaluate | (0, 4) | 34 | 0.15 |
double check | N/A | 0 | 0.00 |
double-check | N/A | 0 | 0.00 |
verify | (0, 0) | 83 | 0.37 |
summarize | (0, 0) | 73 | 0.33 |
summary | (0, 1) | 251 | 1.13 |
aha | N/A | 0 | 0.00 |
wait | N/A | 0 | 0.00 |
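A table like the one above can be produced with simple substring matching over the sampled responses. The sketch below assumes responses are logged as `(epoch, step, text)` records and that the percentage is the share of responses containing each keyword; both are assumptions about the logging format, not the project's actual code.

```python
from collections import defaultdict

# Keywords from the table above.
KEYWORDS = [
    "rethink", "re-think", "think again", "retry", "re-try", "try again",
    "recheck", "re-check", "check again", "reevaluate", "re-evaluate",
    "double check", "double-check", "verify", "summarize", "summary",
    "aha", "wait",
]

def keyword_stats(records):
    """records: iterable of (epoch, step, response_text); format is assumed."""
    first_seen = {}           # keyword -> (epoch, step) of first occurrence
    hits = defaultdict(int)   # keyword -> number of responses containing it
    total = 0
    for epoch, step, text in records:
        total += 1
        lowered = text.lower()
        for kw in KEYWORDS:
            if kw in lowered:
                hits[kw] += 1
                first_seen.setdefault(kw, (epoch, step))
    for kw in KEYWORDS:
        pct = 100.0 * hits[kw] / max(total, 1)
        print(f"{kw:>12} | {first_seen.get(kw, 'N/A')} | {hits[kw]} | {pct:.2f}%")
```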
- Longer CoT does not consistently appear across different experiments.
- Longer CoT likely emerges only when the task is challenging; on easier tasks, the model may resort to memorization rather than genuine reasoning.
- Further experiments are required to validate this observation.
- While CoT becomes longer and the mean rewards increase, longer CoT does not correlate with higher accuracy (a simple check is sketched below).
- This aligns with superficial self-reflection findings from OAT-ZERO.
- Left Figure: Answer accuracy versus token count distribution.
- Right Figure: Regression analysis of accuracy against token count.
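One way to check the claim that longer CoT does not buy accuracy is to regress per-response correctness on token count. The sketch below is illustrative, uses only numpy, and assumes you already have token counts and 0/1 correctness flags from evaluation logs.

```python
import numpy as np

def accuracy_vs_length(token_counts, correct_flags):
    """Least-squares fit of correctness (0/1) on response length.

    token_counts: array-like of ints; correct_flags: array-like of 0/1.
    A slope near zero (or negative) indicates longer CoT does not
    translate into higher accuracy.
    """
    x = np.asarray(token_counts, dtype=float)
    y = np.asarray(correct_flags, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    corr = np.corrcoef(x, y)[0, 1]
    return slope, intercept, corr

# Example with dummy data (illustrative only):
# slope, intercept, corr = accuracy_vs_length([120, 340, 800], [1, 1, 0])
```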
- Within `<think></think>` tags: language mixing is more prevalent when the base model is instruction-tuned. This finding is counter-intuitive.
- Outside `<think></think>` or `<answer></answer>` tags: language mixing is more prevalent when the base model is only pretrained.
Category | Count | Percentage |
---|---|---|
Only English | 21636 | 96.73% |
Only Chinese | 0 | 0.00% |
Mixed (English & Chinese) | 511 | 2.28% |
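A breakdown like the table above can be computed by scanning each response for CJK characters, inside and outside the `<think></think>` block. The sketch below is illustrative; the exact classification rules behind the table are assumed, not taken from the repository.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")                 # CJK Unified Ideographs
THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def classify_language(response: str) -> str:
    """Classify a response as 'Only English', 'Only Chinese', or 'Mixed'."""
    has_chinese = bool(CJK.search(response))
    has_english = bool(re.search(r"[A-Za-z]", response))  # rough proxy
    if has_chinese and has_english:
        return "Mixed (English & Chinese)"
    if has_chinese:
        return "Only Chinese"
    return "Only English"

def chinese_inside_vs_outside_think(response: str):
    """Return (inside, outside) flags for Chinese characters relative to <think> tags."""
    inside = "".join(THINK.findall(response))
    outside = THINK.sub("", response)
    return bool(CJK.search(inside)), bool(CJK.search(outside))
```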
- REINFORCE++ demonstrates greater training stability than GRPO.
- Further experiments are required to validate this observation.
- For a technical comparison of REINFORCE++, GRPO, and PPO, see this report.
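At the advantage-estimation level, the key difference can be summarized as follows: GRPO normalizes outcome rewards within the group of responses sampled for the same prompt, while REINFORCE++ normalizes across the whole batch (its full form also adds a token-level KL penalty and PPO-style clipping). The sketch below is a simplified, outcome-reward-only illustration, not the verl implementation.

```python
import numpy as np

def grpo_advantages(rewards_per_prompt):
    """GRPO-style: normalize each response's reward within its own prompt group.

    rewards_per_prompt: list of 1-D arrays, one array per prompt
    (one scalar outcome reward per sampled response). Simplified sketch.
    """
    advantages = []
    for group in rewards_per_prompt:
        g = np.asarray(group, dtype=float)
        advantages.append((g - g.mean()) / (g.std() + 1e-8))
    return advantages

def reinforce_pp_advantages(rewards_per_prompt):
    """REINFORCE++-style (simplified): normalize over the whole batch instead.

    The full algorithm also uses a token-level KL penalty and PPO-style
    clipping; those are omitted here for brevity.
    """
    flat = np.concatenate([np.asarray(g, dtype=float) for g in rewards_per_prompt])
    mean, std = flat.mean(), flat.std() + 1e-8
    return [(np.asarray(g, dtype=float) - mean) / std for g in rewards_per_prompt]
```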
This project builds upon and references several open-source works:
- verl Framework: Reinforcement learning framework.
- Logic-RL: Reproduction of R1-Zero on logic puzzles.
- OAT-ZERO: Insights on reasoning with pure RL.
- TinyZero: Implementation of reward models and Countdown task.
- DeepScaler: Iterative context scaling with GRPO.
- Knights and Knaves (K&K) Puzzle Dataset: Logical reasoning tasks for LLMs.