`docs/source/example_overview.md` (1 addition, 2 deletions)
@@ -56,8 +56,7 @@ Scripts can be used as examples of how to use TRL trainers. They are located in
|[`examples/scripts/ppo/ppo_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/ppo/ppo_tldr.py)| This script shows how to use the [`PPOTrainer`] to fine-tune a model to improve its ability to generate TL;DR summaries. |
|[`examples/scripts/prm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/prm.py)| This script shows how to use the [`PRMTrainer`] to fine-tune a Process-supervised Reward Model (PRM). |
|[`examples/scripts/reward_modeling.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/reward_modeling.py)| This script shows how to use the [`RewardTrainer`] to train an Outcome Reward Model (ORM) on your own dataset. |
- |[`examples/scripts/rloo/rloo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/rloo/rloo.py)| This script shows how to use the [`RLOOTrainer`] to fine-tune a model to improve its ability to continue text with positive sentiment or physically descriptive language. |
- |[`examples/scripts/rloo/rloo_tldr.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/rloo/rloo_tldr.py)| This script shows how to use the [`RLOOTrainer`] to fine-tune a model to improve its ability to generate TL;DR summaries. |
+ |[`examples/scripts/rloo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/rloo.py)| This script shows how to use the [`RLOOTrainer`] to fine-tune a model to improve its ability to solve math questions. |
|[`examples/scripts/sft.py`](https://github.com/huggingface/trl/blob/main/trl/scripts/sft.py)| This script shows how to use the [`SFTTrainer`] to fine-tune a model. |
|[`examples/scripts/sft_gemma3.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_gemma3.py)| This script shows how to use the [`SFTTrainer`] to fine-tune a Gemma 3 model. |
|[`examples/scripts/sft_video_llm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_video_llm.py)| This script shows how to use the [`SFTTrainer`] to fine-tune a Video Language Model. |
`docs/source/grpo_trainer.md` (10 additions, 8 deletions)
@@ -14,10 +14,10 @@ This post-training method was contributed by [Quentin Gallouédec](https://huggi
## Quick start

- This example demonstrates how to train a model using the GRPO method. We train a [Qwen 0.5B Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) with the prompts from the [TLDR dataset](https://huggingface.co/datasets/trl-lib/tldr) (completion column is ignored!). You can view the data in the dataset here:
+ This example demonstrates how to train a model using the GRPO method. We train a [Qwen 0.5B Instruct model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) with the prompts from the [UltraFeedback prompts dataset](https://huggingface.co/datasets/trl-lib/ultrafeedback-prompt). You can view the data in the dataset here:
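A minimal training sketch consistent with the added description (hedged: the toy `reward_len` reward function and the `output_dir` name are illustrative and not part of this diff):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Prompts-only dataset referenced in the added line above.
dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

# Toy reward used only for illustration: prefer completions whose length is close to 20.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

training_args = GRPOConfig(output_dir="Qwen2-0.5B-GRPO")
trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```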
@@ -68,7 +70,7 @@ At each training step, we sample a batch of prompts and generate a set of \\( G
### Computing the advantage

- For each of the \\( G \\) sequences, we compute the reward using a reward model. To align with the comparative nature of reward models—typically trained on datasets of comparisons between outputs for the same question—the advantage is calculated to reflect these relative comparisons. It is normalized as follows:
+ For each of the \\( G \\) sequences, we compute the reward using a reward model or reward function. To align with the comparative nature of reward models—typically trained on datasets of comparisons between outputs for the same question—the advantage is calculated to reflect these relative comparisons. It is normalized as follows:
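A small numerical sketch of the group-wise normalization described above (assuming GRPO's mean/std normalization over the \\( G \\) rewards of a single prompt; the reward values and the small epsilon are illustrative):

```python
import torch

# Rewards of the G completions sampled for one prompt (illustrative values).
rewards = torch.tensor([0.2, 1.3, 0.7, 0.5])

# Advantage: subtract the group mean and divide by the group standard deviation.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(advantages)
```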
RLOO is a variant of REINFORCE that reduces variance by using leave-one-out baselines. It computes advantages by comparing each sample's reward against the average reward of all other samples in the batch, providing more stable gradients than standard REINFORCE. To reproduce the paper's setting, use this configuration:
```python
from trl import RLOOConfig

training_args = RLOOConfig(
    per_device_train_batch_size=512,  # Section C (Training Details) of the paper
    steps_per_generation=2,           # Section C (Training Details) of the paper
    beta=0.03,                        # Section C (Training Details) of the paper
    num_generations=2,                # the paper's experiments use num_generations in {2, 4}
    learning_rate=1e-6,               # Section C (Training Details) of the paper
)
```
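As a small numerical sketch of the leave-one-out baseline described above (not TRL code; the reward values and variable names are illustrative):

```python
import torch

# Rewards of k completions sampled for the same prompt (illustrative values).
rewards = torch.tensor([0.9, 0.1, 0.4, 0.6])
k = rewards.numel()

# Leave-one-out baseline: for each sample, the mean reward of the other k - 1 samples.
baseline = (rewards.sum() - rewards) / (k - 1)

# REINFORCE advantage with the RLOO baseline.
advantages = rewards - baseline
print(advantages)
```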
## AlphaPO -- Reward shape matters for LLM alignment