🐾 Process-supervised RM Trainer #2127

Merged
merged 140 commits on Dec 13, 2024
Changes from 18 commits
Commits
140 commits
357a8c6
initial skeleton
gaetanlop Sep 26, 2024
841f7a1
tokenize fn
gaetanlop Sep 26, 2024
641e899
adding bos and eos to tokenization fn
gaetanlop Sep 26, 2024
106bc0e
prmtrainer
gaetanlop Sep 27, 2024
0163dcc
fixing small typo in tokenize
gaetanlop Sep 27, 2024
c2720d7
typo in input_ids and labels construction
gaetanlop Sep 27, 2024
5034083
numpy dimension
gaetanlop Sep 27, 2024
8818b6a
introduce the stepwise reward trainer
gaetanlop Sep 28, 2024
b777d1c
update markdown files
gaetanlop Sep 28, 2024
afa9e0a
let user decide post step separator in config
gaetanlop Sep 28, 2024
2dd752d
doc post_step_separator
gaetanlop Sep 28, 2024
613d838
do not add post step_tokens to last step of the reasoning process
gaetanlop Sep 28, 2024
b96ef4d
renaming prm to stepwisereward
gaetanlop Sep 28, 2024
161f5de
formatting
gaetanlop Sep 28, 2024
93e6652
fix tokenize kwargs
gaetanlop Sep 28, 2024
3ec4ebe
adapt test to the new post_token args
gaetanlop Sep 28, 2024
1461a61
adding example script
gaetanlop Sep 28, 2024
8c4ac31
fix small typo
gaetanlop Sep 28, 2024
8b3fa52
add create_model_card and renaming
gaetanlop Oct 1, 2024
8e4e159
fixing booleans
gaetanlop Oct 1, 2024
c60bc40
Adding the new stepwise_preference instead of placeholders for datasets
gaetanlop Oct 1, 2024
614fb4e
formatting
gaetanlop Oct 1, 2024
c582464
Merge branch 'main' into prmtrainer
qgallouedec Oct 1, 2024
424af34
Merge branch 'main' into prmtrainer
kashif Oct 8, 2024
b00e32b
Update docs/source/_toctree.yml
gaetanlop Oct 12, 2024
d5f780a
Update examples/scripts/stepwise_reward_modeling.py
gaetanlop Oct 12, 2024
f02056a
Update trl/trainer/stepwise_reward_trainer.py
gaetanlop Oct 12, 2024
3ac323f
Update trl/trainer/stepwise_reward_trainer.py
gaetanlop Oct 12, 2024
436dfd7
update push to hub
gaetanlop Oct 12, 2024
f4e6d4e
step_separator can't be None
gaetanlop Oct 12, 2024
6947aef
Merge branch 'main' into prmtrainer
gaetanlop Oct 12, 2024
e0c0648
fix suggested typos
gaetanlop Oct 12, 2024
35de0ee
add citation
gaetanlop Oct 12, 2024
c3eb08e
reformat doc
gaetanlop Oct 12, 2024
898f621
reordering init
gaetanlop Oct 13, 2024
3a488e0
push to hub prm800k
gaetanlop Oct 13, 2024
a03aed8
changing dataset in example
gaetanlop Oct 13, 2024
e77eee2
change dataset format to align with the sky is blue example
gaetanlop Oct 13, 2024
6c62c69
Merge branch 'main' into prmtrainer
gaetanlop Oct 13, 2024
e8e93f1
fix tokenization column names
gaetanlop Oct 13, 2024
2059c51
fix num labels in openai example
gaetanlop Oct 13, 2024
701241b
add support for conversational dataset
gaetanlop Oct 13, 2024
6bb467b
remove training whitespace
gaetanlop Oct 13, 2024
6b2bd97
Merge branch 'main' into prmtrainer
gaetanlop Oct 14, 2024
2030a83
replace tokenizer with processing class
gaetanlop Oct 14, 2024
66baada
Merge branch 'prmtrainer' of https://github.com/gaetanlop/trl into pr…
gaetanlop Oct 14, 2024
b47eea5
Merge branch 'main' into prmtrainer
qgallouedec Nov 18, 2024
9b1693d
Merge branch 'main' into prmtrainer
gaetanlop Nov 24, 2024
086ea8f
Update docs/source/dataset_formats.mdx
gaetanlop Nov 24, 2024
fe440de
remove openai_prm800k
gaetanlop Nov 24, 2024
468502b
Update trl/trainer/stepwise_reward_trainer.py
gaetanlop Nov 24, 2024
d205064
Update trl/trainer/stepwise_reward_trainer.py
gaetanlop Nov 24, 2024
6128a7f
Merge branch 'prmtrainer' of https://github.com/gaetanlop/trl into pr…
gaetanlop Nov 24, 2024
faf1051
Update docs/source/stepwise_reward_trainer.mdx
gaetanlop Nov 24, 2024
dfe7e04
Update docs/source/stepwise_reward_trainer.mdx
gaetanlop Nov 24, 2024
fc702be
renaming
gaetanlop Nov 24, 2024
a65e30c
renaming
gaetanlop Nov 24, 2024
d53ad35
minor renamings in docs
gaetanlop Nov 24, 2024
24d2f1a
using prm800k instead of openai_prm800k
gaetanlop Nov 24, 2024
4fd282e
update num labels to 2 following the new format
gaetanlop Nov 24, 2024
2c9d2f3
changing doc examples to math examples
gaetanlop Nov 24, 2024
91a3de8
change reference to dataset_formats.mdx
gaetanlop Nov 24, 2024
97ef925
changing dataset config in test
gaetanlop Nov 24, 2024
754ba44
remove conversational dataset support
gaetanlop Nov 25, 2024
a7bac4e
remove conv dataset support
gaetanlop Nov 25, 2024
916f87e
fix bos token
gaetanlop Nov 25, 2024
364d7d8
fix scriptarguments in example
gaetanlop Nov 25, 2024
5a6970d
completion to completions
gaetanlop Nov 25, 2024
e445bad
remove valuerror for step_separator inside steps
gaetanlop Nov 25, 2024
fb15691
run precommit
gaetanlop Nov 25, 2024
1c76266
Merge branch 'main' into prmtrainer
gaetanlop Nov 25, 2024
9ae131a
Merge branch 'main' into prmtrainer
gaetanlop Nov 26, 2024
84c28fe
remove conv dataset support
gaetanlop Nov 26, 2024
16e4ef8
renaming zen dataset
gaetanlop Nov 26, 2024
147c375
remove unused printing
gaetanlop Nov 26, 2024
e310b0e
unknown label column
gaetanlop Nov 26, 2024
59f1e9f
introduce the train on last step arg
gaetanlop Nov 26, 2024
b057cf7
_tokenize support train_on_last_step
gaetanlop Nov 26, 2024
3a034d0
incorporate train_on_last_step to tests
gaetanlop Nov 26, 2024
8dce558
formatting
gaetanlop Nov 26, 2024
69adb5c
remove comments in trainer
gaetanlop Nov 26, 2024
be6e843
Refactor `tokenize_row`
qgallouedec Nov 26, 2024
e8c782d
Update max_completion_length parameter in StepwiseRewardConfig
qgallouedec Nov 26, 2024
4c83f41
Collator
qgallouedec Nov 26, 2024
a93138f
Update comment
qgallouedec Nov 26, 2024
072794a
Update type hint
qgallouedec Nov 26, 2024
5b10e38
fix table
qgallouedec Nov 26, 2024
5a8d0a2
Remove collator
qgallouedec Nov 26, 2024
f4ba54f
don't need pad token id
qgallouedec Nov 26, 2024
fd204d7
add error back
qgallouedec Nov 26, 2024
ebc8fb1
max length args
qgallouedec Nov 26, 2024
95a4a46
use tokenizer arg
qgallouedec Nov 26, 2024
46b6bd6
Update doc
qgallouedec Nov 26, 2024
201bdf2
label -> labels
qgallouedec Nov 26, 2024
4f28ed7
Merge pull request #1 from huggingface/prm-trainer-qgallouedec
gaetanlop Nov 27, 2024
0527531
Merge branch 'main' into prmtrainer
gaetanlop Nov 27, 2024
228aa31
fixing tokenization issues in tokenize row
gaetanlop Nov 27, 2024
aa33e62
correct labels for token classification
gaetanlop Nov 27, 2024
4cd0b79
adding max_length to tokenize_row
gaetanlop Nov 27, 2024
c58db4b
reformat tests
gaetanlop Nov 27, 2024
1385f46
adding tests for tokenize row
gaetanlop Nov 27, 2024
b2d45a8
fixing typos in comments
gaetanlop Nov 27, 2024
3d7d37d
update doc
gaetanlop Nov 28, 2024
ad3bd25
Add math_shepherd.py script for dataset processing
qgallouedec Nov 28, 2024
1cc6c8a
split the dataset
qgallouedec Nov 28, 2024
7273a3b
Merge pull request #2 from huggingface/prm-trainer-qgallouedec-2
gaetanlop Nov 29, 2024
b4e676b
Merge branch 'main' into prmtrainer
gaetanlop Nov 29, 2024
150500f
Merge branch 'main' into prmtrainer
qgallouedec Nov 29, 2024
30bb2c3
Merge branch 'main' into prmtrainer
gaetanlop Dec 1, 2024
32bb0b1
formatting
gaetanlop Dec 1, 2024
dec7bad
same evaluation method for the two training methods
gaetanlop Dec 2, 2024
e4fc400
adding filtering to example script
gaetanlop Dec 2, 2024
4ff8674
formatting
gaetanlop Dec 2, 2024
7787b98
Merge branch 'main' into prmtrainer
gaetanlop Dec 3, 2024
0d81c04
Merge branch 'main' into prmtrainer
qgallouedec Dec 9, 2024
049fdf9
Add features to avoid casting labels to bool in dataset tokenization
qgallouedec Dec 9, 2024
62b7465
Update docs/source/stepwise_reward_trainer.mdx [ci skip]
qgallouedec Dec 9, 2024
b62d74b
Add learning_rate parameter to StepwiseRewardConfig class
qgallouedec Dec 9, 2024
8d6a879
update doc
qgallouedec Dec 9, 2024
7da024c
Remove unused setup_chat_format function
qgallouedec Dec 9, 2024
c1f83ea
Fix warning message in stepwise_reward_modeling.py
qgallouedec Dec 9, 2024
a2d5837
Update logging steps in stepwise_reward_trainer.mdx
qgallouedec Dec 9, 2024
7146aff
little doc change [ci skip]
qgallouedec Dec 9, 2024
92be608
Merge branch 'main' into prmtrainer
qgallouedec Dec 10, 2024
ae677b1
Fix copyrights
qgallouedec Dec 10, 2024
7b88981
fix space after copyrights
qgallouedec Dec 10, 2024
c4faf19
Merge branch 'main' into prmtrainer
qgallouedec Dec 10, 2024
f164711
Update dataset loading in stepwise_reward_modeling.py
qgallouedec Dec 10, 2024
4572a21
refine compute_accuracy and proper test
qgallouedec Dec 10, 2024
75b50af
fix tests
qgallouedec Dec 10, 2024
2ebf9da
style
qgallouedec Dec 10, 2024
83e174e
Merge branch 'main' into prmtrainer
qgallouedec Dec 10, 2024
0d48cfa
Merge branch 'main' into prmtrainer
gaetanlop Dec 13, 2024
c4f6a62
renamings
gaetanlop Dec 13, 2024
81574f5
renaming in init
gaetanlop Dec 13, 2024
823825d
doc renaming
gaetanlop Dec 13, 2024
68e16f5
fix sorting and tag
qgallouedec Dec 13, 2024
9609ac8
experiemental [ci skip]
qgallouedec Dec 13, 2024
54011c9
trigger CI
qgallouedec Dec 13, 2024
686edfb
other doc fix
qgallouedec Dec 13, 2024
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -50,6 +50,8 @@
title: RLOO
- local: sft_trainer
title: SFT
- local: stepwise_reward_trainer
title: StepwiseReward
- local: iterative_sft_trainer
title: Iterative SFT
- local: xpo_trainer
50 changes: 36 additions & 14 deletions docs/source/dataset_formats.mdx
@@ -76,6 +76,19 @@ The *format* of a dataset refers to how the data is structured, typically catego
"label": False}</code></pre>
</td>
</tr>
<tr>
<td>Stepwise preference</td>
<td>
<pre><code>{"prompt": "Two apples and one orange cost 1.5 euros. Four apples and one orange cost 2.5 euros. What's the price of an apple?",
"stepwise_completion": ["Let a represent the price of an apple.", "Let b represent the price of an orange"],
"stepwise_labels": ["True", "True"]}</code></pre>
</td>
<td>
<pre><code>{"prompt": [{"role": "system", "content": "You are a very skilled mathematician."}, {"role": "user", "content": "Two apples and one orange cost 1.5 euros. Four apples and one orange cost 2.5 euros. What's the price of an apple?"}],
"stepwise_completion": ["Let a represent the price of an apple.", "Let b represent the price of an orange"],
"stepwise_labels": ["True", "True"]}</code></pre>
</td>
</tr>
</table>


@@ -188,24 +201,33 @@ An unpaired preference dataset is similar to a preference dataset but instead of
unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "label": True}
```

### Stepwise preference

A stepwise preference dataset is similar to an unpaired preference dataset but instead of having a single `"completion"` and `"label"`, it includes a `"stepwise_completion"` column that splits the completion into a list of steps and a `"stepwise_labels"` column indicating whether each step is correct.

```python
stepwise_preference_example = {"prompt": "Two apples and one orange cost 1.5 euros. Four apples and one orange cost 2.5 euros. What's the price of an apple?", "stepwise_completion": ["Let a represent the price of an apple.", "Let b represent the price of an orange"], "stepwise_labels": ["True", "True"]}
```
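
For the conversational format (as shown in the table above), the `"prompt"` is a list of chat messages while the stepwise columns keep the same structure:

```python
stepwise_preference_conversational_example = {
    "prompt": [
        {"role": "system", "content": "You are a very skilled mathematician."},
        {"role": "user", "content": "Two apples and one orange cost 1.5 euros. Four apples and one orange cost 2.5 euros. What's the price of an apple?"},
    ],
    "stepwise_completion": ["Let a represent the price of an apple.", "Let b represent the price of an orange"],
    "stepwise_labels": ["True", "True"],
}
```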

## Which dataset format to use?

Choosing the right dataset format depends on the task you are working on and the specific requirements of the TRL trainer you are using. Below is a brief overview of the dataset formats supported by each TRL trainer.

| Trainer | Expected dataset format |
| ----------------------- | ---------------------------- |
| [`BCOTrainer`] | Unpaired preference |
| [`CPOTrainer`] | Preference (explicit prompt) |
| [`DPOTrainer`] | Preference (explicit prompt) |
| [`IterativeSFTTrainer`] | Unpaired preference |
| [`KTOTrainer`] | Unpaired preference |
| [`NashMDTrainer`] | Prompt-only |
| [`OnlineDPOTrainer`] | Prompt-only |
| [`ORPOTrainer`] | Preference (explicit prompt) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`RewardTrainer`] | Preference (implicit prompt) |
| [`SFTTrainer`] | Language modeling |
| [`XPOTrainer`] | Prompt-only |
| Trainer | Expected dataset format |
| ------------------------- | ---------------------------- |
| [`BCOTrainer`] | Unpaired preference |
| [`CPOTrainer`] | Preference (explicit prompt) |
| [`DPOTrainer`] | Preference (explicit prompt) |
| [`IterativeSFTTrainer`] | Unpaired preference |
| [`KTOTrainer`] | Unpaired preference |
| [`NashMDTrainer`] | Prompt-only |
| [`OnlineDPOTrainer`] | Prompt-only |
| [`ORPOTrainer`] | Preference (explicit prompt) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`RewardTrainer`] | Preference (implicit prompt) |
| [`StepwiseRewardTrainer`] | Stepwise preference |
| [`SFTTrainer`] | Language modeling |
| [`XPOTrainer`] | Prompt-only |

<Tip>

56 changes: 56 additions & 0 deletions docs/source/stepwise_reward_trainer.mdx
@@ -0,0 +1,56 @@
# Stepwise Reward Modeling

TRL supports stepwise reward modeling (also known as process-supervised reward modeling, or PRM for short) to give feedback on each intermediate reasoning step. While the [`RewardTrainer`] trains a reward model only to score an entire solution, the [`StepwiseRewardTrainer`] trains a reward model to score each intermediate step of the reasoning process.
Check out a complete example at [`examples/scripts/stepwise_reward_modeling.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/stepwise_reward_modeling.py).

## Expected dataset format

The [`StepwiseRewardTrainer`] requires a [stepwise preference dataset](dataset_formats#stepwise-preference). This means the dataset should contain the columns `prompt`, `stepwise_completion`, and `stepwise_labels`.
The [`StepwiseRewardTrainer`] supports both [conversational](dataset_formats#conversational-dataset-format) and [standard](dataset_formats#standard-dataset-format) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.

You can also use a pretokenized dataset, in which case the dataset should contain the following columns: `input_ids`, `attention_mask` and `labels`.
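
For illustration only, a pretokenized row might look like the sketch below. The token ids are placeholders rather than the output of a real tokenizer, and it is assumed that positions which should not contribute to the loss use `-100`, the usual ignore index for token classification (the exact convention depends on how the data was tokenized):

```python
# Hypothetical pretokenized row (placeholder token ids, not produced by a real tokenizer).
# Assumption: labels are 0/1 on the positions to score and -100 elsewhere.
pretokenized_example = {
    "input_ids": [101, 9805, 2003, 1037, 2034, 3357, 102],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1],
    "labels": [-100, -100, -100, -100, -100, 1, -100],
}
```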

## Using the `StepwiseRewardTrainer`

After preparing your dataset, you can use the [`StepwiseRewardTrainer`] in the same way as the `Trainer` class from 🤗 Transformers.
You should pass an `AutoModelForTokenClassification` model to the [`StepwiseRewardTrainer`], along with a [`StepwiseRewardConfig`] that configures the training hyperparameters.
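
As a reference point, a minimal (non-PEFT) setup might look like the sketch below; the dataset name is the placeholder used in the example script, and the hyperparameter values are illustrative only:

```python
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import StepwiseRewardConfig, StepwiseRewardTrainer

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder dataset name, as in examples/scripts/stepwise_reward_modeling.py
dataset = load_dataset("trl-lib/PLACEHOLDER", split="train")

training_args = StepwiseRewardConfig(output_dir="Qwen2-0.5B-Stepwise-Reward", max_length=2048)
trainer = StepwiseRewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```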

### Leveraging 🤗 PEFT to train a stepwise reward model

Just pass a `peft_config` in the keyword arguments of [`StepwiseRewardTrainer`], and the trainer should automatically take care of converting the model into a PEFT model!

```python
from peft import LoraConfig, TaskType
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import StepwiseRewardTrainer, StepwiseRewardConfig

model = AutoModelForTokenClassification.from_pretrained("gpt2", num_labels=2)
peft_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

...

trainer = StepwiseRewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
)

trainer.train()
```

## StepwiseRewardTrainer

[[autodoc]] StepwiseRewardTrainer

## StepwiseRewardConfig

[[autodoc]] StepwiseRewardConfig
130 changes: 130 additions & 0 deletions examples/scripts/stepwise_reward_modeling.py
@@ -0,0 +1,130 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Full training:
python examples/scripts/stepwise_reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/PLACEHOLDER \
--output_dir Qwen2-0.5B-Reward \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_checkpointing True \
--learning_rate 1.0e-5 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048

LoRA:
python examples/scripts/stepwise_reward_modeling.py \

Member: If you have some compute, can you share some WandB logs from running these scripts? Otherwise I can run them myself :)

--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/PLACEHOLDER \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_checkpointing True \
--learning_rate 1.0e-4 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_r 32 \
--lora_alpha 16
"""

import warnings

import torch
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer, HfArgumentParser

from trl import (
ModelConfig,
StepwiseRewardConfig,
StepwiseRewardTrainer,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
setup_chat_format,
)
from trl.commands.cli_utils import RewardScriptArguments


if __name__ == "__main__":
    parser = HfArgumentParser((RewardScriptArguments, StepwiseRewardConfig, ModelConfig))
    args, training_args, model_config = parser.parse_args_into_dataclasses()
    training_args.gradient_checkpointing_kwargs = dict(use_reentrant=False)

    ################
    # Model & Tokenizer
    ################
    torch_dtype = (
        model_config.torch_dtype
        if model_config.torch_dtype in ["auto", None]
        else getattr(torch, model_config.torch_dtype)
    )
    quantization_config = get_quantization_config(model_config)
    model_kwargs = dict(
        revision=model_config.model_revision,
        torch_dtype=torch_dtype,
        device_map=get_kbit_device_map() if quantization_config is not None else None,
        quantization_config=quantization_config,
        use_cache=False if training_args.gradient_checkpointing else True,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code, use_fast=True
    )
    model = AutoModelForTokenClassification.from_pretrained(
        model_config.model_name_or_path, num_labels=2, trust_remote_code=model_config.trust_remote_code, **model_kwargs
    )
    # Align padding tokens between tokenizer and model
    model.config.pad_token_id = tokenizer.pad_token_id

    # If post-training a base model, use ChatML as the default template
    if tokenizer.chat_template is None:
        model, tokenizer = setup_chat_format(model, tokenizer)

    if model_config.use_peft and model_config.lora_task_type != "TOKEN_CLS":
        warnings.warn(
            "You are using a `task_type` that is different from `TOKEN_CLS` for PEFT. This will lead to silent bugs."
            " Make sure to pass --lora_task_type TOKEN_CLS when using this script with PEFT."
        )

    ##############
    # Load dataset
    ##############
    dataset = load_dataset(args.dataset_name)

    ##########
    # Training
    ##########
    trainer = StepwiseRewardTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=dataset[args.dataset_train_split],
        eval_dataset=dataset[args.dataset_test_split],
        peft_config=get_peft_config(model_config),
    )
    trainer.train()

    ############################
    # Save model and push to Hub
    ############################
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)
    trainer.save_model(training_args.output_dir)
    trainer.push_to_hub()