🐾 Process-supervised RM Trainer #2127

Merged
merged 140 commits on Dec 13, 2024
Changes from 18 commits
Commits
140 commits
357a8c6
initial skeleton
gaetanlop Sep 26, 2024
841f7a1
tokenize fn
gaetanlop Sep 26, 2024
641e899
adding bos and eos to tokenization fn
gaetanlop Sep 26, 2024
106bc0e
prmtrainer
gaetanlop Sep 27, 2024
0163dcc
fixing small typo in tokenize
gaetanlop Sep 27, 2024
c2720d7
typo in input_ids and labels construction
gaetanlop Sep 27, 2024
5034083
numpy dimension
gaetanlop Sep 27, 2024
8818b6a
introduce the stepwise reward trainer
gaetanlop Sep 28, 2024
b777d1c
update markdown files
gaetanlop Sep 28, 2024
afa9e0a
let user decide post step separator in config
gaetanlop Sep 28, 2024
2dd752d
doc post_step_separator
gaetanlop Sep 28, 2024
613d838
do not add post step_tokens to last step of the reasoning process
gaetanlop Sep 28, 2024
b96ef4d
renaming prm to stepwisereward
gaetanlop Sep 28, 2024
161f5de
formatting
gaetanlop Sep 28, 2024
93e6652
fix tokenize kwargs
gaetanlop Sep 28, 2024
3ec4ebe
adapt test to the new post_token args
gaetanlop Sep 28, 2024
1461a61
adding example script
gaetanlop Sep 28, 2024
8c4ac31
fix small typo
gaetanlop Sep 28, 2024
8b3fa52
add create_model_card and renaming
gaetanlop Oct 1, 2024
8e4e159
fixing booleans
gaetanlop Oct 1, 2024
c60bc40
Adding the new stepwise_preference instead of placeholders for datasets
gaetanlop Oct 1, 2024
614fb4e
formatting
gaetanlop Oct 1, 2024
c582464
Merge branch 'main' into prmtrainer
qgallouedec Oct 1, 2024
424af34
Merge branch 'main' into prmtrainer
kashif Oct 8, 2024
b00e32b
Update docs/source/_toctree.yml
gaetanlop Oct 12, 2024
d5f780a
Update examples/scripts/stepwise_reward_modeling.py
gaetanlop Oct 12, 2024
f02056a
Update trl/trainer/stepwise_reward_trainer.py
gaetanlop Oct 12, 2024
3ac323f
Update trl/trainer/stepwise_reward_trainer.py
gaetanlop Oct 12, 2024
436dfd7
update push to hub
gaetanlop Oct 12, 2024
f4e6d4e
step_separator can't be None
gaetanlop Oct 12, 2024
6947aef
Merge branch 'main' into prmtrainer
gaetanlop Oct 12, 2024
e0c0648
fix suggested typos
gaetanlop Oct 12, 2024
35de0ee
add citation
gaetanlop Oct 12, 2024
c3eb08e
reformat doc
gaetanlop Oct 12, 2024
898f621
reordering init
gaetanlop Oct 13, 2024
3a488e0
push to hub prm800k
gaetanlop Oct 13, 2024
a03aed8
changing dataset in example
gaetanlop Oct 13, 2024
e77eee2
change dataset format to align with the sky is blue example
gaetanlop Oct 13, 2024
6c62c69
Merge branch 'main' into prmtrainer
gaetanlop Oct 13, 2024
e8e93f1
fix tokenization column names
gaetanlop Oct 13, 2024
2059c51
fix num labels in openai example
gaetanlop Oct 13, 2024
701241b
add support for conversational dataset
gaetanlop Oct 13, 2024
6bb467b
remove training whitespace
gaetanlop Oct 13, 2024
6b2bd97
Merge branch 'main' into prmtrainer
gaetanlop Oct 14, 2024
2030a83
replace tokenizer with processing class
gaetanlop Oct 14, 2024
66baada
Merge branch 'prmtrainer' of https://github.com/gaetanlop/trl into pr…
gaetanlop Oct 14, 2024
b47eea5
Merge branch 'main' into prmtrainer
qgallouedec Nov 18, 2024
9b1693d
Merge branch 'main' into prmtrainer
gaetanlop Nov 24, 2024
086ea8f
Update docs/source/dataset_formats.mdx
gaetanlop Nov 24, 2024
fe440de
remove openai_prm800k
gaetanlop Nov 24, 2024
468502b
Update trl/trainer/stepwise_reward_trainer.py
gaetanlop Nov 24, 2024
d205064
Update trl/trainer/stepwise_reward_trainer.py
gaetanlop Nov 24, 2024
6128a7f
Merge branch 'prmtrainer' of https://github.com/gaetanlop/trl into pr…
gaetanlop Nov 24, 2024
faf1051
Update docs/source/stepwise_reward_trainer.mdx
gaetanlop Nov 24, 2024
dfe7e04
Update docs/source/stepwise_reward_trainer.mdx
gaetanlop Nov 24, 2024
fc702be
renaming
gaetanlop Nov 24, 2024
a65e30c
renaming
gaetanlop Nov 24, 2024
d53ad35
minor renamings in docs
gaetanlop Nov 24, 2024
24d2f1a
using prm800k instead of openai_prm800k
gaetanlop Nov 24, 2024
4fd282e
update num labels to 2 following the new format
gaetanlop Nov 24, 2024
2c9d2f3
changing doc examples to math examples
gaetanlop Nov 24, 2024
91a3de8
change reference to dataset_formats.mdx
gaetanlop Nov 24, 2024
97ef925
changing dataset config in test
gaetanlop Nov 24, 2024
754ba44
remove conversational dataset support
gaetanlop Nov 25, 2024
a7bac4e
remove conv dataset support
gaetanlop Nov 25, 2024
916f87e
fix bos token
gaetanlop Nov 25, 2024
364d7d8
fix scriptarguments in example
gaetanlop Nov 25, 2024
5a6970d
completion to completions
gaetanlop Nov 25, 2024
e445bad
remove valuerror for step_separator inside steps
gaetanlop Nov 25, 2024
fb15691
run precommit
gaetanlop Nov 25, 2024
1c76266
Merge branch 'main' into prmtrainer
gaetanlop Nov 25, 2024
9ae131a
Merge branch 'main' into prmtrainer
gaetanlop Nov 26, 2024
84c28fe
remove conv dataset support
gaetanlop Nov 26, 2024
16e4ef8
renaming zen dataset
gaetanlop Nov 26, 2024
147c375
remove unused printing
gaetanlop Nov 26, 2024
e310b0e
unknown label column
gaetanlop Nov 26, 2024
59f1e9f
introduce the train on last step arg
gaetanlop Nov 26, 2024
b057cf7
_tokenize support train_on_last_step
gaetanlop Nov 26, 2024
3a034d0
incorporate train_on_last_step to tests
gaetanlop Nov 26, 2024
8dce558
formatting
gaetanlop Nov 26, 2024
69adb5c
remove comments in trainer
gaetanlop Nov 26, 2024
be6e843
Refactor `tokenize_row`
qgallouedec Nov 26, 2024
e8c782d
Update max_completion_length parameter in StepwiseRewardConfig
qgallouedec Nov 26, 2024
4c83f41
Collator
qgallouedec Nov 26, 2024
a93138f
Update comment
qgallouedec Nov 26, 2024
072794a
Update type hint
qgallouedec Nov 26, 2024
5b10e38
fix table
qgallouedec Nov 26, 2024
5a8d0a2
Remove collator
qgallouedec Nov 26, 2024
f4ba54f
don't need pad token id
qgallouedec Nov 26, 2024
fd204d7
add error back
qgallouedec Nov 26, 2024
ebc8fb1
max length args
qgallouedec Nov 26, 2024
95a4a46
use tokenizer arg
qgallouedec Nov 26, 2024
46b6bd6
Update doc
qgallouedec Nov 26, 2024
201bdf2
label -> labels
qgallouedec Nov 26, 2024
4f28ed7
Merge pull request #1 from huggingface/prm-trainer-qgallouedec
gaetanlop Nov 27, 2024
0527531
Merge branch 'main' into prmtrainer
gaetanlop Nov 27, 2024
228aa31
fixing tokenization issues in tokenize row
gaetanlop Nov 27, 2024
aa33e62
correct labels for token classification
gaetanlop Nov 27, 2024
4cd0b79
adding max_length to tokenize_row
gaetanlop Nov 27, 2024
c58db4b
reformat tests
gaetanlop Nov 27, 2024
1385f46
adding tests for tokenize row
gaetanlop Nov 27, 2024
b2d45a8
fixing typos in comments
gaetanlop Nov 27, 2024
3d7d37d
update doc
gaetanlop Nov 28, 2024
ad3bd25
Add math_shepherd.py script for dataset processing
qgallouedec Nov 28, 2024
1cc6c8a
split the dataset
qgallouedec Nov 28, 2024
7273a3b
Merge pull request #2 from huggingface/prm-trainer-qgallouedec-2
gaetanlop Nov 29, 2024
b4e676b
Merge branch 'main' into prmtrainer
gaetanlop Nov 29, 2024
150500f
Merge branch 'main' into prmtrainer
qgallouedec Nov 29, 2024
30bb2c3
Merge branch 'main' into prmtrainer
gaetanlop Dec 1, 2024
32bb0b1
formatting
gaetanlop Dec 1, 2024
dec7bad
same evaluation method for the two training methods
gaetanlop Dec 2, 2024
e4fc400
adding filtering to example script
gaetanlop Dec 2, 2024
4ff8674
formatting
gaetanlop Dec 2, 2024
7787b98
Merge branch 'main' into prmtrainer
gaetanlop Dec 3, 2024
0d81c04
Merge branch 'main' into prmtrainer
qgallouedec Dec 9, 2024
049fdf9
Add features to avoid casting labels to bool in dataset tokenization
qgallouedec Dec 9, 2024
62b7465
Update docs/source/stepwise_reward_trainer.mdx [ci skip]
qgallouedec Dec 9, 2024
b62d74b
Add learning_rate parameter to StepwiseRewardConfig class
qgallouedec Dec 9, 2024
8d6a879
update doc
qgallouedec Dec 9, 2024
7da024c
Remove unused setup_chat_format function
qgallouedec Dec 9, 2024
c1f83ea
Fix warning message in stepwise_reward_modeling.py
qgallouedec Dec 9, 2024
a2d5837
Update logging steps in stepwise_reward_trainer.mdx
qgallouedec Dec 9, 2024
7146aff
little doc change [ci skip]
qgallouedec Dec 9, 2024
92be608
Merge branch 'main' into prmtrainer
qgallouedec Dec 10, 2024
ae677b1
Fix copyrights
qgallouedec Dec 10, 2024
7b88981
fix space after copyrights
qgallouedec Dec 10, 2024
c4faf19
Merge branch 'main' into prmtrainer
qgallouedec Dec 10, 2024
f164711
Update dataset loading in stepwise_reward_modeling.py
qgallouedec Dec 10, 2024
4572a21
refine compute_accuracy and proper test
qgallouedec Dec 10, 2024
75b50af
fix tests
qgallouedec Dec 10, 2024
2ebf9da
style
qgallouedec Dec 10, 2024
83e174e
Merge branch 'main' into prmtrainer
qgallouedec Dec 10, 2024
0d48cfa
Merge branch 'main' into prmtrainer
gaetanlop Dec 13, 2024
c4f6a62
renamings
gaetanlop Dec 13, 2024
81574f5
renaming in init
gaetanlop Dec 13, 2024
823825d
doc renaming
gaetanlop Dec 13, 2024
68e16f5
fix sorting and tag
qgallouedec Dec 13, 2024
9609ac8
experiemental [ci skip]
qgallouedec Dec 13, 2024
54011c9
trigger CI
qgallouedec Dec 13, 2024
686edfb
other doc fix
qgallouedec Dec 13, 2024
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -50,6 +50,8 @@
title: RLOO
- local: sft_trainer
title: SFT
- local: stepwise_reward_trainer
title: StepwiseReward
- local: iterative_sft_trainer
title: Iterative SFT
- local: xpo_trainer
50 changes: 36 additions & 14 deletions docs/source/dataset_formats.mdx
@@ -76,6 +76,19 @@ The *format* of a dataset refers to how the data is structured, typically catego
"label": False}</code></pre>
</td>
</tr>
<tr>
<td>Stepwise preference</td>
<td>
<pre><code>{"prompt": "Two apples and one orange cost 1.5 euros. Four apples and one orange cost 2.5 euros. What's the price of an apple?",
"stepwise_completion": ["Let a represent the price of an apple.", "Let b represent the price of an orange"],
"stepwise_labels": ["True", "True"]}</code></pre>
</td>
<td>
<pre><code>{"prompt": [{"role": "system", "content": "You are a very skilled mathematician."}, {"role": "user", "content": "Two apples and one orange cost 1.5 euros. Four apples and one orange cost 2.5 euros. What's the price of an apple?"}],
"stepwise_completion": ["Let a represent the price of an apple.", "Let b represent the price of an orange"],
"stepwise_labels": ["True", "True"]}</code></pre>
</td>
</tr>
</table>


@@ -188,24 +201,33 @@ An unpaired preference dataset is similar to a preference dataset but instead of
unpaired_preference_example = {"prompt": "The sky is", "completion": " blue.", "label": True}
```

### Stepwise preference

A stepwise preference dataset is similar to an unpaired preference dataset but instead of having a single `"completion"` and `"label"`, it includes a `"stepwise_completion"` column that splits the completion into a list of steps and a `"stepwise_labels"` column indicating whether each step is correct.

```python
stepwise_preference_example = {"prompt": "Two apples and one orange cost 1.5 euros. Four apples and one orange cost 2.5 euros. What's the price of an apple?", "stepwise_completion": ["Let a represent the price of an apple.", "Let b represent the price of an orange"], "stepwise_labels": ["True", "True"]}
```
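
For the conversational format (as shown in the table above), the `"prompt"` is a list of chat messages while the stepwise columns keep the same structure:

```python
stepwise_preference_conversational_example = {
    "prompt": [
        {"role": "system", "content": "You are a very skilled mathematician."},
        {"role": "user", "content": "Two apples and one orange cost 1.5 euros. Four apples and one orange cost 2.5 euros. What's the price of an apple?"},
    ],
    "stepwise_completion": ["Let a represent the price of an apple.", "Let b represent the price of an orange"],
    "stepwise_labels": ["True", "True"],
}
```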

## Which dataset format to use?

Choosing the right dataset format depends on the task you are working on and the specific requirements of the TRL trainer you are using. Below is a brief overview of the dataset formats supported by each TRL trainer.

| Trainer | Expected dataset format |
| ----------------------- | ---------------------------- |
| [`BCOTrainer`] | Unpaired preference |
| [`CPOTrainer`] | Preference (explicit prompt) |
| [`DPOTrainer`] | Preference (explicit prompt) |
| [`IterativeSFTTrainer`] | Unpaired preference |
| [`KTOTrainer`] | Unpaired preference |
| [`NashMDTrainer`] | Prompt-only |
| [`OnlineDPOTrainer`] | Prompt-only |
| [`ORPOTrainer`] | Preference (explicit prompt) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`RewardTrainer`] | Preference (implicit prompt) |
| [`SFTTrainer`] | Language modeling |
| [`XPOTrainer`] | Prompt-only |
| Trainer | Expected dataset format |
| ------------------------- | ---------------------------- |
| [`BCOTrainer`] | Unpaired preference |
| [`CPOTrainer`] | Preference (explicit prompt) |
| [`DPOTrainer`] | Preference (explicit prompt) |
| [`IterativeSFTTrainer`] | Unpaired preference |
| [`KTOTrainer`] | Unpaired preference |
| [`NashMDTrainer`] | Prompt-only |
| [`OnlineDPOTrainer`] | Prompt-only |
| [`ORPOTrainer`] | Preference (explicit prompt) |
| [`PPOv2Trainer`] | Tokenized language modeling |
| [`RewardTrainer`] | Preference (implicit prompt) |
| [`StepwiseRewardTrainer`] | Stepwise preference |
| [`SFTTrainer`] | Language modeling |
| [`XPOTrainer`] | Prompt-only |

<Tip>

56 changes: 56 additions & 0 deletions docs/source/stepwise_reward_trainer.mdx
@@ -0,0 +1,56 @@
# Stepwise Reward Modeling

TRL supports stepwise reward modeling (also known as process-supervised reward modeling, or PRM for short) to give feedback on each intermediate reasoning step. While the [`RewardTrainer`] trains a reward model only to score an entire solution, the [`StepwiseRewardTrainer`] trains a reward model to score each intermediate step of the reasoning process.
Check out a complete example at [`examples/scripts/stepwise_reward_modeling.py`](https://github.com/huggingface/trl/tree/main/examples/scripts/stepwise_reward_modeling.py).

## Expected dataset format

The [`StepwiseRewardTrainer`] requires a [stepwise preference dataset](dataset_formats#stepwise-preference). This means the dataset should contain the columns `prompt`, `stepwise_completion`, and `stepwise_labels`.
The [`StepwiseRewardTrainer`] supports both [conversational](dataset_formats#conversational-dataset-format) and [standard](dataset_formats#standard-dataset-format) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.

You can also use a pretokenized dataset, in which case the dataset should contain the following columns: `input_ids`, `attention_mask` and `labels`.
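
For illustration only, a pretokenized row might look like the sketch below. The token ids are placeholders rather than the output of a real tokenizer, and it is assumed that positions which should not contribute to the loss use `-100`, the usual ignore index for token classification (the exact convention depends on how the data was tokenized):

```python
# Hypothetical pretokenized row (placeholder token ids, not produced by a real tokenizer).
# Assumption: labels are 0/1 on the positions to score and -100 elsewhere.
pretokenized_example = {
    "input_ids": [101, 9805, 2003, 1037, 2034, 3357, 102],
    "attention_mask": [1, 1, 1, 1, 1, 1, 1],
    "labels": [-100, -100, -100, -100, -100, 1, -100],
}
```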

## Using the `StepwiseRewardTrainer`

After preparing your dataset, you can use the [`StepwiseRewardTrainer`] in the same way as the `Trainer` class from 🤗 Transformers.
You should pass an `AutoModelForTokenClassification` model to the [`StepwiseRewardTrainer`], along with a [`StepwiseRewardConfig`] that configures the training hyperparameters.
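
As a reference point, a minimal (non-PEFT) setup might look like the sketch below; the dataset name is the placeholder used in the example script, and the hyperparameter values are illustrative only:

```python
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import StepwiseRewardConfig, StepwiseRewardTrainer

model_id = "Qwen/Qwen2-0.5B-Instruct"
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder dataset name, as in examples/scripts/stepwise_reward_modeling.py
dataset = load_dataset("trl-lib/PLACEHOLDER", split="train")

training_args = StepwiseRewardConfig(output_dir="Qwen2-0.5B-Stepwise-Reward", max_length=2048)
trainer = StepwiseRewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,
)
trainer.train()
```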

### Leveraging 🤗 PEFT to train a stepwise reward model

Just pass a `peft_config` in the keyword arguments of [`StepwiseRewardTrainer`], and the trainer should automatically take care of converting the model into a PEFT model!

```python
from peft import LoraConfig, TaskType
from transformers import AutoModelForTokenClassification, AutoTokenizer
from trl import StepwiseRewardTrainer, StepwiseRewardConfig

model = AutoModelForTokenClassification.from_pretrained("gpt2", num_labels=2)
peft_config = LoraConfig(
    task_type=TaskType.TOKEN_CLS,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

...

trainer = StepwiseRewardTrainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=dataset,
    peft_config=peft_config,
)

trainer.train()
```

## StepwiseRewardTrainer

[[autodoc]] StepwiseRewardTrainer

## StepwiseRewardConfig

[[autodoc]] StepwiseRewardConfig
130 changes: 130 additions & 0 deletions examples/scripts/stepwise_reward_modeling.py
@@ -0,0 +1,130 @@
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Full training:
python examples/scripts/stepwise_reward_modeling.py \
--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/PLACEHOLDER \
--output_dir Qwen2-0.5B-Reward \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_checkpointing True \
--learning_rate 1.0e-5 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048

LoRA:
python examples/scripts/stepwise_reward_modeling.py \

Member: If you have some compute, can you share some WandB logs from running these scripts? Otherwise I can run them myself :)

--model_name_or_path Qwen/Qwen2-0.5B-Instruct \
--dataset_name trl-lib/PLACEHOLDER \
--output_dir Qwen2-0.5B-Reward-LoRA \
--per_device_train_batch_size 8 \
--num_train_epochs 1 \
--gradient_checkpointing True \
--learning_rate 1.0e-4 \
--logging_steps 25 \
--eval_strategy steps \
--eval_steps 50 \
--max_length 2048 \
--use_peft \
--lora_r 32 \
--lora_alpha 16
"""

import warnings

import torch
from datasets import load_dataset
from transformers import AutoModelForTokenClassification, AutoTokenizer, HfArgumentParser

from trl import (
ModelConfig,
StepwiseRewardConfig,
StepwiseRewardTrainer,
get_kbit_device_map,
get_peft_config,
get_quantization_config,
setup_chat_format,
)
from trl.commands.cli_utils import RewardScriptArguments


if __name__ == "__main__":
    parser = HfArgumentParser((RewardScriptArguments, StepwiseRewardConfig, ModelConfig))
    args, training_args, model_config = parser.parse_args_into_dataclasses()
    training_args.gradient_checkpointing_kwargs = dict(use_reentrant=False)

    ################
    # Model & Tokenizer
    ################
    torch_dtype = (
        model_config.torch_dtype
        if model_config.torch_dtype in ["auto", None]
        else getattr(torch, model_config.torch_dtype)
    )
    quantization_config = get_quantization_config(model_config)
    model_kwargs = dict(
        revision=model_config.model_revision,
        torch_dtype=torch_dtype,
        device_map=get_kbit_device_map() if quantization_config is not None else None,
        quantization_config=quantization_config,
        use_cache=False if training_args.gradient_checkpointing else True,
    )
    tokenizer = AutoTokenizer.from_pretrained(
        model_config.model_name_or_path, trust_remote_code=model_config.trust_remote_code, use_fast=True
    )
    model = AutoModelForTokenClassification.from_pretrained(
        model_config.model_name_or_path, num_labels=2, trust_remote_code=model_config.trust_remote_code, **model_kwargs
    )
    # Align padding tokens between tokenizer and model
    model.config.pad_token_id = tokenizer.pad_token_id

    # If post-training a base model, use ChatML as the default template
    if tokenizer.chat_template is None:
        model, tokenizer = setup_chat_format(model, tokenizer)

    if model_config.use_peft and model_config.lora_task_type != "TOKEN_CLS":
        warnings.warn(
            "You are using a `task_type` that is different from `TOKEN_CLS` for PEFT. This will lead to silent bugs."
            " Make sure to pass --lora_task_type TOKEN_CLS when using this script with PEFT."
        )

    ##############
    # Load dataset
    ##############
    dataset = load_dataset(args.dataset_name)

    ##########
    # Training
    ##########
    trainer = StepwiseRewardTrainer(
        model=model,
        tokenizer=tokenizer,
        args=training_args,
        train_dataset=dataset[args.dataset_train_split],
        eval_dataset=dataset[args.dataset_test_split],
        peft_config=get_peft_config(model_config),
    )
    trainer.train()

    ############################
    # Save model and push to Hub
    ############################
    metrics = trainer.evaluate()
    trainer.log_metrics("eval", metrics)
    trainer.save_metrics("eval", metrics)
    trainer.save_model(training_args.output_dir)
    trainer.push_to_hub()