
Commit 1bcfc50

behroozazarkhalili, Invidia19, and qgallouedec authored
Move XPOTrainer to trl.experimental.xpo (#4485)
Co-authored-by: Invidia19 <54266187+Invidia19@users.noreply.github.com>
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
1 parent 37942bc commit 1bcfc50

15 files changed: +667 -590 lines changed
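For downstream users, the practical effect of this commit is a changed import path; the class and config names stay the same. A minimal migration sketch based on the import changes in the diffs below (the `PairRMJudge` line is only there to show that other top-level imports are unaffected):

```python
# Old import path (removed by this commit):
# from trl import XPOConfig, XPOTrainer

# New import path, as used throughout the updated docs, example script, and tests:
from trl.experimental.xpo import XPOConfig, XPOTrainer

# Other symbols keep their top-level import, e.g. the judge used in the XPO docs:
from trl import PairRMJudge
```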


docs/source/_toctree.yml

Lines changed: 2 additions & 2 deletions
@@ -82,8 +82,6 @@
   title: RLOO
 - local: sft_trainer
   title: SFT
-- local: xpo_trainer
-  title: XPO
   title: Trainers
 - local: models
   title: Model Classes
@@ -119,6 +117,8 @@
   title: GSPO-token
 - local: papo_trainer
   title: PAPO
+- local: xpo_trainer
+  title: XPO
 - local: openenv
   title: OpenEnv Integration
 title: Experimental

docs/source/dataset_formats.md

Lines changed: 1 addition & 1 deletion
@@ -401,7 +401,7 @@ Choosing the right dataset type depends on the task you are working on and the s
 | [`RewardTrainer`] | [Preference (implicit prompt recommended)](#preference) |
 | [`RLOOTrainer`] | [Prompt-only](#prompt-only) |
 | [`SFTTrainer`] | [Language modeling](#language-modeling) or [Prompt-completion](#prompt-completion) |
-| [`XPOTrainer`] | [Prompt-only](#prompt-only) |
+| [`experimental.xpo.XPOTrainer`] | [Prompt-only](#prompt-only) |

 ## Using any dataset with TRL: preprocessing and conversion

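For reference, a prompt-only record of the kind this table points the XPO trainer at looks roughly like the following; the example strings are illustrative and not taken from this diff:

```python
# Standard (plain-text) prompt-only example
standard_example = {"prompt": "The sky is"}

# Conversational prompt-only example; the trainer applies the chat template itself
conversational_example = {"prompt": [{"role": "user", "content": "What color is the sky?"}]}
```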

docs/source/example_overview.md

Lines changed: 1 addition & 1 deletion
@@ -66,7 +66,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
 | [`examples/scripts/sft_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Vision Language Model in a chat setting. The script has only been tested with [LLaVA 1.5](https://huggingface.co/llava-hf/llava-1.5-7b-hf), [LLaVA 1.6](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf), and [Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) models, so users may see unexpected behaviour in other model architectures. |
 | [`examples/scripts/sft_vlm_gemma3.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm_gemma3.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a Gemma 3 model on vision to text tasks. |
 | [`examples/scripts/sft_vlm_smol_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm_smol_vlm.py) | This script shows how to use the [`SFTTrainer`] to fine-tune a SmolVLM model. |
-| [`examples/scripts/xpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/xpo.py) | This script shows how to use the [`XPOTrainer`] to fine-tune a model. |
+| [`examples/scripts/xpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/xpo.py) | This script shows how to use the [`experimental.xpo.XPOTrainer`] to fine-tune a model. |

 ## Distributed Training (for scripts)


docs/source/index.md

Lines changed: 1 addition & 1 deletion
@@ -26,7 +26,7 @@ Below is the current list of TRL trainers, organized by method type (⚡️ = vL
 - [`RLOOTrainer`] ⚡️
 - [`OnlineDPOTrainer`] ⚡️
 - [`NashMDTrainer`] ⚡️
-- [`XPOTrainer`] ⚡️
+- [`experimental.xpo.XPOTrainer`] 🧪 ⚡️
 - [`PPOTrainer`]

 ### Reward modeling

docs/source/vllm_integration.md

Lines changed: 4 additions & 4 deletions
@@ -11,7 +11,7 @@ This document will guide you through the process of using vLLM with TRL for fast
 > - [`GRPOTrainer`]
 > - [`OnlineDPOTrainer`]
 > - [`NashMDTrainer`]
-> - [`XPOTrainer`]
+> - [`experimental.xpo.XPOTrainer`]
 > - [`RLOOTrainer`]

 ## 🚀 How can I use vLLM with TRL to speed up training?
@@ -135,7 +135,7 @@ trainer.train()

 ```python
 from datasets import load_dataset
-from trl import XPOTrainer, XPOConfig
+from trl.experimental.xpo import XPOTrainer, XPOConfig

 dataset = load_dataset("trl-lib/tldr", split="train")

@@ -392,7 +392,7 @@ training_args = NashMDConfig(
 <hfoption id="XPO">

 ```python
-from trl import XPOConfig
+from trl.experimental.xpo import XPOConfig

 training_args = XPOConfig(
     ...,
@@ -467,7 +467,7 @@ training_args = NashMDConfig(
 <hfoption id="XPO">

 ```python
-from trl import XPOConfig
+from trl.experimental.xpo import XPOConfig

 training_args = XPOConfig(
     ...,
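The two `XPOConfig(...,` blocks above are truncated in this diff. As a rough sketch of what such a vLLM-enabled config can look like under the new import path; the `use_vllm` and `vllm_mode` arguments are assumptions taken from the surrounding vLLM integration guide, not from this commit, so verify them against the `XPOConfig` signature of your TRL version:

```python
# Sketch only: use_vllm / vllm_mode mirror the surrounding vLLM guide and are not
# part of this diff; check your TRL version's XPOConfig signature before relying on them.
from trl.experimental.xpo import XPOConfig

training_args = XPOConfig(
    output_dir="Qwen2-0.5B-XPO",
    use_vllm=True,       # generate completions with vLLM instead of transformers
    vllm_mode="server",  # or "colocate", matching the two <hfoption> blocks above
)
```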

docs/source/xpo_trainer.md

Lines changed: 11 additions & 7 deletions
@@ -12,6 +12,9 @@ The abstract from the paper is the following:

 This post-training method was contributed by [Kashif Rasul](https://huggingface.co/kashif), [Quentin Gallouédec](https://huggingface.co/qgallouedec) and [Lewis Tunstall](https://huggingface.co/lewtun).

+> [!NOTE]
+> XPO is currently experimental. The API may change without notice while the feature is iterated on.
+
 ## Quick start

 This example demonstrates how to train a model using the XPO method. We use the [Qwen 0.5B model](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) as the base model and [`PairRMJudge`] as a judge. We use the prompts from the [UltraFeedback dataset](https://huggingface.co/datasets/openbmb/UltraFeedback). You can view the prompts in the dataset here:
@@ -27,7 +30,8 @@ Below is the script to train the model:
 ```python
 # train_xpo.py
 from datasets import load_dataset
-from trl import PairRMJudge, XPOConfig, XPOTrainer
+from trl import PairRMJudge
+from trl.experimental.xpo import XPOConfig, XPOTrainer
 from transformers import AutoModelForCausalLM, AutoTokenizer

 model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
@@ -62,7 +66,7 @@ The best programming language depends on individual preferences and familiarity

 ## Expected dataset type

-XPO requires a [prompt-only dataset](dataset_formats#prompt-only). The [`XPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
+XPO requires a [prompt-only dataset](dataset_formats#prompt-only). The [`experimental.xpo.XPOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset format. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.

 ## Usage tips

@@ -89,7 +93,7 @@ Instead of a judge, you can chose to use a reward model -- see [Reward Bench](ht

 ### Encourage EOS token generation

-When using a reward model, we may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum length specified in the `max_new_tokens` argument of [`XPOConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum length, you can use the `missing_eos_penalty` argument of [`XPOConfig`]:
+When using a reward model, we may want the model to generate completions within a given length. During training, the model will generate completions up to the maximum length specified in the `max_new_tokens` argument of [`experimental.xpo.XPOConfig`]. If you want to penalize the model for not generating an EOS token before reaching the maximum length, you can use the `missing_eos_penalty` argument of [`experimental.xpo.XPOConfig`]:

 ```python
 training_args = XPOConfig(..., max_new_tokens=128, missing_eos_penalty=1.0)
@@ -145,16 +149,16 @@ While training and evaluating we record the following reward metrics:
 * `logps/rejected`: The mean log probabilities of the rejected completions.
 * `val/model_contain_eos_token`: The amount of times the model's output contains the eos token.
 * `val/ref_contain_eos_token`: The amount of times the reference's output contains the eos token.
-* `alpha`: The weight of the XPO loss term. Typically fixed, but can be made dynamic by passing a list to [`XPOConfig`].
-* `beta`: The parameter that controls the weight of the loss term representing the deviation from the reference model. Typically fixed, but can be made dynamic by passing a list to [`XPOConfig`].
+* `alpha`: The weight of the XPO loss term. Typically fixed, but can be made dynamic by passing a list to [`experimental.xpo.XPOConfig`].
+* `beta`: The parameter that controls the weight of the loss term representing the deviation from the reference model. Typically fixed, but can be made dynamic by passing a list to [`experimental.xpo.XPOConfig`].

 ## XPOTrainer

-[[autodoc]] XPOTrainer
+[[autodoc]] experimental.xpo.XPOTrainer
     - train
     - save_model
     - push_to_hub

 ## XPOConfig

-[[autodoc]] XPOConfig
+[[autodoc]] experimental.xpo.XPOConfig
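Putting the doc changes above together, a complete quick-start script under the new import path would look roughly like this. The imports and the model line come from the hunk above; the judge setup, dataset name, and trainer arguments are reconstructed from the upstream XPO docs rather than from this diff, so treat them as a sketch:

```python
# train_xpo.py -- sketch; everything past the model line is reconstructed from the
# upstream XPO docs, not from this diff.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from trl import PairRMJudge
from trl.experimental.xpo import XPOConfig, XPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = XPOConfig(output_dir="Qwen2-0.5B-XPO")
trainer = XPOTrainer(
    model=model,
    judge=judge,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=train_dataset,
)
trainer.train()
```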

examples/scripts/xpo.py

Lines changed: 1 addition & 2 deletions
@@ -52,11 +52,10 @@
     PairRMJudge,
     ScriptArguments,
     TrlParser,
-    XPOConfig,
-    XPOTrainer,
     get_kbit_device_map,
     get_quantization_config,
 )
+from trl.experimental.xpo import XPOConfig, XPOTrainer


 # Enable logging in a Hugging Face Space

tests/experimental/test_trainers_args.py

Lines changed: 25 additions & 1 deletion
@@ -12,10 +12,12 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import pytest
 from datasets import load_dataset
-from transformers import AutoTokenizer
+from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer

 from trl.experimental.bco import BCOConfig, BCOTrainer
+from trl.experimental.xpo import XPOConfig, XPOTrainer

 from ..testing_utils import TrlTestCase, require_sklearn

@@ -68,3 +70,25 @@ def test_bco(self):
         assert trainer.args.prompt_sample_size == 512
         assert trainer.args.min_density_ratio == 0.2
         assert trainer.args.max_density_ratio == 20.0
+
+    @pytest.mark.parametrize("alpha_list", [False, True])
+    def test_xpo(self, alpha_list):
+        model_id = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
+        tokenizer = AutoTokenizer.from_pretrained(model_id)
+        model = AutoModelForCausalLM.from_pretrained(model_id)
+        ref_model = AutoModelForCausalLM.from_pretrained(model_id)
+        reward_model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
+        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
+        training_args = XPOConfig(
+            self.tmp_dir,
+            alpha=0.5 if not alpha_list else [0.5, 0.6],
+        )
+        trainer = XPOTrainer(
+            args=training_args,
+            processing_class=tokenizer,
+            model=model,
+            ref_model=ref_model,
+            reward_funcs=reward_model,
+            train_dataset=dataset,
+        )
+        assert trainer.args.alpha == (0.5 if not alpha_list else [0.5, 0.6])
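The `alpha_list` parametrization above exercises the behaviour noted in the XPO docs, where `alpha` (and `beta`) can be either a single float or a list acting as a schedule. A tiny sketch of the two call styles; the `output_dir` value here is just a placeholder:

```python
from trl.experimental.xpo import XPOConfig

# A single float keeps alpha fixed for the whole run
fixed = XPOConfig(output_dir="xpo-out", alpha=0.5)

# A list is accepted as well (described upstream as a dynamic schedule), which is
# exactly what the parametrized test asserts round-trips through trainer.args.alpha
scheduled = XPOConfig(output_dir="xpo-out", alpha=[0.5, 0.6])
```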

tests/test_xpo_trainer.py renamed to tests/experimental/test_xpo_trainer.py

Lines changed: 3 additions & 2 deletions
@@ -17,15 +17,16 @@
 from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
 from transformers.utils import is_peft_available

-from trl import XPOConfig, XPOTrainer
+from trl.experimental.xpo import XPOConfig, XPOTrainer

-from .testing_utils import RandomPairwiseJudge, TrlTestCase, require_llm_blender, require_peft
+from ..testing_utils import RandomPairwiseJudge, TrlTestCase, require_llm_blender, require_peft


 if is_peft_available():
     from peft import LoraConfig, get_peft_model


+@pytest.mark.low_priority
 class TestXPOTrainer(TrlTestCase):
     def setup_method(self):
         self.model_id = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"

tests/test_trainers_args.py

Lines changed: 0 additions & 24 deletions
@@ -34,8 +34,6 @@
     RewardTrainer,
     SFTConfig,
     SFTTrainer,
-    XPOConfig,
-    XPOTrainer,
 )

 from .testing_utils import TrlTestCase
@@ -320,25 +318,3 @@ def test_sft(self):
         assert "append_concat_token" in trainer.args.dataset_kwargs
         assert trainer.args.dataset_kwargs["append_concat_token"]
         assert trainer.args.eval_packing
-
-    @pytest.mark.parametrize("alpha_list", [False, True])
-    def test_xpo(self, alpha_list):
-        model_id = "trl-internal-testing/tiny-Qwen2ForCausalLM-2.5"
-        tokenizer = AutoTokenizer.from_pretrained(model_id)
-        model = AutoModelForCausalLM.from_pretrained(model_id)
-        ref_model = AutoModelForCausalLM.from_pretrained(model_id)
-        reward_model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=1)
-        dataset = load_dataset("trl-internal-testing/zen", "standard_prompt_only", split="train")
-        training_args = XPOConfig(
-            self.tmp_dir,
-            alpha=0.5 if not alpha_list else [0.5, 0.6],
-        )
-        trainer = XPOTrainer(
-            args=training_args,
-            processing_class=tokenizer,
-            model=model,
-            ref_model=ref_model,
-            reward_funcs=reward_model,
-            train_dataset=dataset,
-        )
-        assert trainer.args.alpha == (0.5 if not alpha_list else [0.5, 0.6])
