Merged
77 changes: 77 additions & 0 deletions .github/workflows/publish-verifiers-rl.yml
@@ -0,0 +1,77 @@
name: Publish verifiers-rl

on:
  workflow_dispatch:
    inputs:
      tag:
        description: 'Existing tag to release (e.g. verifiers-rl-v0.1.0)'
        required: true
        type: string
  push:
    tags:
      - "verifiers-rl-v*"

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout tagged release (dispatch)
        if: github.event_name == 'workflow_dispatch'
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          ref: refs/tags/${{ inputs.tag }}

      - name: Checkout tagged release (push)
        if: github.event_name != 'workflow_dispatch'
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Resolve release tag
        id: release
        env:
          EVENT_NAME: ${{ github.event_name }}
          PUSHED_REF: ${{ github.ref_name }}
          INPUT_TAG: ${{ github.event_name == 'workflow_dispatch' && inputs.tag || '' }}
        run: |
          if [ "$EVENT_NAME" = "workflow_dispatch" ]; then
            TAG="$INPUT_TAG"
          else
            TAG="$PUSHED_REF"
          fi

          case "$TAG" in
            verifiers-rl-v*) ;;
            *)
              echo "Release tags must be prefixed with 'verifiers-rl-v' (received '$TAG')" >&2
              exit 1
              ;;
          esac

          VERSION="${TAG#verifiers-rl-v}"
          FILE_VERSION=$(python - <<'PY'
          import tomllib
          from pathlib import Path
          with Path('packages/verifiers-rl/pyproject.toml').open('rb') as f:
              data = tomllib.load(f)
          print(data['project']['version'])
          PY
          )

          if [ "$FILE_VERSION" != "$VERSION" ]; then
            echo "Version mismatch: tag requests '$VERSION' but packages/verifiers-rl/pyproject.toml defines '$FILE_VERSION'" >&2
            exit 1
          fi

          echo "tag=$TAG" >> "$GITHUB_OUTPUT"

      - uses: astral-sh/setup-uv@v6

      - name: Build verifiers-rl
        run: uv build packages/verifiers-rl

      - name: Publish to PyPI
        env:
          PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
        run: uv publish --token "$PYPI_TOKEN" packages/verifiers-rl/dist/*
2 changes: 1 addition & 1 deletion .github/workflows/style.yml
@@ -39,6 +39,6 @@ jobs:
with:
version: "latest"
- name: Install dependencies
run: uv sync --extra rl
run: uv sync
- name: Run ty
run: uv run ty check verifiers
6 changes: 3 additions & 3 deletions docs/training.md
@@ -108,15 +108,15 @@ This will launch a tmux session with separate panes for the trainer, orchestrato

If you want to hack on new training algorithms and are less concerned with maximum performance or advanced features, you can use the included `RLTrainer` (via `vf-rl`), whose core files are under 1000 lines of code and include only the most essential logic for fairly performant async off-policy training (with a core algorithm similar to `prime-rl`).

The included `RLTrainer` is a minimal, hackable training loop based on `transformers.Trainer` that supports both full-parameter finetuning and LoRA training. `RLTrainer` can be viewed as a "baby" `prime-rl` that adopts a similar default training recipe (async CISPO with one-step off-policy overlap), intended for single-node test runs with dense models. The primary files (`trainer.py` and `orchestrator.py`, located in `verifiers/rl/trainer/`) are under 1000 lines of code, and are designed to be a convenient starting point for writing your own training loop.
The included `RLTrainer` is a minimal, hackable training loop based on `transformers.Trainer` that supports both full-parameter finetuning and LoRA training. `RLTrainer` can be viewed as a "baby" `prime-rl` that adopts a similar default training recipe (async CISPO with one-step off-policy overlap), intended for single-node test runs with dense models. The primary files (`trainer.py` and `orchestrator.py`, located in `packages/verifiers-rl/verifiers_rl/rl/trainer/`) are under 1000 lines of code, and are designed to be a convenient starting point for writing your own training loop.

The feature set is intentionally kept minimal and focused. Users seeking maximum performance, MoE support, multi-node training, multidimensional parallelism, and other advanced features should use the `prime-rl` trainer.

### Setup and Configuration

To use `vf.RLTrainer` in your own project, install with RL extras:
To use `vf.RLTrainer` in your own project, install the optional RL package:
```bash
uv add 'verifiers[rl]'
uv add verifiers-rl
```

Then, use the `vf-setup` script to download example configuration files for `vf.RLTrainer` into your workspace:
18 changes: 18 additions & 0 deletions packages/verifiers-rl/README.md
@@ -0,0 +1,18 @@
# verifiers-rl

Optional RL trainer package for `verifiers`.

Install:

```bash
uv add verifiers-rl
```

This package provides:

- `vf-rl`
- `vf-train`
- `vf-vllm`
- `verifiers_rl.rl` (RLTrainer implementation)

`verifiers` core remains usable without this package.
37 changes: 37 additions & 0 deletions packages/verifiers-rl/pyproject.toml
@@ -0,0 +1,37 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "verifiers-rl"
version = "0.1.0"
description = "Optional RL trainer package for verifiers"
readme = "README.md"
requires-python = ">=3.10,<3.14"
dependencies = [
"verifiers",
"torch>=2.8.0,<2.9.0",
"transformers>=4.56.2",
"accelerate>=1.4.0",
"requests",
"peft",
"wandb",
"vllm>=0.10.0,<0.11.0",
"liger-kernel>=0.5.10",
"deepspeed>=0.17.6",
"flash-attn>=2.8.3",
]

[tool.uv.extra-build-dependencies]
flash-attn = [{ requirement = "torch", match-runtime = true }]

[tool.uv.extra-build-variables]
flash-attn = { FLASH_ATTENTION_SKIP_CUDA_BUILD = "TRUE" }

[project.scripts]
vf-rl = "verifiers_rl.scripts.rl:main"
vf-train = "verifiers_rl.scripts.train:main"
vf-vllm = "verifiers_rl.rl.inference.server:main"

[tool.hatch.build.targets.wheel]
packages = ["verifiers_rl"]
21 changes: 21 additions & 0 deletions packages/verifiers-rl/verifiers_rl/__init__.py
@@ -0,0 +1,21 @@
from verifiers_rl.rl.trainer import (  # noqa: F401
    GRPOConfig,
    GRPOTrainer,
    RLConfig,
    RLTrainer,
    get_model,
    get_model_and_tokenizer,
    grpo_defaults,
    lora_defaults,
)

__all__ = [
    "get_model",
    "get_model_and_tokenizer",
    "RLConfig",
    "RLTrainer",
    "GRPOTrainer",
    "GRPOConfig",
    "grpo_defaults",
    "lora_defaults",
]
108 changes: 108 additions & 0 deletions packages/verifiers-rl/verifiers_rl/rl/README.md
@@ -0,0 +1,108 @@
## `RLTrainer`

`RLTrainer` is the included RL trainer for `verifiers` environments, built on top of `transformers`, `accelerate`, and `vllm`, and supports both full-parameter finetuning and LoRA training. It is primarily intended for small-scale test runs on a single node with dense models. Users seeking maximum performance, MoE support, multi-node training, multidimensional parallelism, and other advanced features should use the external `prime-rl` trainer; `RLTrainer` can be viewed as a "baby" `prime-rl` that adopts a similar default training recipe (async CISPO with one-step off-policy overlap), and is a good starting point for beginners.

### Installation

Install the package:

```bash
uv add verifiers-rl
```

Install from GitHub main:

```bash
uv add 'verifiers-rl @ git+https://github.com/PrimeIntellect-ai/verifiers.git@main#subdirectory=packages/verifiers-rl'
```

If you already have the repository set up for development, sync dependencies with:

```bash
uv sync
```

### TOML configuration files

`vf-rl` consumes a single TOML file that defines the model, environment, vLLM (inference) process, and trainer.

- Required keys:
  - `model` (string)
  - `[env].id` (string; environment slug)
  - `[inference].gpus` (int; number of GPUs for vLLM)
  - `[trainer].gpus` (int; number of GPUs for training)
- Optional `*.args` tables forward keyword arguments to their respective CLIs:
  - `[inference.args]` → forwarded to `vf-vllm` (keys converted to `--kebab-case` flags)
  - `[trainer.args]` → mapped to `RLConfig` (see Configuration below)
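The key-to-flag conversion for `[inference.args]` can be pictured with a small sketch. This is illustrative only, not the actual `vf-rl` implementation; in particular, the treatment of booleans as bare flags is an assumption:

```python
def args_to_flags(args: dict) -> list[str]:
    """Sketch: turn a TOML args table into CLI-style flags."""
    flags = []
    for key, value in args.items():
        flag = "--" + key.replace("_", "-")  # snake_case -> --kebab-case
        if isinstance(value, bool):
            if value:
                flags.append(flag)  # e.g. enforce_eager = true -> --enforce-eager
            # false booleans are simply dropped in this sketch
        else:
            flags.extend([flag, str(value)])
    return flags
```

Under this sketch, `{"enforce_eager": True, "max_model_len": 4096}` would yield `["--enforce-eager", "--max-model-len", "4096"]` (the `max_model_len` key here is only an example).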

Minimal example:

```toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "kalomaze/alphabet-sort"

[inference]
gpus = 1

[inference.args]
enforce_eager = true

[trainer]
gpus = 1

[trainer.args]
run_name = "alphabet-sort"
use_lora = true
learning_rate = 1e-5
micro_batch_size = 4
rollouts_per_example = 16
batch_size = 512
max_steps = 100
max_tokens = 512
max_seq_len = 2048
```

See more examples under `configs/rl/` (e.g., `reverse-text.toml`, `alphabet-sort.toml`).

### Running with `vf-rl`

`vf-rl` creates a tmux session with two panes: top runs `vf-vllm` (inference server), bottom runs `vf-train` (trainer). GPU assignment is contiguous: inference uses the first `inference.gpus` devices, trainer uses the next `trainer.gpus` devices.
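The contiguous assignment can be sketched as follows (illustrative only; how the launcher actually communicates device ids to each pane is not shown here):

```python
def assign_gpus(inference_gpus: int, trainer_gpus: int) -> tuple[str, str]:
    """Sketch: inference takes the first device ids, the trainer the next block."""
    inference = range(0, inference_gpus)
    trainer = range(inference_gpus, inference_gpus + trainer_gpus)
    fmt = lambda ids: ",".join(str(i) for i in ids)
    return fmt(inference), fmt(trainer)
```

With `inference.gpus = 2` and `trainer.gpus = 2`, this yields `("0,1", "2,3")`: devices 0-1 serve vLLM and devices 2-3 run the trainer.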

Usage:

```bash
uv run vf-rl @ configs/rl/config.toml -s session-name
```

- `-s/--session`: tmux session name (default: `vf-rl`)
- Requires `tmux` in `PATH`

### Configuration

We have removed a number of features from the previous `GRPOTrainer` in favor of a more streamlined, opinionated, and hackable training recipe. The primary parameters most users will want to configure are:
- LoRA configuration arguments:
  - `use_lora`: whether to use LoRA training (default is `True`)
  - `lora_rank`: the rank of the LoRA modules (default is `16`)
  - `lora_alpha`: the alpha of the LoRA modules (default is `16`)
  - `lora_dropout`: the dropout of the LoRA modules (default is `0.0`)
  - `lora_target_modules`: the target modules for LoRA (default is `["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]`)
  - `lora_modules_to_save`: modules to train with full-parameter finetuning instead of LoRA (default is `None`)
  - `lora_use_rslora`: whether to use RSLoRA (default is `False`)
- Training configuration arguments:
  - `learning_rate`: the learning rate (default is `1e-5`)
  - `micro_batch_size`: rollouts per GPU per gradient accumulation step (default is `8`)
  - `batch_size`: rollouts per global batch (default is `512`)
  - `rollouts_per_example`: rollouts per example/prompt (default is `16`)
  - `max_seq_len`: the maximum sequence length (default is `2048`)
  - `max_steps`: the maximum number of training steps (default is `500`)
- Sampling configuration arguments:
  - `max_tokens`: the maximum number of tokens per request (default is `None`)
  - `temperature`: the sampling temperature (default is `0.7`)
  - `top_p`: the top-p sampling value (default is `1.0`)
  - `top_k`: the top-k sampling value (default is `None`)
  - `min_p`: the min-p sampling value (default is `0.0`)
  - `repetition_penalty`: the repetition penalty (default is `1.0`)
  - `presence_penalty`: the presence penalty (default is `0.0`)
  - `frequency_penalty`: the frequency penalty (default is `0.0`)
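The batch parameters above imply a fixed number of unique prompts and gradient accumulation steps per global batch. A hedged sketch of that arithmetic, assuming `batch_size` divides evenly and that accumulation steps are simply `batch_size / (micro_batch_size * num_train_gpus)` (the trainer's exact scheduling may differ):

```python
def batch_breakdown(batch_size=512, rollouts_per_example=16,
                    micro_batch_size=8, num_train_gpus=1):
    """Sketch of the implied batch arithmetic, using the defaults above."""
    unique_prompts = batch_size // rollouts_per_example            # 512 / 16 = 32 prompts
    accumulation_steps = batch_size // (micro_batch_size * num_train_gpus)  # 512 / 8 = 64
    return unique_prompts, accumulation_steps
```

With the defaults, each optimizer step therefore consumes 32 unique prompts at 16 rollouts each, processed 8 rollouts at a time per GPU.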