Merged
77 changes: 77 additions & 0 deletions .github/workflows/publish-verifiers-rl.yml
@@ -0,0 +1,77 @@
name: Publish verifiers-rl

on:
  workflow_dispatch:
    inputs:
      tag:
        description: 'Existing tag to release (e.g. verifiers-rl-v0.1.0)'
        required: true
        type: string
  push:
    tags:
      - "verifiers-rl-v*"

jobs:
  publish:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout tagged release (dispatch)
        if: github.event_name == 'workflow_dispatch'
        uses: actions/checkout@v4
        with:
          fetch-depth: 0
          ref: refs/tags/${{ inputs.tag }}

      - name: Checkout tagged release (push)
        if: github.event_name != 'workflow_dispatch'
        uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Resolve release tag
        id: release
        env:
          EVENT_NAME: ${{ github.event_name }}
          PUSHED_REF: ${{ github.ref_name }}
          INPUT_TAG: ${{ github.event_name == 'workflow_dispatch' && inputs.tag || '' }}
        run: |
          if [ "$EVENT_NAME" = "workflow_dispatch" ]; then
            TAG="$INPUT_TAG"
          else
            TAG="$PUSHED_REF"
          fi

          case "$TAG" in
            verifiers-rl-v*) ;;
            *)
              echo "Release tags must be prefixed with 'verifiers-rl-v' (received '$TAG')" >&2
              exit 1
              ;;
          esac

          VERSION="${TAG#verifiers-rl-v}"
          FILE_VERSION=$(python - <<'PY'
          import tomllib
          from pathlib import Path
          with Path('packages/verifiers-rl/pyproject.toml').open('rb') as f:
              data = tomllib.load(f)
          print(data['project']['version'])
          PY
          )

          if [ "$FILE_VERSION" != "$VERSION" ]; then
            echo "Version mismatch: tag requests '$VERSION' but packages/verifiers-rl/pyproject.toml defines '$FILE_VERSION'" >&2
            exit 1
          fi

          echo "tag=$TAG" >> "$GITHUB_OUTPUT"

      - uses: astral-sh/setup-uv@v6

      - name: Build verifiers-rl
        run: uv build packages/verifiers-rl

      - name: Publish to PyPI
        env:
          PYPI_TOKEN: ${{ secrets.PYPI_TOKEN }}
        run: uv publish --token "$PYPI_TOKEN" packages/verifiers-rl/dist/*
2 changes: 1 addition & 1 deletion .github/workflows/style.yml
@@ -39,6 +39,6 @@ jobs:
with:
version: "latest"
- name: Install dependencies
run: uv sync --extra rl
run: uv sync
- name: Run ty
run: uv run ty check verifiers
6 changes: 3 additions & 3 deletions docs/training.md
@@ -108,15 +108,15 @@ This will launch a tmux session with separate panes for the trainer, orchestrato

If you want to hack on new training algorithms and are less concerned with maximum performance or advanced features, you can use the included `RLTrainer` (via `vf-rl`), whose core files are under 1000 lines of code and include only the most essential logic for fairly performant async off-policy training (with a core algorithm similar to `prime-rl`).

The included `RLTrainer` is a minimal, hackable training loop based on `transformers.Trainer` that supports both full-parameter finetuning and LoRA training. `RLTrainer` can be viewed as a "baby" `prime-rl` that adopts a similar default training recipe (async CISPO with one-step off-policy overlap), intended for single-node test runs with dense models. The primary files (`trainer.py` and `orchestrator.py`, located in `verifiers/rl/trainer/`) are under 1000 lines of code, and are designed to be a convenient starting point for writing your own training loop.
The included `RLTrainer` is a minimal, hackable training loop based on `transformers.Trainer` that supports both full-parameter finetuning and LoRA training. `RLTrainer` can be viewed as a "baby" `prime-rl` that adopts a similar default training recipe (async CISPO with one-step off-policy overlap), intended for single-node test runs with dense models. The primary files (`trainer.py` and `orchestrator.py`, located in `packages/verifiers-rl/verifiers_rl/rl/trainer/`) are under 1000 lines of code, and are designed to be a convenient starting point for writing your own training loop.

The feature set is intentionally kept minimal and focused. Users seeking maximum performance, MoE support, multi-node training, multidimensional parallelism, and other advanced features should use the `prime-rl` trainer.

### Setup and Configuration

To use `vf.RLTrainer` in your own project, install with RL extras:
To use `vf.RLTrainer` in your own project, install the optional RL package:
```bash
uv add 'verifiers[rl]'
uv add verifiers-rl
```

Then, use the `vf-setup` script to download example configuration files for `vf.RLTrainer` into your workspace:
18 changes: 18 additions & 0 deletions packages/verifiers-rl/README.md
@@ -0,0 +1,18 @@
# verifiers-rl

Optional RL trainer package for `verifiers`.

Install:

```bash
uv add verifiers-rl
```

This package provides:

- `vf-rl`
- `vf-train`
- `vf-vllm`
- `verifiers_rl.rl` (RLTrainer implementation)

`verifiers` core remains usable without this package.
37 changes: 37 additions & 0 deletions packages/verifiers-rl/pyproject.toml
@@ -0,0 +1,37 @@
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[project]
name = "verifiers-rl"
version = "0.1.0"
description = "Optional RL trainer package for verifiers"
readme = "README.md"
requires-python = ">=3.10,<3.14"
dependencies = [
"verifiers",
"torch>=2.8.0,<2.9.0",
"transformers>=4.56.2",
"accelerate>=1.4.0",
"requests",
"peft",
"wandb",
"vllm>=0.10.0,<0.11.0",
"liger-kernel>=0.5.10",
"deepspeed>=0.17.6",
"flash-attn>=2.8.3",
]

[tool.uv.extra-build-dependencies]
flash-attn = [{ requirement = "torch", match-runtime = true }]

[tool.uv.extra-build-variables]
flash-attn = { FLASH_ATTENTION_SKIP_CUDA_BUILD = "TRUE" }

[project.scripts]
vf-rl = "verifiers_rl.scripts.rl:main"
vf-train = "verifiers_rl.scripts.train:main"
vf-vllm = "verifiers_rl.rl.inference.server:main"

[tool.hatch.build.targets.wheel]
packages = ["verifiers_rl"]
21 changes: 21 additions & 0 deletions packages/verifiers-rl/verifiers_rl/__init__.py
@@ -0,0 +1,21 @@
from verifiers_rl.rl.trainer import (  # noqa: F401
    GRPOConfig,
    GRPOTrainer,
    RLConfig,
    RLTrainer,
    get_model,
    get_model_and_tokenizer,
    grpo_defaults,
    lora_defaults,
)

__all__ = [
    "get_model",
    "get_model_and_tokenizer",
    "RLConfig",
    "RLTrainer",
    "GRPOTrainer",
    "GRPOConfig",
    "grpo_defaults",
    "lora_defaults",
]
108 changes: 108 additions & 0 deletions packages/verifiers-rl/verifiers_rl/rl/README.md
@@ -0,0 +1,108 @@
## `RLTrainer`

`RLTrainer` is the included RL trainer for `verifiers` environments, built on top of `transformers`, `accelerate`, and `vllm`, and supports both full-parameter finetuning and LoRA training. It is primarily intended for small-scale test runs on a single node with dense models. Users seeking maximum performance, MoE support, multi-node training, multidimensional parallelism, and other advanced features should use the external `prime-rl` trainer; `RLTrainer` can be viewed as a "baby" `prime-rl` that adopts a similar default training recipe (async CISPO with one-step off-policy overlap), and is a good starting point for beginners.

### Installation

Install the package:

```bash
uv add verifiers-rl
```

Install from GitHub main:

```bash
uv add 'verifiers-rl @ git+https://github.com/PrimeIntellect-ai/verifiers.git@main#subdirectory=packages/verifiers-rl'
```

If you already have the repository set up for development, sync dependencies with:

```bash
uv sync
```

### TOML configuration files

`vf-rl` consumes a single TOML file that defines the model, environment, vLLM (inference) process, and trainer.

- Required keys:
  - `model` (string)
  - `[env].id` (string; environment slug)
  - `[inference].gpus` (int; number of GPUs for vLLM)
  - `[trainer].gpus` (int; number of GPUs for training)
- Optional `*.args` tables forward keyword arguments to their respective CLIs:
  - `[inference.args]` → forwarded to `vf-vllm` (keys converted to `--kebab-case` flags)
  - `[trainer.args]` → mapped to `RLConfig` (see Configuration below)
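The key-to-flag conversion for `[inference.args]` can be pictured with a small sketch. This is illustrative only, not the actual `vf-rl` implementation; in particular, the treatment of booleans as bare flags is an assumption:

```python
def args_to_flags(args: dict) -> list[str]:
    """Sketch: turn a TOML args table into CLI-style flags."""
    flags = []
    for key, value in args.items():
        flag = "--" + key.replace("_", "-")  # snake_case -> --kebab-case
        if isinstance(value, bool):
            if value:
                flags.append(flag)  # e.g. enforce_eager = true -> --enforce-eager
            # false booleans are simply dropped in this sketch
        else:
            flags.extend([flag, str(value)])
    return flags
```

Under this sketch, `{"enforce_eager": True, "max_model_len": 4096}` would yield `["--enforce-eager", "--max-model-len", "4096"]` (the `max_model_len` key here is only an example).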

Minimal example:

```toml
model = "Qwen/Qwen3-4B-Instruct-2507"

[env]
id = "kalomaze/alphabet-sort"

[inference]
gpus = 1

[inference.args]
enforce_eager = true

[trainer]
gpus = 1

[trainer.args]
run_name = "alphabet-sort"
use_lora = true
learning_rate = 1e-5
micro_batch_size = 4
rollouts_per_example = 16
batch_size = 512
max_steps = 100
max_tokens = 512
max_seq_len = 2048
```

See more examples under `configs/rl/` (e.g., `reverse-text.toml`, `alphabet-sort.toml`).

### Running with `vf-rl`

`vf-rl` creates a tmux session with two panes: top runs `vf-vllm` (inference server), bottom runs `vf-train` (trainer). GPU assignment is contiguous: inference uses the first `inference.gpus` devices, trainer uses the next `trainer.gpus` devices.
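The contiguous assignment can be sketched as follows (illustrative only; how the launcher actually communicates device ids to each pane is not shown here):

```python
def assign_gpus(inference_gpus: int, trainer_gpus: int) -> tuple[str, str]:
    """Sketch: inference takes the first device ids, the trainer the next block."""
    inference = range(0, inference_gpus)
    trainer = range(inference_gpus, inference_gpus + trainer_gpus)
    fmt = lambda ids: ",".join(str(i) for i in ids)
    return fmt(inference), fmt(trainer)
```

With `inference.gpus = 2` and `trainer.gpus = 2`, this yields `("0,1", "2,3")`: devices 0-1 serve vLLM and devices 2-3 run the trainer.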

Usage:

```bash
uv run vf-rl @ configs/rl/config.toml -s session-name
```

- `-s/--session`: tmux session name (default: `vf-rl`)
- Requires `tmux` in `PATH`

### Configuration

We have removed a number of features from the previous `GRPOTrainer` in favor of a more streamlined, opinionated, and hackable training recipe. The primary parameters most users will want to configure are:
- LoRA configuration arguments:
  - `use_lora`: whether to use LoRA training (default is `True`)
  - `lora_rank`: the rank of the LoRA modules (default is `16`)
  - `lora_alpha`: the alpha of the LoRA modules (default is `16`)
  - `lora_dropout`: the dropout of the LoRA modules (default is `0.0`)
  - `lora_target_modules`: the target modules for LoRA (default is `["q_proj", "v_proj", "k_proj", "o_proj", "gate_proj", "down_proj", "up_proj"]`)
  - `lora_modules_to_save`: modules to train with full-parameter finetuning instead of LoRA (default is `None`)
  - `lora_use_rslora`: whether to use RSLoRA (default is `False`)
- Training configuration arguments:
  - `learning_rate`: the learning rate (default is `1e-5`)
  - `micro_batch_size`: rollouts per GPU per gradient accumulation step (default is `8`)
  - `batch_size`: rollouts per global batch (default is `512`)
  - `rollouts_per_example`: rollouts per example/prompt (default is `16`)
  - `max_seq_len`: the maximum sequence length (default is `2048`)
  - `max_steps`: the maximum number of training steps (default is `500`)
- Sampling configuration arguments:
  - `max_tokens`: the maximum number of tokens per request (default is `None`)
  - `temperature`: the sampling temperature (default is `0.7`)
  - `top_p`: the top-p sampling value (default is `1.0`)
  - `top_k`: the top-k sampling value (default is `None`)
  - `min_p`: the min-p sampling value (default is `0.0`)
  - `repetition_penalty`: the repetition penalty (default is `1.0`)
  - `presence_penalty`: the presence penalty (default is `0.0`)
  - `frequency_penalty`: the frequency penalty (default is `0.0`)
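The batch parameters above imply a fixed number of unique prompts and gradient accumulation steps per global batch. A hedged sketch of that arithmetic, assuming `batch_size` divides evenly and that accumulation steps are simply `batch_size / (micro_batch_size * num_train_gpus)` (the trainer's exact scheduling may differ):

```python
def batch_breakdown(batch_size=512, rollouts_per_example=16,
                    micro_batch_size=8, num_train_gpus=1):
    """Sketch of the implied batch arithmetic, using the defaults above."""
    unique_prompts = batch_size // rollouts_per_example            # 512 / 16 = 32 prompts
    accumulation_steps = batch_size // (micro_batch_size * num_train_gpus)  # 512 / 8 = 64
    return unique_prompts, accumulation_steps
```

With the defaults, each optimizer step therefore consumes 32 unique prompts at 16 rollouts each, processed 8 rollouts at a time per GPU.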