feat(rl): Add SFTTrainer for supervised fine-tuning #935
gspeter-max wants to merge 6 commits into PrimeIntellect-ai:main from
Conversation
ROOT CAUSE:
- Issue PrimeIntellect-ai#752: Users need SFT support similar to RLTrainer
- Currently must use external trl library with different interface
- No consistent API between SFT and RL training workflows

CHANGES:
1. Added SFTConfig class to verifiers_rl/rl/trainer/config.py
   - Extends TrainingArguments (same pattern as RLConfig)
   - 40 configuration fields including model loading, LoRA, batch args, training params
   - Optional vLLM integration for sample generation (disabled by default)
   - Auto-sets output_dir from run_name
   - Validates batch size divisibility
2. Added SFTTrainer class to verifiers_rl/rl/trainer/trainer.py
   - Extends Trainer (same base as RLTrainer)
   - Implements simple cross-entropy loss (not PPO)
   - No orchestrator (static dataset, no async rollouts)
   - Optional vLLM sampling for monitoring sample quality
   - Same logging/metrics patterns as RLTrainer
3. Updated exports in multiple __init__.py files
   - packages/verifiers-rl/verifiers_rl/rl/trainer/__init__.py
   - packages/verifiers-rl/verifiers_rl/__init__.py
   - verifiers/__init__.py (lazy imports, TYPE_CHECKING, rl_names)
4. Created SFT training script (verifiers_rl/scripts/sft.py)
   - CLI interface following pattern of train.py
   - TOML configuration support
   - Usage: vf-sft @ path/to/config.toml
5. Added vf-sft CLI entry point to pyproject.toml
6. Created example config (configs/local/vf-sft/example-sft.toml)
7. Created comprehensive test suite (tests/test_sft_trainer.py)

IMPACT:
- Provides consistent API between SFT and RL training
- Enables easy SFT → RL workflows
- Same configuration, logging, and model loading patterns
- Optional vLLM integration for monitoring

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/__init__.py
- packages/verifiers-rl/verifiers_rl/__init__.py
- verifiers/__init__.py
- packages/verifiers-rl/verifiers_rl/scripts/sft.py
- packages/verifiers-rl/pyproject.toml
- configs/local/vf-sft/example-sft.toml (created)
- tests/test_sft_trainer.py (created)

Refs: PrimeIntellect-ai#752
Removes:
- configs/local/vf-sft/example-sft.toml (example config)
- tests/test_sft_trainer.py (test suite)

These were created for development/testing purposes.
ROOT CAUSE:
- Bugbot identified 3 issues in the SFTTrainer implementation
- Missing gradient_accumulation_steps calculation makes batch_size ineffective
- Log method prints empty panels every step when use_vllm=False
- No documentation added for new SFTTrainer/SFTConfig classes

CHANGES:
1. Fixed gradient_accumulation_steps calculation in SFTConfig
   - Added calculation: gradient_accumulation_steps = batch_size // (micro_batch_size * num_processes)
   - This ensures the effective batch size matches the user-configured batch_size
2. Fixed log method to check for empty textual logs
   - Added check: if len(textual_logs["prompt"]) > 0
   - Only prints samples and logs to W&B when samples are available
   - Prevents empty Rich panels and empty W&B tables
3. Added documentation for SFTTrainer and SFTConfig
   - Updated docs/training.md with SFT section and usage examples
   - Updated docs/reference.md with SFTConfig and SFTTrainer API docs
   - Updated skills/train-with-environments/SKILL.md with SFT → RL workflow

IMPACT:
- batch_size parameter now works correctly
- No more empty console output or W&B tables
- Users can understand and use the SFTTrainer API

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py
- docs/training.md
- docs/reference.md
- skills/train-with-environments/SKILL.md

Refs: PrimeIntellect-ai#935
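The batch-size bookkeeping described in change 1 can be sketched standalone. Names follow the commit message (`batch_size`, `micro_batch_size`, `num_processes`); this is an illustration, not the merged code:

```python
def grad_accum_steps(batch_size: int, micro_batch_size: int, num_processes: int) -> int:
    """Number of micro-batches to accumulate so the effective batch size
    (micro_batch_size * num_processes * accumulation) equals batch_size."""
    if batch_size % (micro_batch_size * num_processes) != 0:
        raise ValueError("batch_size must be divisible by micro_batch_size * num_processes")
    return batch_size // (micro_batch_size * num_processes)

# With the defaults quoted later in this PR (batch_size=512, micro_batch_size=8, 1 GPU):
print(grad_accum_steps(512, 8, 1))  # 64
```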
ROOT CAUSE: Three separate bugs identified by Cursor Bugbot in PR PrimeIntellect-ai#935:
1. LoRA config parameters (dropout, modules_to_save, use_rslora) defined but not passed to LoraConfig
2. CLI script hardcoded dataset split to "train", ignoring config value
3. SFTTrainer.log_metrics override prevented parent's formatted metric summary

CHANGES:
- Added lora_dropout, modules_to_save, use_rslora to LoraConfig in RLConfig (line 302)
- Added lora_dropout, modules_to_save, use_rslora to LoraConfig in SFTConfig (line 544)
- Changed sft.py to use dataset_split from config instead of hardcoded "train"
- Removed SFTTrainer.log_metrics override to restore parent's formatted output

IMPACT:
- Users can now configure LoRA dropout, modules to save, and RSLoRA
- CLI script respects the dataset_split configuration value
- Training completion now shows a formatted metric summary

TECHNICAL NOTES:
- Both RLConfig and SFTConfig had the same LoRA bug (both fixed)
- RLTrainer.log_metrics override remains, as it is actually used (called from training_step)
- SFTTrainer.log_metrics override was never called, so removing it is safe

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/scripts/sft.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
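The LoRA fix amounts to forwarding three previously-dropped fields into the `LoraConfig` constructor. A minimal sketch of the kwargs assembly (`lora_dropout`, `modules_to_save`, and `use_rslora` are real peft `LoraConfig` parameters; the helper and the sample values are hypothetical):

```python
def build_lora_kwargs(cfg: dict) -> dict:
    """Collect LoRA settings, including the three fields the original
    implementation defined but never passed through to LoraConfig."""
    return {
        "r": cfg["lora_r"],
        "lora_alpha": cfg["lora_alpha"],
        "lora_dropout": cfg["lora_dropout"],        # previously dropped
        "modules_to_save": cfg["modules_to_save"],  # previously dropped
        "use_rslora": cfg["use_rslora"],            # previously dropped
    }

kwargs = build_lora_kwargs({
    "lora_r": 16, "lora_alpha": 32, "lora_dropout": 0.05,
    "modules_to_save": ["lm_head"], "use_rslora": True,
})
```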
ROOT CAUSE: Three new bugs identified by Cursor Bugbot after previous fixes:
1. Output directory becomes the literal "outputs/None" when both output_dir and run_name are None
2. vLLM API call uses an empty string for the model name, causing API errors
3. Sample generation runs on all distributed processes instead of only the main process

CHANGES:
- Added a check for None run_name in SFTConfig.__post_init__ to prevent the "outputs/None" path
- Stored model_name as self.model_name in SFTTrainer.__init__
- Changed the vLLM API call to use self.model_name instead of an empty string
- Added a process_index == 0 check so sample generation only runs on the main process

IMPACT:
- Output directory is now properly set to "outputs" when run_name is None
- vLLM API calls now include the correct model name
- Sample generation no longer runs on all distributed processes, avoiding duplicate API calls

TECHNICAL NOTES:
- vLLM's OpenAI-compatible API requires a model name parameter
- Distributed training should only generate samples on rank 0 to avoid duplicate work
- Default behavior now: outputs to the "outputs/" directory when no run_name is specified

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ROOT CAUSE: Three critical bugs identified by Cursor Bugbot:
1. world_size accessed before super().__post_init__(), causing wrong gradient accumulation in multi-GPU setups
2. Fallback labels include pad tokens, causing the model to train on padding
3. Repeated asyncio.run() in a loop breaks the async client after the first sample

CHANGES:
- Moved super().__post_init__() before the gradient accumulation calculation in SFTConfig
- Added pad-token masking with -100 in compute_loss when labels are missing
- Refactored vLLM sampling to use a single asyncio.run() with an async generator function
- Created _generate_samples_async() method to handle all async operations in one event loop

IMPACT:
- Multi-GPU training now has correct gradient accumulation and effective batch size
- Model no longer trains on pad tokens, improving training quality
- vLLM sampling now works correctly for multiple samples instead of failing after the first

TECHNICAL NOTES:
- world_size is initialized during TrainingArguments.__post_init__(), which must be called before access
- HuggingFace ignores -100 labels in loss calculation, not pad_token_id
- asyncio.run() creates a new event loop on each call; reusing an AsyncOpenAI client requires a single loop

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
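The asyncio refactor follows a general pattern: run one event loop for the whole sampling pass instead of one per sample. A self-contained sketch (the `fetch` coroutine stands in for the real vLLM request; all names here are illustrative):

```python
import asyncio

async def fetch(prompt: str) -> str:
    # Placeholder for the real async vLLM/OpenAI call; a shared async
    # client is only safe if every call runs in the same event loop.
    await asyncio.sleep(0)
    return f"completion:{prompt}"

async def generate_samples(prompts: list[str]) -> list[str]:
    # All requests share one event loop, so a reused AsyncOpenAI-style
    # client never sees its loop torn down between samples.
    return list(await asyncio.gather(*(fetch(p) for p in prompts)))

# One asyncio.run() for the whole batch, instead of one per sample:
samples = asyncio.run(generate_samples(["a", "b", "c"]))
print(samples)  # ['completion:a', 'completion:b', 'completion:c']
```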
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
# Optional: Generate samples with vLLM (only on main process)
if self.use_vllm and self.vllm_client and self.process_index == 0:
    if self.state.global_step % self.vllm_sample_every_n_steps == 0:
        self._generate_and_log_samples()
```
vLLM sampling triggers multiple times per global step
High Severity
The training_step method is called once per micro-batch by HuggingFace's Trainer, but self.state.global_step only increments after gradient_accumulation_steps micro-batches. The check self.state.global_step % self.vllm_sample_every_n_steps == 0 evaluates to True for every micro-batch in that gradient accumulation window. With default settings (batch_size=512, micro_batch_size=8, 1 GPU), gradient_accumulation_steps=64, so _generate_and_log_samples() fires 64 times per trigger — each calling blocking asyncio.run() with vLLM API requests and logging duplicate samples.
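One way to avoid the repeated trigger (a suggested pattern, not code from this PR) is to remember the last global step that produced samples, so the check passes at most once per optimizer step even though `training_step` runs once per micro-batch:

```python
class SampleGate:
    """Fires at most once per qualifying global step, even when the caller
    (e.g. training_step) is invoked once per micro-batch."""

    def __init__(self, every_n_steps: int):
        self.every_n_steps = every_n_steps
        self._last_fired_step = None

    def should_sample(self, global_step: int) -> bool:
        if global_step % self.every_n_steps != 0:
            return False
        if global_step == self._last_fired_step:
            return False  # same accumulation window, already sampled
        self._last_fired_step = global_step
        return True

gate = SampleGate(every_n_steps=10)
# 64 micro-batches at global_step=0 -> only the first one samples
fires = sum(gate.should_sample(0) for _ in range(64))
print(fires)  # 1
```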
```python
# Calculate gradient accumulation steps to achieve effective batch_size
# Must be AFTER super().__post_init__() so world_size is properly initialized
num_processes = self.world_size
self.gradient_accumulation_steps = self.batch_size // (self.micro_batch_size * num_processes)
```
Gradient accumulation steps set after parent initialization
Medium Severity
gradient_accumulation_steps is computed after super().__post_init__() to access world_size, but TrainingArguments.__post_init__() uses gradient_accumulation_steps internally for setup (e.g., DeepSpeed config, logging effective batch size). This means the parent class initializes using the default value of 1 rather than the correct computed value, which can cause misconfiguration in distributed setups or when DeepSpeed is enabled.
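A possible way out of this ordering conflict (an assumption, not the merged code) is to derive the world size from the `WORLD_SIZE` environment variable that launchers like torchrun export, so `gradient_accumulation_steps` is final before `super().__post_init__()` consumes it:

```python
import os

def grad_accum_from_env(batch_size: int, micro_batch_size: int) -> int:
    # torchrun exports WORLD_SIZE; default to 1 for single-process runs.
    # Reading it directly avoids depending on TrainingArguments.world_size,
    # which only exists after the parent __post_init__ has run.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return batch_size // (micro_batch_size * world_size)

os.environ["WORLD_SIZE"] = "4"  # simulate a 4-GPU launch
print(grad_accum_from_env(512, 8))  # 16
```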


Summary
Implements #752: Adds `SFTTrainer` and `SFTConfig` classes to provide a consistent API for supervised fine-tuning, matching the pattern of the existing `RLTrainer`.

Changes
Core Implementation
- `SFTConfig`: extends `TrainingArguments` with 40 fields for model loading, LoRA, batch parameters, training parameters, and optional vLLM integration
- `SFTTrainer`: extends `Trainer` with simple cross-entropy loss (no PPO), optional vLLM sampling, and the same logging/metrics patterns as `RLTrainer`

Key Features
- CLI usage: `vf-sft @ path/to/config.toml`

Files Modified
- `packages/verifiers-rl/verifiers_rl/rl/trainer/config.py` - Added SFTConfig
- `packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py` - Added SFTTrainer
- `packages/verifiers-rl/verifiers_rl/rl/trainer/__init__.py` - Added exports
- `packages/verifiers-rl/verifiers_rl/__init__.py` - Added exports
- `verifiers/__init__.py` - Added lazy imports
- `packages/verifiers-rl/verifiers_rl/scripts/sft.py` - Created SFT script
- `packages/verifiers-rl/pyproject.toml` - Added vf-sft entry point

Usage Example
Or via CLI:
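From the PR description, the entry point is invoked with a TOML config via the `@` convention:

```shell
vf-sft @ path/to/config.toml
```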
Example TOML config:
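A hypothetical config sketch: the field names below mirror options discussed in this PR (`run_name`, `batch_size`, `micro_batch_size`, the LoRA fields, `use_vllm`), but the exact schema and values of the merged `example-sft.toml` may differ.

```toml
# Illustrative only; field names inferred from the PR discussion.
model_name = "your-org/your-model"  # placeholder, not from the PR
run_name = "example-sft"            # output_dir is auto-set from run_name

batch_size = 512        # effective batch size
micro_batch_size = 8    # per-device micro-batch
use_vllm = false        # optional vLLM sampling is disabled by default

lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
use_rslora = false
```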
Design Philosophy
Closes #752
Note
Medium Risk
Introduces new training-path code and a new CLI entrypoint, plus touches shared config/LoRA initialization; issues would mainly affect training runs and logging rather than core runtime behavior.
Overview
Adds supervised fine-tuning support alongside the existing RL trainer by introducing `SFTConfig` (a `TrainingArguments`-based config that auto-computes `gradient_accumulation_steps` from `batch_size` / `micro_batch_size` and reuses the LoRA setup) and `SFTTrainer` (a `Trainer` subclass using standard cross-entropy loss, with optional vLLM-based sample generation and logging).

Exposes the new APIs via `verifiers_rl` and `verifiers` lazy imports, adds a new `vf-sft` TOML-driven CLI script, and updates documentation (`docs/reference.md`, `docs/training.md`, and the training skill guide) with usage examples and an SFT → RL workflow.

Written by Cursor Bugbot for commit 2475bf7. This will update automatically on new commits.