feat(rl): Add SFTTrainer for supervised fine-tuning #935

Open

gspeter-max wants to merge 6 commits into PrimeIntellect-ai:main from gspeter-max:verifiersContribution/issue_#752

Conversation


@gspeter-max gspeter-max commented Feb 19, 2026

Summary

Implements #752: Adds SFTTrainer and SFTConfig classes to provide a consistent API for supervised fine-tuning, matching the pattern of the existing RLTrainer.

Changes

Core Implementation

  • SFTConfig: Configuration class extending TrainingArguments with 40 fields for model loading, LoRA, batch parameters, training parameters, and optional vLLM integration
  • SFTTrainer: Training class extending Trainer with simple cross-entropy loss (no PPO), optional vLLM sampling, and same logging/metrics patterns as RLTrainer

Key Features

  • ✅ Consistent API with RLTrainer (same config structure, same initialization)
  • ✅ Reuses existing utilities (model loading, logging, LoRA setup)
  • ✅ Optional vLLM integration for sample monitoring (not required)
  • ✅ Enables easy SFT → RL workflows
  • ✅ CLI interface: vf-sft @ path/to/config.toml

Files Modified

  • packages/verifiers-rl/verifiers_rl/rl/trainer/config.py - Added SFTConfig
  • packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py - Added SFTTrainer
  • packages/verifiers-rl/verifiers_rl/rl/trainer/__init__.py - Added exports
  • packages/verifiers-rl/verifiers_rl/__init__.py - Added exports
  • verifiers/__init__.py - Added lazy imports
  • packages/verifiers-rl/verifiers_rl/scripts/sft.py - Created SFT script
  • packages/verifiers-rl/pyproject.toml - Added vf-sft entry point

Usage Example

import verifiers as vf
from datasets import load_dataset

# Load dataset
dataset = load_dataset("willcb/V3-wordle", split="train")

# Create config
config = vf.SFTConfig(
    run_name="wordle-sft",
    max_steps=500,
    learning_rate=2e-5,
    batch_size=512,
    micro_batch_size=8,
)

# Create trainer
trainer = vf.SFTTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    train_dataset=dataset,
    args=config,
)

# Train
trainer.train()

Or via CLI:

vf-sft @ path/to/config.toml

Example TOML config:

model = "Qwen/Qwen3-4B-Instruct"
dataset = "willcb/V3-wordle"

[sft]
run_name = "wordle-sft"
max_steps = 500
learning_rate = 2e-5
batch_size = 512
micro_batch_size = 8
use_lora = true
lora_rank = 8

Design Philosophy

  • Similar Structure: SFTTrainer follows the same class structure as RLTrainer
  • Simpler Internals: SFT is inherently simpler (no rollouts, no advantages, no orchestrator)
  • Optional vLLM: For monitoring sample quality during training (not required)
  • Reuses Existing Utilities: Leverages all existing model loading, logging, and utilities

Closes #752


Note

Medium Risk
Introduces new training-path code and a new CLI entrypoint, plus touches shared config/LoRA initialization; issues would mainly affect training runs and logging rather than core runtime behavior.

Overview
Adds supervised fine-tuning support alongside the existing RL trainer by introducing SFTConfig (a TrainingArguments-based config that auto-computes gradient_accumulation_steps from batch_size/micro_batch_size and reuses the LoRA setup) and SFTTrainer (a Trainer subclass using standard cross-entropy loss, with optional vLLM-based sample generation and logging).

Exposes the new APIs via verifiers_rl and verifiers lazy imports, adds a new vf-sft TOML-driven CLI script, and updates documentation (docs/reference.md, docs/training.md, and the training skill guide) with usage examples and an SFT → RL workflow.

Written by Cursor Bugbot for commit 2475bf7. This will update automatically on new commits.

ROOT CAUSE:
- Issue PrimeIntellect-ai#752: Users need SFT support similar to RLTrainer
- Currently must use external trl library with different interface
- No consistent API between SFT and RL training workflows

CHANGES:
1. Added SFTConfig class to verifiers_rl/rl/trainer/config.py
   - Extends TrainingArguments (same pattern as RLConfig)
   - 40 configuration fields including model loading, LoRA, batch args, training params
   - Optional vLLM integration for sample generation (disabled by default)
   - Auto-sets output_dir from run_name
   - Validates batch size divisibility

2. Added SFTTrainer class to verifiers_rl/rl/trainer/trainer.py
   - Extends Trainer (same base as RLTrainer)
   - Implements simple cross-entropy loss (not PPO)
   - No orchestrator (static dataset, no async rollouts)
   - Optional vLLM sampling for monitoring sample quality
   - Same logging/metrics patterns as RLTrainer

3. Updated exports in multiple __init__.py files
   - packages/verifiers-rl/verifiers_rl/rl/trainer/__init__.py
   - packages/verifiers-rl/verifiers_rl/__init__.py
   - verifiers/__init__.py (lazy imports, TYPE_CHECKING, rl_names)

4. Created SFT training script (verifiers_rl/scripts/sft.py)
   - CLI interface following pattern of train.py
   - TOML configuration support
   - Usage: vf-sft @ path/to/config.toml

5. Added vf-sft CLI entry point to pyproject.toml

6. Created example config (configs/local/vf-sft/example-sft.toml)

7. Created comprehensive test suite (tests/test_sft_trainer.py)

IMPACT:
- Provides consistent API between SFT and RL training
- Enables easy SFT → RL workflows
- Same configuration, logging, and model loading patterns
- Optional vLLM integration for monitoring

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/__init__.py
- packages/verifiers-rl/verifiers_rl/__init__.py
- verifiers/__init__.py
- packages/verifiers-rl/verifiers_rl/scripts/sft.py
- packages/verifiers-rl/pyproject.toml
- configs/local/vf-sft/example-sft.toml (created)
- tests/test_sft_trainer.py (created)

Refs: PrimeIntellect-ai#752
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.

Removes:
- configs/local/vf-sft/example-sft.toml (example config)
- tests/test_sft_trainer.py (test suite)

These were created for development/testing purposes.
ROOT CAUSE:
- Bugbot identified 3 issues in the SFTTrainer implementation
- Missing gradient_accumulation_steps calculation makes batch_size ineffective
- Log method prints empty panels every step when use_vllm=False
- No documentation added for new SFTTrainer/SFTConfig classes

CHANGES:
1. Fixed gradient_accumulation_steps calculation in SFTConfig
   - Added calculation: gradient_accumulation_steps = batch_size // (micro_batch_size * num_processes)
   - This ensures effective batch size matches user-configured batch_size

2. Fixed log method to check for empty textual logs
   - Added check: if len(textual_logs["prompt"]) > 0
   - Only prints samples and logs to W&B when samples are available
   - Prevents empty Rich panels and empty W&B tables

3. Added documentation for SFTTrainer and SFTConfig
   - Updated docs/training.md with SFT section and usage examples
   - Updated docs/reference.md with SFTConfig and SFTTrainer API docs
   - Updated skills/train-with-environments/SKILL.md with SFT → RL workflow
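The gradient accumulation fix in item 1 amounts to one integer division; a standalone sketch:

```python
def grad_accum_steps(batch_size: int, micro_batch_size: int, num_processes: int) -> int:
    # effective batch = micro_batch_size * num_processes * gradient_accumulation_steps,
    # so solving for the accumulation steps recovers the user-configured batch_size
    return batch_size // (micro_batch_size * num_processes)

print(grad_accum_steps(512, 8, 1))  # 64
```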

IMPACT:
- batch_size parameter now works correctly
- No more empty console output or W&B tables
- Users can understand and use SFTTrainer API

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py
- docs/training.md
- docs/reference.md
- skills/train-with-environments/SKILL.md

Refs: PrimeIntellect-ai#935
ROOT CAUSE:
Three separate bugs identified by Cursor Bugbot in PR PrimeIntellect-ai#935:
1. LoRA config parameters (dropout, modules_to_save, use_rslora) defined but not passed to LoraConfig
2. CLI script hardcoded dataset split to "train", ignoring config value
3. SFTTrainer.log_metrics override prevented parent's formatted metric summary

CHANGES:
- Added lora_dropout, modules_to_save, use_rslora to LoraConfig in RLConfig (line 302)
- Added lora_dropout, modules_to_save, use_rslora to LoraConfig in SFTConfig (line 544)
- Changed sft.py to use dataset_split from config instead of hardcoded "train"
- Removed SFTTrainer.log_metrics override to restore parent's formatted output
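The essence of the LoRA fix is that every configured field must actually be forwarded to peft's LoraConfig. A hypothetical helper (build_lora_kwargs is not part of the codebase) illustrating the previously-dropped fields:

```python
def build_lora_kwargs(cfg: dict) -> dict:
    # lora_dropout, modules_to_save, and use_rslora were defined on the config
    # but silently dropped; forwarding them makes the settings take effect.
    return {
        "r": cfg["lora_rank"],
        "lora_dropout": cfg["lora_dropout"],
        "modules_to_save": cfg["modules_to_save"],
        "use_rslora": cfg["use_rslora"],
    }

kwargs = build_lora_kwargs(
    {"lora_rank": 8, "lora_dropout": 0.05, "modules_to_save": None, "use_rslora": False}
)
```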

IMPACT:
- Users can now configure LoRA dropout, modules to save, and RSLoRA
- CLI script respects dataset_split configuration value
- Training completion now shows formatted metric summary

TECHNICAL NOTES:
- Both RLConfig and SFTConfig had the same LoRA bug (both fixed)
- RLTrainer.log_metrics override remains as it's actually used (called from training_step)
- SFTTrainer.log_metrics override was never called, so removing it is safe

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/scripts/sft.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ROOT CAUSE:
Three new bugs identified by Cursor Bugbot after previous fixes:
1. Output directory becomes literal "outputs/None" when both output_dir and run_name are None
2. vLLM API call uses empty string for model name, causing API errors
3. Sample generation runs on all distributed processes instead of only main process

CHANGES:
- Added check for None run_name in SFTConfig.__post_init__ to prevent "outputs/None" path
- Stored model_name as self.model_name in SFTTrainer.__init__
- Changed vLLM API call to use self.model_name instead of empty string
- Added process_index == 0 check to ensure sample generation only runs on main process

IMPACT:
- Output directory is now properly set to "outputs" when run_name is None
- vLLM API calls now include the correct model name
- Sample generation no longer runs on all distributed processes, avoiding duplicate API calls

TECHNICAL NOTES:
- vLLM's OpenAI-compatible API requires a model name parameter
- Distributed training should only generate samples on rank 0 to avoid duplicate work
- Default behavior now: outputs to "outputs/" directory when no run_name specified

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ROOT CAUSE:
Three critical bugs identified by Cursor Bugbot:
1. world_size accessed before super().__post_init__(), causing wrong gradient accumulation in multi-GPU
2. Fallback labels include pad tokens, causing model to train on padding
3. Repeated asyncio.run() in loop breaks async client after first sample

CHANGES:
- Moved super().__post_init__() before gradient accumulation calculation in SFTConfig
- Added pad token masking with -100 in compute_loss when labels are missing
- Refactored vLLM sampling to use single asyncio.run() with async generator function
- Created _generate_samples_async() method to handle all async operations in one event loop
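The pad-token masking can be sketched with plain lists (the real code operates on tensors); HuggingFace's loss ignores the label value -100, not pad_token_id:

```python
IGNORE_INDEX = -100  # the label value HuggingFace's cross-entropy loss skips

def build_labels(input_ids: list[int], pad_token_id: int) -> list[int]:
    # Fallback labels when the dataset provides none: copy the inputs,
    # then mask padding so the model never trains on pad tokens
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]

print(build_labels([5, 6, 7, 0, 0], pad_token_id=0))  # [5, 6, 7, -100, -100]
```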

IMPACT:
- Multi-GPU training now has correct gradient accumulation and effective batch size
- Model no longer trains on pad tokens, improving training quality
- vLLM sampling now works correctly for multiple samples instead of failing after first

TECHNICAL NOTES:
- world_size is initialized during TrainingArguments.__post_init__(), must be called before access
- HuggingFace ignores -100 labels in loss calculation, not pad_token_id
- asyncio.run() creates new event loop each call; reusing AsyncOpenAI client requires single loop
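The async refactor can be sketched as follows, with fake_generate standing in for the actual AsyncOpenAI call: all requests run under a single asyncio.run(), so the client's event loop is never torn down between samples:

```python
import asyncio

async def fake_generate(prompt: str) -> str:
    # Stand-in for an AsyncOpenAI chat-completion request to the vLLM server
    await asyncio.sleep(0)
    return f"sample for {prompt!r}"

async def generate_samples(prompts: list[str]) -> list[str]:
    # One event loop for all requests; repeated asyncio.run() calls would
    # create a fresh loop each time and break a reusable async client
    return await asyncio.gather(*(fake_generate(p) for p in prompts))

samples = asyncio.run(generate_samples(["guess 1", "guess 2"]))
print(samples)
```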

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@cursor cursor bot left a comment
Cursor Bugbot has reviewed your changes and found 2 potential issues.


# Optional: Generate samples with vLLM (only on main process)
if self.use_vllm and self.vllm_client and self.process_index == 0:
    if self.state.global_step % self.vllm_sample_every_n_steps == 0:
        self._generate_and_log_samples()

vLLM sampling triggers multiple times per global step

High Severity

The training_step method is called once per micro-batch by HuggingFace's Trainer, but self.state.global_step only increments after gradient_accumulation_steps micro-batches. The check self.state.global_step % self.vllm_sample_every_n_steps == 0 evaluates to True for every micro-batch in that gradient accumulation window. With default settings (batch_size=512, micro_batch_size=8, 1 GPU), gradient_accumulation_steps=64, so _generate_and_log_samples() fires 64 times per trigger — each calling blocking asyncio.run() with vLLM API requests and logging duplicate samples.
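One possible guard (an illustration, not necessarily the fix the maintainers will choose) is to remember the last global step at which samples were generated, so repeated training_step calls within one accumulation window fire at most once:

```python
class SampleScheduler:
    # Hypothetical helper: dedupes sampling triggers within one
    # gradient-accumulation window, where global_step stays constant
    def __init__(self, every_n_steps: int):
        self.every_n_steps = every_n_steps
        self._last_sampled_step = -1

    def should_sample(self, global_step: int) -> bool:
        if global_step % self.every_n_steps != 0:
            return False
        if global_step == self._last_sampled_step:
            return False  # already sampled during this accumulation window
        self._last_sampled_step = global_step
        return True

sched = SampleScheduler(every_n_steps=10)
print([sched.should_sample(s) for s in (0, 0, 5, 10, 10)])  # [True, False, False, True, False]
```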


# Calculate gradient accumulation steps to achieve effective batch_size
# Must be AFTER super().__post_init__() so world_size is properly initialized
num_processes = self.world_size
self.gradient_accumulation_steps = self.batch_size // (self.micro_batch_size * num_processes)

Gradient accumulation steps set after parent initialization

Medium Severity

gradient_accumulation_steps is computed after super().__post_init__() to access world_size, but TrainingArguments.__post_init__() uses gradient_accumulation_steps internally for setup (e.g., DeepSpeed config, logging effective batch size). This means the parent class initializes using the default value of 1 rather than the correct computed value, which can cause misconfiguration in distributed setups or when DeepSpeed is enabled.
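One way out of this ordering conflict (an assumption on my part, not a fix present in the PR) is to read the launcher-provided WORLD_SIZE environment variable, which torchrun and accelerate export before the process starts, so the accumulation steps can be computed before super().__post_init__() consumes them:

```python
import os

def world_size_from_env() -> int:
    # torchrun/accelerate set WORLD_SIZE in the environment at launch time,
    # so it is readable before TrainingArguments.__post_init__() runs
    return int(os.environ.get("WORLD_SIZE", "1"))
```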


Development

Successfully merging this pull request may close these issues.

SFT Support similar to RLTrainer
