feat(rl): Add SFTTrainer for supervised fine-tuning #935
gspeter-max wants to merge 6 commits into PrimeIntellect-ai:main from
Conversation
ROOT CAUSE:
- Issue PrimeIntellect-ai#752: Users need SFT support similar to RLTrainer
- Currently must use external trl library with different interface
- No consistent API between SFT and RL training workflows

CHANGES:
1. Added SFTConfig class to verifiers_rl/rl/trainer/config.py
   - Extends TrainingArguments (same pattern as RLConfig)
   - 40 configuration fields including model loading, LoRA, batch args, training params
   - Optional vLLM integration for sample generation (disabled by default)
   - Auto-sets output_dir from run_name
   - Validates batch size divisibility
2. Added SFTTrainer class to verifiers_rl/rl/trainer/trainer.py
   - Extends Trainer (same base as RLTrainer)
   - Implements simple cross-entropy loss (not PPO)
   - No orchestrator (static dataset, no async rollouts)
   - Optional vLLM sampling for monitoring sample quality
   - Same logging/metrics patterns as RLTrainer
3. Updated exports in multiple __init__.py files
   - packages/verifiers-rl/verifiers_rl/rl/trainer/__init__.py
   - packages/verifiers-rl/verifiers_rl/__init__.py
   - verifiers/__init__.py (lazy imports, TYPE_CHECKING, rl_names)
4. Created SFT training script (verifiers_rl/scripts/sft.py)
   - CLI interface following pattern of train.py
   - TOML configuration support
   - Usage: vf-sft @ path/to/config.toml
5. Added vf-sft CLI entry point to pyproject.toml
6. Created example config (configs/local/vf-sft/example-sft.toml)
7. Created comprehensive test suite (tests/test_sft_trainer.py)

IMPACT:
- Provides consistent API between SFT and RL training
- Enables easy SFT → RL workflows
- Same configuration, logging, and model loading patterns
- Optional vLLM integration for monitoring

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/__init__.py
- packages/verifiers-rl/verifiers_rl/__init__.py
- verifiers/__init__.py
- packages/verifiers-rl/verifiers_rl/scripts/sft.py
- packages/verifiers-rl/pyproject.toml
- configs/local/vf-sft/example-sft.toml (created)
- tests/test_sft_trainer.py (created)

Refs: PrimeIntellect-ai#752
Removes:
- configs/local/vf-sft/example-sft.toml (example config)
- tests/test_sft_trainer.py (test suite)

These were created for development/testing purposes.
ROOT CAUSE:
- Bugbot identified 3 issues in the SFTTrainer implementation
- Missing gradient_accumulation_steps calculation makes batch_size ineffective
- Log method prints empty panels every step when use_vllm=False
- No documentation added for new SFTTrainer/SFTConfig classes

CHANGES:
1. Fixed gradient_accumulation_steps calculation in SFTConfig
   - Added calculation: gradient_accumulation_steps = batch_size // (micro_batch_size * num_processes)
   - This ensures the effective batch size matches the user-configured batch_size
2. Fixed log method to check for empty textual logs
   - Added check: if len(textual_logs["prompt"]) > 0
   - Only prints samples and logs to W&B when samples are available
   - Prevents empty Rich panels and empty W&B tables
3. Added documentation for SFTTrainer and SFTConfig
   - Updated docs/training.md with SFT section and usage examples
   - Updated docs/reference.md with SFTConfig and SFTTrainer API docs
   - Updated skills/train-with-environments/SKILL.md with SFT → RL workflow

IMPACT:
- batch_size parameter now works correctly
- No more empty console output or W&B tables
- Users can understand and use the SFTTrainer API

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py
- docs/training.md
- docs/reference.md
- skills/train-with-environments/SKILL.md

Refs: PrimeIntellect-ai#935
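The batch-size bookkeeping described in change 1 can be sketched standalone. Names follow the commit message (`batch_size`, `micro_batch_size`, `num_processes`); this is an illustration, not the merged code:

```python
def grad_accum_steps(batch_size: int, micro_batch_size: int, num_processes: int) -> int:
    """Number of micro-batches to accumulate so the effective batch size
    (micro_batch_size * num_processes * accumulation) equals batch_size."""
    if batch_size % (micro_batch_size * num_processes) != 0:
        raise ValueError("batch_size must be divisible by micro_batch_size * num_processes")
    return batch_size // (micro_batch_size * num_processes)

# With the defaults quoted later in this PR (batch_size=512, micro_batch_size=8, 1 GPU):
print(grad_accum_steps(512, 8, 1))  # 64
```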
ROOT CAUSE: Three separate bugs identified by Cursor Bugbot in PR PrimeIntellect-ai#935:
1. LoRA config parameters (dropout, modules_to_save, use_rslora) defined but not passed to LoraConfig
2. CLI script hardcoded dataset split to "train", ignoring config value
3. SFTTrainer.log_metrics override prevented parent's formatted metric summary

CHANGES:
- Added lora_dropout, modules_to_save, use_rslora to LoraConfig in RLConfig (line 302)
- Added lora_dropout, modules_to_save, use_rslora to LoraConfig in SFTConfig (line 544)
- Changed sft.py to use dataset_split from config instead of hardcoded "train"
- Removed SFTTrainer.log_metrics override to restore parent's formatted output

IMPACT:
- Users can now configure LoRA dropout, modules to save, and RSLoRA
- CLI script respects the dataset_split configuration value
- Training completion now shows a formatted metric summary

TECHNICAL NOTES:
- Both RLConfig and SFTConfig had the same LoRA bug (both fixed)
- RLTrainer.log_metrics override remains, as it is actually used (called from training_step)
- SFTTrainer.log_metrics override was never called, so removing it is safe

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/scripts/sft.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
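The LoRA fix amounts to forwarding three previously-dropped fields into the `LoraConfig` constructor. A minimal sketch of the kwargs assembly (`lora_dropout`, `modules_to_save`, and `use_rslora` are real peft `LoraConfig` parameters; the helper and the sample values are hypothetical):

```python
def build_lora_kwargs(cfg: dict) -> dict:
    """Collect LoRA settings, including the three fields the original
    implementation defined but never passed through to LoraConfig."""
    return {
        "r": cfg["lora_r"],
        "lora_alpha": cfg["lora_alpha"],
        "lora_dropout": cfg["lora_dropout"],        # previously dropped
        "modules_to_save": cfg["modules_to_save"],  # previously dropped
        "use_rslora": cfg["use_rslora"],            # previously dropped
    }

kwargs = build_lora_kwargs({
    "lora_r": 16, "lora_alpha": 32, "lora_dropout": 0.05,
    "modules_to_save": ["lm_head"], "use_rslora": True,
})
```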
ROOT CAUSE: Three new bugs identified by Cursor Bugbot after previous fixes:
1. Output directory becomes the literal "outputs/None" when both output_dir and run_name are None
2. vLLM API call uses an empty string for the model name, causing API errors
3. Sample generation runs on all distributed processes instead of only the main process

CHANGES:
- Added a check for None run_name in SFTConfig.__post_init__ to prevent the "outputs/None" path
- Stored model_name as self.model_name in SFTTrainer.__init__
- Changed the vLLM API call to use self.model_name instead of an empty string
- Added a process_index == 0 check so sample generation only runs on the main process

IMPACT:
- Output directory is now properly set to "outputs" when run_name is None
- vLLM API calls now include the correct model name
- Sample generation no longer runs on all distributed processes, avoiding duplicate API calls

TECHNICAL NOTES:
- vLLM's OpenAI-compatible API requires a model name parameter
- Distributed training should only generate samples on rank 0 to avoid duplicate work
- Default behavior now: outputs to the "outputs/" directory when no run_name is specified

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ROOT CAUSE: Three critical bugs identified by Cursor Bugbot:
1. world_size accessed before super().__post_init__(), causing wrong gradient accumulation in multi-GPU setups
2. Fallback labels include pad tokens, causing the model to train on padding
3. Repeated asyncio.run() in a loop breaks the async client after the first sample

CHANGES:
- Moved super().__post_init__() before the gradient accumulation calculation in SFTConfig
- Added pad-token masking with -100 in compute_loss when labels are missing
- Refactored vLLM sampling to use a single asyncio.run() with an async generator function
- Created _generate_samples_async() method to handle all async operations in one event loop

IMPACT:
- Multi-GPU training now has correct gradient accumulation and effective batch size
- Model no longer trains on pad tokens, improving training quality
- vLLM sampling now works correctly for multiple samples instead of failing after the first

TECHNICAL NOTES:
- world_size is initialized during TrainingArguments.__post_init__(), which must be called before access
- HuggingFace ignores -100 labels in loss calculation, not pad_token_id
- asyncio.run() creates a new event loop on each call; reusing an AsyncOpenAI client requires a single loop

FILES MODIFIED:
- packages/verifiers-rl/verifiers_rl/rl/trainer/config.py
- packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
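The asyncio refactor follows a general pattern: run one event loop for the whole sampling pass instead of one per sample. A self-contained sketch (the `fetch` coroutine stands in for the real vLLM request; all names here are illustrative):

```python
import asyncio

async def fetch(prompt: str) -> str:
    # Placeholder for the real async vLLM/OpenAI call; a shared async
    # client is only safe if every call runs in the same event loop.
    await asyncio.sleep(0)
    return f"completion:{prompt}"

async def generate_samples(prompts: list[str]) -> list[str]:
    # All requests share one event loop, so a reused AsyncOpenAI-style
    # client never sees its loop torn down between samples.
    return list(await asyncio.gather(*(fetch(p) for p in prompts)))

# One asyncio.run() for the whole batch, instead of one per sample:
samples = asyncio.run(generate_samples(["a", "b", "c"]))
print(samples)  # ['completion:a', 'completion:b', 'completion:c']
```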
Cursor Bugbot has reviewed your changes and found 2 potential issues.
```python
# Optional: Generate samples with vLLM (only on main process)
if self.use_vllm and self.vllm_client and self.process_index == 0:
    if self.state.global_step % self.vllm_sample_every_n_steps == 0:
        self._generate_and_log_samples()
```
vLLM sampling triggers multiple times per global step
High Severity
The training_step method is called once per micro-batch by HuggingFace's Trainer, but self.state.global_step only increments after gradient_accumulation_steps micro-batches. The check self.state.global_step % self.vllm_sample_every_n_steps == 0 evaluates to True for every micro-batch in that gradient accumulation window. With default settings (batch_size=512, micro_batch_size=8, 1 GPU), gradient_accumulation_steps=64, so _generate_and_log_samples() fires 64 times per trigger — each calling blocking asyncio.run() with vLLM API requests and logging duplicate samples.
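One way to avoid the repeated trigger (a suggested pattern, not code from this PR) is to remember the last global step that produced samples, so the check passes at most once per optimizer step even though `training_step` runs once per micro-batch:

```python
class SampleGate:
    """Fires at most once per qualifying global step, even when the caller
    (e.g. training_step) is invoked once per micro-batch."""

    def __init__(self, every_n_steps: int):
        self.every_n_steps = every_n_steps
        self._last_fired_step = None

    def should_sample(self, global_step: int) -> bool:
        if global_step % self.every_n_steps != 0:
            return False
        if global_step == self._last_fired_step:
            return False  # same accumulation window, already sampled
        self._last_fired_step = global_step
        return True

gate = SampleGate(every_n_steps=10)
# 64 micro-batches at global_step=0 -> only the first one samples
fires = sum(gate.should_sample(0) for _ in range(64))
print(fires)  # 1
```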
```python
# Calculate gradient accumulation steps to achieve effective batch_size
# Must be AFTER super().__post_init__() so world_size is properly initialized
num_processes = self.world_size
self.gradient_accumulation_steps = self.batch_size // (self.micro_batch_size * num_processes)
```
Gradient accumulation steps set after parent initialization
Medium Severity
gradient_accumulation_steps is computed after super().__post_init__() to access world_size, but TrainingArguments.__post_init__() uses gradient_accumulation_steps internally for setup (e.g., DeepSpeed config, logging effective batch size). This means the parent class initializes using the default value of 1 rather than the correct computed value, which can cause misconfiguration in distributed setups or when DeepSpeed is enabled.
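A possible way out of this ordering conflict (an assumption, not the merged code) is to derive the world size from the `WORLD_SIZE` environment variable that launchers like torchrun export, so `gradient_accumulation_steps` is final before `super().__post_init__()` consumes it:

```python
import os

def grad_accum_from_env(batch_size: int, micro_batch_size: int) -> int:
    # torchrun exports WORLD_SIZE; default to 1 for single-process runs.
    # Reading it directly avoids depending on TrainingArguments.world_size,
    # which only exists after the parent __post_init__ has run.
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    return batch_size // (micro_batch_size * world_size)

os.environ["WORLD_SIZE"] = "4"  # simulate a 4-GPU launch
print(grad_accum_from_env(512, 8))  # 16
```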


Summary
Implements #752: Adds `SFTTrainer` and `SFTConfig` classes to provide a consistent API for supervised fine-tuning, matching the pattern of the existing `RLTrainer`.

Changes
Core Implementation
- `SFTConfig`: extends `TrainingArguments` with 40 fields for model loading, LoRA, batch parameters, training parameters, and optional vLLM integration
- `SFTTrainer`: extends `Trainer` with simple cross-entropy loss (no PPO), optional vLLM sampling, and the same logging/metrics patterns as `RLTrainer`

Key Features
- CLI usage: `vf-sft @ path/to/config.toml`

Files Modified
- `packages/verifiers-rl/verifiers_rl/rl/trainer/config.py` - Added SFTConfig
- `packages/verifiers-rl/verifiers_rl/rl/trainer/trainer.py` - Added SFTTrainer
- `packages/verifiers-rl/verifiers_rl/rl/trainer/__init__.py` - Added exports
- `packages/verifiers-rl/verifiers_rl/__init__.py` - Added exports
- `verifiers/__init__.py` - Added lazy imports
- `packages/verifiers-rl/verifiers_rl/scripts/sft.py` - Created SFT script
- `packages/verifiers-rl/pyproject.toml` - Added vf-sft entry point

Usage Example
Or via CLI:
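From the PR description, the entry point is invoked with a TOML config via the `@` convention:

```shell
vf-sft @ path/to/config.toml
```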
Example TOML config:
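A hypothetical config sketch: the field names below mirror options discussed in this PR (`run_name`, `batch_size`, `micro_batch_size`, the LoRA fields, `use_vllm`), but the exact schema and values of the merged `example-sft.toml` may differ.

```toml
# Illustrative only; field names inferred from the PR discussion.
model_name = "your-org/your-model"  # placeholder, not from the PR
run_name = "example-sft"            # output_dir is auto-set from run_name

batch_size = 512        # effective batch size
micro_batch_size = 8    # per-device micro-batch
use_vllm = false        # optional vLLM sampling is disabled by default

lora_r = 16
lora_alpha = 32
lora_dropout = 0.05
use_rslora = false
```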
Design Philosophy
Closes #752
Note
Medium Risk
Introduces new training-path code and a new CLI entrypoint, plus touches shared config/LoRA initialization; issues would mainly affect training runs and logging rather than core runtime behavior.
Overview
Adds supervised fine-tuning support alongside the existing RL trainer by introducing `SFTConfig` (a `TrainingArguments`-based config that auto-computes `gradient_accumulation_steps` from `batch_size` / `micro_batch_size` and reuses the LoRA setup) and `SFTTrainer` (a `Trainer` subclass using standard cross-entropy loss, with optional vLLM-based sample generation and logging).

Exposes the new APIs via `verifiers_rl` and `verifiers` lazy imports, adds a new `vf-sft` TOML-driven CLI script, and updates documentation (`docs/reference.md`, `docs/training.md`, and the training skill guide) with usage examples and an SFT → RL workflow.

Written by Cursor Bugbot for commit 2475bf7. This will update automatically on new commits.