pradyumna-rfai commented Feb 3, 2026

PR Summary: Unified Fit and Evals

This is a major refactoring PR that unifies the codebase for both fit (training) and evals (inference) modes. It eliminates code duplication and lets both modes share infrastructure for experiment tracking, interactive control, metric logging, and RF setup.

Changes

Major Changes

1. Unified Database Schema

  • Single experiments table for both fit and evals modes
    • Mode-specific configuration stored in JSON config column
  • Unified interactive_control table for dynamic operations
    • target_type field: 'run' (fit) or 'pipeline' (evals)
    • target_id field: holds run_id or pipeline_id
    • config_data field: holds operation-specific JSON configuration
    • Supports operations: stop, resume, delete, clone, clone_warm
  • Mode-specific tables remain separate:
    • Fit mode: runs, worker_task, controller_progress, worker_progress
    • Evals mode: pipelines, contexts, actor_tasks
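
To make the shape of the unified interactive_control table concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the schema from this PR: only target_type, target_id, config_data, and the five operation names come from the description above. The SQLite backend, the ic_id and operation column names, the column types, and the request_operation helper are all assumptions made for illustration.

```python
import json
import sqlite3

# Hypothetical schema. Only target_type, target_id, config_data and the
# operation names appear in the PR description; the rest is assumed.
SCHEMA = """
CREATE TABLE interactive_control (
    ic_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    target_type TEXT NOT NULL CHECK (target_type IN ('run', 'pipeline')),
    target_id   INTEGER NOT NULL,
    operation   TEXT NOT NULL CHECK (
        operation IN ('stop', 'resume', 'delete', 'clone', 'clone_warm')
    ),
    config_data TEXT NOT NULL DEFAULT '{}'
);
"""

# Fit mode operates on runs, evals mode operates on pipelines.
MODE_TO_TARGET_TYPE = {"fit": "run", "evals": "pipeline"}


def request_operation(conn, mode, target_id, operation, config=None):
    """Queue one interactive-control operation (hypothetical helper)."""
    cur = conn.execute(
        "INSERT INTO interactive_control "
        "(target_type, target_id, operation, config_data) "
        "VALUES (?, ?, ?, ?)",
        (MODE_TO_TARGET_TYPE[mode], target_id, operation, json.dumps(config or {})),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One request per mode; both land in the same table.
request_operation(conn, "fit", 3, "clone_warm", {"learning_rate": 1e-4})
request_operation(conn, "evals", 7, "stop")

rows = conn.execute(
    "SELECT target_type, target_id, operation "
    "FROM interactive_control ORDER BY ic_id"
).fetchall()
print(rows)
```

The point of the sketch is that a single table can serve both modes because target_type tells the consumer how to interpret target_id, while operation-specific details travel as JSON in config_data.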

2. Unified Experiment Class

  • Single entry point Experiment(name, mode="fit"|"evals") for both modes
  • Mode-specific initialization:
    • _init_fit_mode() - Sets up training infrastructure
    • _init_evals_mode() - Sets up inference infrastructure
  • Shared methods:
    • end() - Clean up resources
    • cancel_current() - Cancel current operation
    • get_log_file_path() - Get experiment logs
  • Mode-specific methods:
    • run_fit() - Execute training (fit mode only)
    • run_evals() - Execute inference (evals mode only)
    • get_results() - Get training metrics (fit mode only)
    • get_runs_info() - Get run information (fit mode only)
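
The single-entry-point pattern described above can be sketched as follows. This is an illustrative stand-in, not the Experiment class from this PR: the class name ExperimentSketch, the _require_mode helper, the resources list, and the return values are all hypothetical. Only the method names and the mode values "fit" and "evals" are taken from the description.

```python
class ExperimentSketch:
    """Illustrative stand-in for a mode-dispatching experiment class."""

    VALID_MODES = ("fit", "evals")

    def __init__(self, name, mode="fit"):
        if mode not in self.VALID_MODES:
            raise ValueError(f"mode must be one of {self.VALID_MODES}, got {mode!r}")
        self.name = name
        self.mode = mode
        self.resources = []  # placeholder for whatever each mode sets up
        # Dispatch to the mode-specific initializer.
        if mode == "fit":
            self._init_fit_mode()
        else:
            self._init_evals_mode()

    def _init_fit_mode(self):
        self.resources.append("training-infrastructure")

    def _init_evals_mode(self):
        self.resources.append("inference-infrastructure")

    def _require_mode(self, expected, method_name):
        # Hypothetical guard: reject mode-specific calls made in the wrong mode.
        if self.mode != expected:
            raise RuntimeError(
                f"{method_name}() is only available in {expected} mode"
            )

    def run_fit(self):
        self._require_mode("fit", "run_fit")
        return "fit started"

    def run_evals(self):
        self._require_mode("evals", "run_evals")
        return "evals started"

    def end(self):
        # Shared cleanup, available in both modes.
        self.resources.clear()


exp = ExperimentSketch("demo", mode="evals")
print(exp.run_evals())
exp.end()
```

Guarding mode-specific methods at call time, as in this sketch, is one way to keep a single class while still failing loudly when, for example, run_fit() is called on an evals experiment.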

3. Unified Metric Logging System

4. Unified Status Enums

5. Setup

  • Unified setup for both fit and evals modes. Removed the flags for the --init command.
  • Added a --clear command that clears all DB, log, and dashboard files.

Testing

  • Ran ChatQA lite notebook for SFT E2E with interactive control (IC) Ops - stop, clone
  • Ran DPO notebook for SFT E2E with IC Ops - stop, clone
  • Ran FIQA RAG notebook for evals E2E with IC Ops - stop, clone

Screenshots

  • Screenshot 2026-01-31 at 7 06 36 PM
  • Screenshot 2026-01-31 at 7 41 09 PM
  • Screenshot 2026-01-31 at 7 41 36 PM
  • Screenshot 2026-02-02 at 3 11 20 PM
  • Screenshot 2026-02-02 at 3 11 46 PM
  • Screenshot 2026-02-02 at 3 46 11 PM
