Feature/fit on ray #165

pradyumna-rfai · 2026-02-03T00:24:08Z

PR Summary: Unified Fit and Evals

This is a major refactoring PR that unifies the codebase for both fit (training) and evals (inference) modes. It eliminates code duplication and enables shared infrastructure for experiment tracking, interactive control, metric logging, and RF setup.

Changes

Major Changes

1. Unified Database Schema

Single experiments table for both fit and evals modes
- Mode-specific configuration stored in JSON config column
Unified interactive_control table for dynamic operations
- target_type field: 'run' (fit) or 'pipeline' (evals)
- target_id field: holds run_id or pipeline_id
- config_data field: holds operation-specific JSON configuration
- Supports operations: stop, resume, delete, clone, clone_warm
Mode-specific tables remain separate:
- Fit mode: runs, worker_task, controller_progress, worker_progress
- Evals mode: pipelines, contexts, actor_tasks
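
For reviewers, the shape of the unified interactive_control table can be sketched as below. This is a minimal illustration, not the PR's actual DDL: only target_type, target_id, config_data and the five operation names come from the description above. The SQLite backend, the ic_id and operation column names, the integer type of target_id, and the request_operation helper are all assumptions.

```python
import json
import sqlite3

# Hypothetical DDL. Only target_type, target_id, config_data and the operation
# names are taken from the PR description; every other detail is assumed.
SCHEMA = """
CREATE TABLE interactive_control (
    ic_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    target_type TEXT NOT NULL CHECK (target_type IN ('run', 'pipeline')),
    target_id   INTEGER NOT NULL,
    operation   TEXT NOT NULL
                CHECK (operation IN ('stop', 'resume', 'delete', 'clone', 'clone_warm')),
    config_data TEXT NOT NULL DEFAULT '{}'
);
"""


def request_operation(conn, target_type, target_id, operation, config=None):
    """Queue one interactive-control operation and return its row id."""
    cur = conn.execute(
        "INSERT INTO interactive_control "
        "(target_type, target_id, operation, config_data) VALUES (?, ?, ?, ?)",
        (target_type, target_id, operation, json.dumps(config or {})),
    )
    conn.commit()
    return cur.lastrowid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Fit mode targets a run; evals mode targets a pipeline.
stop_id = request_operation(conn, "run", 7, "stop")
clone_id = request_operation(conn, "pipeline", 3, "clone", {"batch_size": 16})
```

With one table keyed by target_type, a single dispatcher loop can serve both modes, and the CHECK constraints reject rows that belong to neither.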

2. Unified Experiment Class

Single entry point Experiment(name, mode="fit"|"evals") for both modes
Mode-specific initialization:
- _init_fit_mode() - Sets up training infrastructure
- _init_evals_mode() - Sets up inference infrastructure
Shared methods:
- end() - Clean up resources
- cancel_current() - Cancel current operation
- get_log_file_path() - Get experiment logs
Mode-specific methods:
- run_fit() - Execute training (fit mode only)
- run_evals() - Execute inference (evals mode only)
- get_results() - Get training metrics (fit mode only)
- get_runs_info() - Get run information (fit mode only)
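
The mode dispatch described above could look roughly like the following. Treat it as a hedged sketch rather than the class in this PR: the method names match the list, but the bodies are placeholders, the signatures are simplified (the real run_fit() and run_evals() presumably take arguments), and the _require_mode helper and initialized attribute are hypothetical.

```python
class Experiment:
    """Hypothetical skeleton of a mode-aware entry point; bodies are placeholders."""

    VALID_MODES = ("fit", "evals")

    def __init__(self, name, mode="fit"):
        if mode not in self.VALID_MODES:
            raise ValueError(f"mode must be one of {self.VALID_MODES}, got {mode!r}")
        self.name = name
        self.mode = mode
        self.initialized = []  # records which setup path ran (illustration only)
        if mode == "fit":
            self._init_fit_mode()
        else:
            self._init_evals_mode()

    def _init_fit_mode(self):
        # Placeholder for setting up training infrastructure.
        self.initialized.append("fit")

    def _init_evals_mode(self):
        # Placeholder for setting up inference infrastructure.
        self.initialized.append("evals")

    def _require_mode(self, expected, method):
        # Assumed guard: mode-specific methods fail fast in the wrong mode.
        if self.mode != expected:
            raise RuntimeError(f"{method}() is only available in {expected} mode")

    def run_fit(self):
        self._require_mode("fit", "run_fit")
        return "fit started"

    def run_evals(self):
        self._require_mode("evals", "run_evals")
        return "evals started"

    def end(self):
        # Shared cleanup for both modes.
        self.initialized.clear()


fit_exp = Experiment("demo-fit", mode="fit")
evals_exp = Experiment("demo-evals", mode="evals")
```

Failing fast in the wrong mode keeps the single entry point from silently running a code path whose infrastructure was never initialized.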

3. Unified Metric Logging System

4. Unified Status Enums

5. Setup

Unified setup for both fit and evals modes. Removed the flags for the --init command.
Added a --clear command that clears all DB, log, and dashboard files.

Testing

Ran ChatQA lite notebook for SFT E2E with IC Ops - stop, clone
Ran DPO notebook for SFT E2E with IC Ops - stop, clone
Ran FIQA RAG notebook for evals E2E with IC Ops - stop, clone

Screenshots

pradyumna-rfai added 16 commits February 3, 2026 00:21

formatting changes

4854130

added ray to fit

0eaa0a2

unification of fit and evals

a2fd712

unified setup

b1a404d

minor changes to notebooks

50325ab

Removed old dirs, created metrics, platform dirs

d5f686b

bugfixes to check logging and IC Ops syntax

b1e8cac

Fixed auth issue, converted SHM model_registry to Ray

ac14993

small bugfix related to progress bar stderr

d5a92f0

added clean command to CLI

9952b08

fixed transformers version mismatch

e78c81f

fixed bug in Dispatcher stop, delete and get_run

2725231

fixed is_experiment_running bug and task_id error

06ded15

Fixed evals import issue.

a9cb65c

Fixed broken evals metrics logging

1fccdb3

Fixed IC Ops + Dispatcher issues for Evals mode

db3ef4f

pradyumna-rfai requested review from arun-rfai, david-rfai and humaira-rf February 3, 2026 00:24

pradyumna-rfai self-assigned this Feb 3, 2026
