Multi-GPU FSDP Support #175

humaira-rf · 2026-02-07T02:39:55Z

Summary

Integrated multi-GPU training via Fully Sharded Data Parallel (FSDP), enabling large models to be distributed across multiple GPUs.
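To give a rough sense of why sharding helps, the sketch below estimates per-GPU memory for parameters, gradients, and optimizer state when all three are split evenly across GPUs. The byte counts are assumptions for illustration (fp32 parameters and gradients, plus two fp32 Adam moments per parameter); activations, buffers, and communication overhead are ignored, and the function name is hypothetical rather than part of this PR.

```python
def per_gpu_state_gib(
    num_params: float,
    num_gpus: int,
    bytes_per_param: int = 4,        # assumption: fp32 weights
    bytes_per_grad: int = 4,         # assumption: fp32 gradients
    bytes_optimizer_per_param: int = 8,  # assumption: two fp32 Adam moments
) -> float:
    """Approximate per-GPU memory (GiB) for params, grads, and optimizer
    state when all three are sharded evenly across num_gpus."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    bytes_per_parameter = bytes_per_param + bytes_per_grad + bytes_optimizer_per_param
    total_bytes = num_params * bytes_per_parameter
    return total_bytes / num_gpus / 2**30


if __name__ == "__main__":
    # Illustrative 7B-parameter model on 1, 2, 4, and 8 GPUs.
    for n in (1, 2, 4, 8):
        print(f"{n} GPU(s): {per_gpu_state_gib(7e9, n):.1f} GiB per GPU")
```

Under these assumptions the per-GPU footprint of the sharded state falls in proportion to the number of GPUs, which is what lets a model that does not fit on one device train on several.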

Key changes:

Added FSDP support for sharding model parameters, gradients, and optimizer states across GPUs
Extended RFModelConfig and experiment.run_fit APIs to accept a num_gpus argument
Updated scheduler logic to allocate runs across multiple GPUs
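As an illustration of what allocating runs across multiple GPUs involves, here is a deliberately simplified first-fit sketch. It is not the scheduler in this PR (the commit history indicates that one uses Monte Carlo simulations and fair round robin); the function name and data shapes are hypothetical, and only the idea of a per-run `num_gpus` requirement comes from the description above.

```python
from typing import Dict, List


def allocate_gpus(free_gpus: List[int], pending: Dict[str, int]) -> Dict[str, List[int]]:
    """Greedy first-fit: walk pending runs in insertion order and give each
    one the lowest-numbered free GPUs if enough remain. Runs that do not
    fit are left out of the result and would wait for a later pass."""
    available = sorted(free_gpus)
    assignment: Dict[str, List[int]] = {}
    for run_id, num_gpus in pending.items():
        if num_gpus < 1:
            raise ValueError(f"run {run_id!r} must request at least one GPU")
        if num_gpus <= len(available):
            assignment[run_id] = available[:num_gpus]
            available = available[num_gpus:]
    return assignment


if __name__ == "__main__":
    # Run "b" needs more GPUs than remain after "a", so it is skipped.
    print(allocate_gpus([0, 1, 2, 3], {"a": 2, "b": 4, "c": 1}))
```

A first-fit pass like this can starve large runs when small ones keep arriving, which is one reason a real scheduler would add fairness or runtime estimates on top.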

Testing:

Single-GPU (regression): ran all original lite notebooks; no regressions.
Multi-GPU FSDP:
1. Ran the lite, standard, and large FSDP notebooks on downsampled data.
2. Validated multiple FSDP configurations (full shard, CPU offload, gradient checkpointing) and multiple config leaves.
3. Tested all ICOps (Interactive Control operations): resume, clone, delete, and clone (warm-start).
4. Verified that training plots and metric logging render correctly.
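The PR does not list the exact combinations that were tested. As a sketch of how such a configuration matrix could be enumerated, the snippet below crosses GPU counts with the CPU offload and gradient checkpointing toggles under full sharding. All key names and the GPU counts are hypothetical and chosen only for illustration.

```python
from itertools import product
from typing import Dict, List, Sequence


def fsdp_test_matrix(gpu_counts: Sequence[int] = (2, 4)) -> List[Dict[str, object]]:
    """Enumerate illustrative FSDP test configurations.
    Key names are hypothetical, not taken from the project's API."""
    matrix: List[Dict[str, object]] = []
    for num_gpus, cpu_offload, grad_ckpt in product(gpu_counts, (False, True), (False, True)):
        matrix.append(
            {
                "num_gpus": num_gpus,
                "sharding": "full_shard",
                "cpu_offload": cpu_offload,
                "gradient_checkpointing": grad_ckpt,
            }
        )
    return matrix


if __name__ == "__main__":
    for config in fsdp_test_matrix():
        print(config)
```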

arun-rfai · 2026-02-10T21:45:22Z

rapidfireai/fit/backend/controller.py

+                self.db.set_ic_ops_task_status(
+                    run_state["task_id"], TaskStatus.COMPLETED
+                )
+                self.db.set_ic_ops_task_status(


Aren't lines 300-302 just repeating 297-299?

arun-rfai · 2026-02-10T21:46:00Z

rapidfireai/fit/backend/controller.py

+                self.db.set_ic_ops_task_status(
+                    run_state["task_id"], TaskStatus.COMPLETED
+                )
+                self.db.set_ic_ops_task_status(


3 LOC repeated again

arun-rfai · 2026-02-10T21:46:08Z

rapidfireai/fit/backend/controller.py

+                self.db.set_ic_ops_task_status(
+                    run_state["task_id"], TaskStatus.COMPLETED
+                )
+                self.db.set_ic_ops_task_status(


3 LOC repeated again

arun-rfai · 2026-02-10T21:46:16Z

rapidfireai/fit/backend/controller.py

+                self.logger.warning(
+                    f"Run {run_id} is already completed. Skipping Interactive Control task."
+                )
+                self.logger.warning(


3 LOC repeated again

arun-rfai · 2026-02-10T21:46:28Z

rapidfireai/fit/backend/controller.py

+                config_leaf["additional_kwargs"] = parent_run_details["config_leaf"][
+                    "additional_kwargs"
+                ]
+                config_leaf["additional_kwargs"] = parent_run_details["config_leaf"][


3 LOC repeated again

arun-rfai · 2026-02-10T21:47:05Z

rapidfireai/fit/backend/controller.py

+                raise ControllerException(
+                    f"Error creating model for run {parent_run_id}: {e}"
+                ) from e
+                self.ic_logger.error(


6 LOC repeated again

arun-rfai · 2026-02-10T21:47:21Z

rapidfireai/fit/backend/controller.py

+                    self.ic_logger.warning(
+                        f"Ignoring RESUME/STOP task for run {run_id} as it is already completed"
+                    )
+                    self.ic_logger.warning(


3 LOC repeated again

arun-rfai

Went through all code changes and new notebooks. Looks good to me except for one minor issue in controller.py: several blocks of code are duplicated in places. I have marked the locations with comments. Please resolve those before merging.
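Duplications like the ones flagged above (a block of a few lines immediately followed by an identical copy) can be caught mechanically. The helper below is a minimal sketch, not part of the repository; its name and the default block sizes are assumptions. It compares lines with indentation stripped and reports where each repeated copy starts.

```python
from typing import List, Tuple


def find_adjacent_repeats(
    lines: List[str], min_block: int = 2, max_block: int = 6
) -> List[Tuple[int, int]]:
    """Find blocks of min_block..max_block lines that are immediately
    followed by an identical block. Returns (start_line, block_len) pairs,
    where start_line is the 1-indexed first line of the repeated copy."""
    stripped = [ln.strip() for ln in lines]  # ignore indentation differences
    n = len(stripped)
    hits: List[Tuple[int, int]] = []
    i = 0
    while i < n:
        matched = False
        # Prefer the longest repeated block starting at line i.
        for k in range(max_block, min_block - 1, -1):
            if i + 2 * k > n:
                continue
            first = stripped[i : i + k]
            # Skip blocks that are entirely blank lines.
            if any(first) and first == stripped[i + k : i + 2 * k]:
                hits.append((i + k + 1, k))
                i += k  # continue from the copy so triple repeats are caught
                matched = True
                break
        if not matched:
            i += 1
    return hits
```

For a file on disk, something like `find_adjacent_repeats(open(path).read().splitlines())` would list candidate locations to inspect by hand; legitimately repeated code will also be reported, so the output is a starting point for review and not a verdict.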

pradyumna-rfai and others added 30 commits September 26, 2025 14:17

Scheduler: update to new scheduler that uses Monte Carlo simulations, update test cases

8815089

Scheduler: updated scemantics to schedule runs by min chunks-visited, renamed terms

445c08f

Db: updated db to include req_workers, estimated_runtime in runs table

572476d

AutoML: update linting

68ebf26

AutoML: add num_gpus in model_config

073a4ac

Db: add set_estimated_runtime func

f7889a1

Scheduler: expose monto carlo simulations as a param in experiment run_fit

603076a

Db: add multi_worker_details as a field in Worker task table

9f6989d

Controller, Worker: update run_fit logic for multi-node training in Controller and Worker

85539d1

Scheduler: modify scheduler to be fair round robin with Monte Carlo

1e1c948

fsdp initial changes

80feeed

full model fixes

2bfb92c

fixed gpu ids

05d198e

notebook params

2e704ce

notebooks for qlora

86fc486

error handling

391be75

Revert "error handling"

5ec02b3

This reverts commit ea1e3b8.

full model changes

f14be2f

Updated warm_started_from to warm_started (bool)

86b76c0

Controller: minor fixes from rebase

74369ef

Scheduler: restored scheduler from before rebase

ef97bf3

Scheduler: removed start_chunk_id from scheduler

389e2a1

Scheduler: updated minor comment

ba3f756

Worker: fixed runtime code, minor updates

8d806fa

Misc: dist_utils formatting changes

78094e6

Scheduler: minor changes to scheduler, added tests

ba52417

fsdp chnages: optimizer fixes, warm start bug fix

c63b307

corrected eff batch sizze, added suppression of warnings

1d1d0fa

notebooks updation

9b15503

Organized tutorial notebooks into subdirs

0281584

humaira-rf and others added 14 commits October 3, 2025 04:32

num_gpus correction, notebookupdate, vllm changes

6bb1ef0

Controller: Fixed clone modify race condition

809d5e2

temp changes to multi-gpu

5f383d2

Merge remote-tracking branch 'origin/feature/multi-gpu-scheduler' into feature/multi-gpu

d50f614

experiment, controller - merge fixes

048e62e

more merge fixes

d0eb447

scheduler fixes for single gpu

fda4e54

sft notebook updates

189638c

lite notebooks added

6743fde

working notebooks sft lite, normal

b8a2bec

fsdp notebook updated

f72e235

evaluation changes, num_gpus fix, icops-warm clone and delete

ca6b2cd

final saving checkpoint to disk, llama 70b changes

8e1554e

trl version reverted

2c22ee2

humaira-rf changed the title from "Feature/multi gpu" to "Multi-GPU FSDP Support" Feb 7, 2026

fixed linter errors

34ad039

humaira-rf requested a review from arun-rfai February 7, 2026 02:49

humaira-rf added 2 commits February 7, 2026 02:58

llama 70b num chunks increased

50da616

notebooks updated

19d0135

arun-rfai reviewed Feb 10, 2026

arun-rfai approved these changes Feb 11, 2026

3 participants
