@humaira-rf humaira-rf commented Feb 7, 2026

Summary

  • Integrated multi-GPU training via Fully Sharded Data Parallelism (FSDP), enabling large models to be distributed across multiple GPUs.
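To give a rough sense of why sharding lets larger models fit, here is a back-of-envelope estimate of per-GPU model-state memory under full sharding. The byte counts are an assumption (mixed-precision training with Adam, about 16 bytes per parameter) and not a measurement from this PR; the function name is hypothetical. Activations, buffers, and the temporarily unsharded parameters gathered during forward/backward are ignored.

```python
# Back-of-envelope estimate only. Byte counts are assumptions for
# mixed-precision training with Adam, not numbers measured in this PR.
BYTES_PER_PARAM = {
    "parameters": 2,         # fp16/bf16 weights
    "gradients": 2,          # fp16/bf16 gradients
    "optimizer_states": 12,  # fp32 master weights + Adam momentum + variance
}


def per_gpu_model_state_gib(num_params: int, num_gpus: int) -> float:
    """Model-state memory per GPU (GiB) when all three states are sharded evenly."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    total_bytes = num_params * sum(BYTES_PER_PARAM.values())
    return total_bytes / num_gpus / 2**30


if __name__ == "__main__":
    # Hypothetical 7B-parameter model, for illustration.
    for n in (1, 2, 4, 8):
        print(f"{n} GPU(s): {per_gpu_model_state_gib(7_000_000_000, n):.1f} GiB")
```

Under these assumptions the model-state footprint per GPU shrinks linearly with the number of GPUs; real usage will be higher once activations and communication buffers are included.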

Key changes:

  • Added FSDP support for sharding model parameters, gradients, and optimizer states across GPUs
  • Extended RFModelConfig and experiment.run_fit APIs to accept a num_gpus argument
  • Updated scheduler logic to allocate runs across multiple GPUs
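The scheduler change is not shown in this conversation, so the following is only a minimal sketch of one way runs with a per-run num_gpus request could be placed on free GPUs. PendingRun and allocate_runs are hypothetical names, and the first-fit policy is an assumption, not necessarily what the PR implements.

```python
from dataclasses import dataclass


@dataclass
class PendingRun:
    run_id: int
    num_gpus: int  # GPUs requested by this run (hypothetical field)


def allocate_runs(pending, free_gpus):
    """First-fit allocation: walk pending runs in order and give each one
    the lowest-numbered free GPUs if enough are available.

    Returns (assignments, remaining_free), where assignments maps
    run_id -> list of GPU ids. Runs that do not fit are left unassigned.
    """
    free = sorted(free_gpus)
    assignments = {}
    for run in pending:
        if run.num_gpus < 1:
            raise ValueError(f"run {run.run_id} requests {run.num_gpus} GPUs")
        if run.num_gpus <= len(free):
            assignments[run.run_id] = free[: run.num_gpus]
            free = free[run.num_gpus :]
    return assignments, free


if __name__ == "__main__":
    runs = [PendingRun(1, 2), PendingRun(2, 4), PendingRun(3, 1)]
    print(allocate_runs(runs, [0, 1, 2, 3]))
```

A first-fit policy like this lets small runs fill gaps while a larger run waits; a real scheduler would also need to release GPUs when runs finish and guard against starving large requests.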

Testing:

  • Single-GPU (regression): Ran all original lite notebooks - no regressions

  • Multi-GPU FSDP:

    1. Ran lite, standard, and large FSDP notebooks on downsampled data
    2. Validated across multiple FSDP configurations (full shard, CPU offload, gradient checkpointing) and multiple config leaves.
    3. Tested all ICOps - resume, clone, delete, clone (warm-start)
    4. Verified training plots and metric logging render correctly

pradyumna-rfai and others added 30 commits September 26, 2025 14:17
This reverts commit ea1e3b8.
@humaira-rf humaira-rf changed the title from "Feature/multi gpu" to "Multi-GPU FSDP Support" Feb 7, 2026
@humaira-rf humaira-rf requested a review from arun-rfai February 7, 2026 02:49
self.db.set_ic_ops_task_status(
run_state["task_id"], TaskStatus.COMPLETED
)
self.db.set_ic_ops_task_status(

Aren't lines 300-302 just repeating 297-299?

self.db.set_ic_ops_task_status(
run_state["task_id"], TaskStatus.COMPLETED
)
self.db.set_ic_ops_task_status(

3 LOC repeated again

self.db.set_ic_ops_task_status(
run_state["task_id"], TaskStatus.COMPLETED
)
self.db.set_ic_ops_task_status(

3 LOC repeated again

self.logger.warning(
f"Run {run_id} is already completed. Skipping Interactive Control task."
)
self.logger.warning(

3 LOC repeated again

config_leaf["additional_kwargs"] = parent_run_details["config_leaf"][
"additional_kwargs"
]
config_leaf["additional_kwargs"] = parent_run_details["config_leaf"][

3 LOC repeated again

raise ControllerException(
f"Error creating model for run {parent_run_id}: {e}"
) from e
self.ic_logger.error(

6 LOC repeated again

self.ic_logger.warning(
f"Ignoring RESUME/STOP task for run {run_id} as it is already completed"
)
self.ic_logger.warning(

3 LOC repeated again

@arun-rfai arun-rfai left a comment

Went through all the code changes and new notebooks. Looks good to me except for one minor issue in controller.py: several blocks of code are duplicated back to back, probably a merge or copy-paste slip. I have marked the locations with comments. Please resolve those before merging.
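For catching this kind of back-to-back duplication before review, a small script along these lines could help. It is a sketch, not part of this PR: find_repeated_blocks is a hypothetical helper, and it only flags a block that is immediately followed by an identical copy (ignoring indentation), which matches the 3-line and 6-line repeats flagged above.

```python
import sys


def find_repeated_blocks(lines, min_len=3, max_len=10):
    """Return (start_line, length) pairs, with 1-based start_line, for every
    block of `length` lines that is immediately followed by an identical copy.
    Leading/trailing whitespace is ignored; all-blank blocks are skipped.
    """
    stripped = [ln.strip() for ln in lines]
    hits = []
    i = 0
    while i < len(stripped):
        matched = False
        # Prefer the longest repeated block starting at this line.
        for size in range(max_len, min_len - 1, -1):
            first = stripped[i : i + size]
            second = stripped[i + size : i + 2 * size]
            if len(second) == size and first == second and any(first):
                hits.append((i + 1, size))
                i += size  # resume at the duplicate copy
                matched = True
                break
        if not matched:
            i += 1
    return hits


if __name__ == "__main__":
    # Usage: python find_dupes.py path/to/file.py
    with open(sys.argv[1], encoding="utf-8") as fh:
        for start, length in find_repeated_blocks(fh.readlines()):
            print(f"lines {start}-{start + length - 1} are repeated immediately after")
```

This will also report legitimately repeated statements, so its output is a list of candidates to inspect rather than a definitive set of errors.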
