Generalize DiLoCo to support Streaming #205

tushar00jain · 2025-05-30T22:48:14Z

Summary:

- Add an option to perform quantized allreduce in the torchft manager
- Update the user-level APIs for DiLoCo to also support Streaming DiLoCo; it now takes a list of modules as input
- Create a class `_StreamingDiLoCoFragment` used by DiLoCo to support streaming. Each fragment independently determines its schedule (when to send/sync).
- Support for the "alpha" and "tau" parameters from the paper is left as a TODO, to be added in a separate PR.
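For readers unfamiliar with the idea behind the first item, here is a toy sketch of what a quantized allreduce trades off. It is not torchft's implementation: the int8 range, the single per-rank scale, and every function name are assumptions chosen for illustration, and plain lists stand in for tensors.

```python
from typing import List, Tuple


def quantize(values: List[float]) -> Tuple[List[int], float]:
    """Map floats onto the int8 range [-127, 127] with one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate floats from the integers and their scale."""
    return [q * scale for q in quantized]


def toy_quantized_allreduce(per_rank: List[List[float]]) -> List[float]:
    """Average across ranks after each rank's values pass through int8.

    A real implementation would send the integers and scales over the
    network; here the round trip only models the precision loss.
    """
    restored = [dequantize(*quantize(values)) for values in per_rank]
    return [sum(column) / len(per_rank) for column in zip(*restored)]


result = toy_quantized_allreduce([[1.0, -2.0], [3.0, 4.0]])
```

The result is close to, but not exactly, the full-precision average `[2.0, 1.0]`, which is the accuracy cost paid for sending fewer bytes.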

Test Plan:

$ pytest -vs torchft/local_sgd_integ_test.py
$ pytest -vs torchft/local_sgd_test.py

H-Huang

Nice!! Thanks for getting this out so quickly. Just a few small comments, but they can also be addressed in follow-up PRs.

H-Huang · 2025-06-02T15:22:50Z

torchft/manager.py

@@ -267,7 +291,9 @@ def shutdown(self, wait: bool = True) -> None:
            self._manager.shutdown()
        self._executor.shutdown(wait=wait)

-    def allreduce(self, tensor: torch.Tensor) -> torch.futures.Future[torch.Tensor]:
+    def allreduce(
+        self, tensor: torch.Tensor, should_quantize: bool = False


I would also update some of the tests in manager_test.py to exercise the should_quantize flag. This can be done in a follow-up PR.

H-Huang · 2025-06-02T15:24:39Z

torchft/manager.py

+except ImportError:
+    from torch import cuda
+
+    def allreduce_quantized(


nit: is this stub necessary? Can't we just have a constant like TRITON_AVAILABLE and then check that in the if statement in the implementation?

tushar00jain · Jun 3, 2025 (edited)

Fewer configuration options 🥲 Say someone changes platforms: they can then start using triton automatically without having to modify the constant. It also avoids us having to configure CI properly for different platforms.
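The stub approach discussed in this thread can be sketched roughly as follows. This is an illustration, not the code from the PR: the module name `_hypothetical_fast_kernels` is a placeholder that does not exist (so the fallback branch always runs here), the stub's behavior of raising `RuntimeError` is an assumption, and plain lists stand in for tensors.

```python
# Placeholder import standing in for an optional, platform-specific dependency.
try:
    from _hypothetical_fast_kernels import allreduce_quantized  # type: ignore
    KERNELS_AVAILABLE = True
except ImportError:
    KERNELS_AVAILABLE = False

    def allreduce_quantized(tensors):
        """Stub with the same name, so call sites need no conditional import."""
        raise RuntimeError(
            "quantized allreduce requires the optional kernel package"
        )


def allreduce(tensors, should_quantize=False):
    """Toy dispatcher: element-wise sum unless the quantized path is requested."""
    if should_quantize:
        return allreduce_quantized(tensors)
    return [sum(column) for column in zip(*tensors)]
```

Note that the availability flag here is derived from the import itself rather than set by hand, so it switches automatically when the package becomes importable on a new platform.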

H-Huang · 2025-06-02T15:32:32Z

torchft/local_sgd.py

+    def __init__(
+        self,
+        manager: Manager,
+        model_fragments: List[nn.Module],


nit: maybe from an API / UX perspective we can support nn.Module | List[nn.Module], with the specification that passing in a single nn.Module means the whole model.

tushar00jain · Jun 3, 2025

I think it's better to have a smaller API surface and avoid having a special case?
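For context, the reviewer's suggestion would amount to a small normalization step like the one below. This is only a sketch of the alternative under discussion, which the reply argues against; the helper name `as_fragments` is hypothetical, and the `Module` class is a stand-in for `torch.nn.Module` so the example runs without PyTorch.

```python
from typing import List, Union


class Module:
    """Stand-in for torch.nn.Module (assumption, for a dependency-free sketch)."""


def as_fragments(model: Union[Module, List[Module]]) -> List[Module]:
    """Treat a single module as a one-element fragment list (the whole model)."""
    if isinstance(model, Module):
        return [model]
    fragments = list(model)
    if not fragments:
        raise ValueError("expected at least one model fragment")
    return fragments


whole_model = Module()
first_half, second_half = Module(), Module()

single = as_fragments(whole_model)
split = as_fragments([first_half, second_half])
```

The `isinstance` branch is the special case the reply refers to: it saves callers from wrapping a whole model in a list, at the cost of a second accepted input type.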

H-Huang · 2025-06-02T15:33:15Z

torchft/local_sgd.py

+                model_fragment,
+                math.floor((sync_every / len(model_fragments)) * (i + 1)),
+                inner_optimizer,
+                # TODO: Support different outer optimizers for each fragment


oh interesting, is that mentioned in the paper?

tushar00jain · Jun 3, 2025

I think they should be different; otherwise things like momentum end up being the same for all fragments? Maybe it's not very important though.
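The expression `math.floor((sync_every / len(model_fragments)) * (i + 1))` in the hunk above can be worked through numerically. The sketch below only evaluates that expression for each fragment index; the function name `fragment_offsets` is hypothetical, and how `_StreamingDiLoCoFragment` consumes the value is not shown in this thread, so treat the reading of it as a per-fragment schedule offset as an assumption.

```python
import math
from typing import List


def fragment_offsets(sync_every: int, num_fragments: int) -> List[int]:
    """Evaluate the expression from the diff for i = 0 .. num_fragments - 1."""
    return [
        math.floor((sync_every / num_fragments) * (i + 1))
        for i in range(num_fragments)
    ]


print(fragment_offsets(100, 4))  # [25, 50, 75, 100]
print(fragment_offsets(12, 3))   # [4, 8, 12]
print(fragment_offsets(10, 4))   # [2, 5, 7, 10]
```

The values are spread evenly across the `sync_every` window, with the last fragment always landing on `sync_every` itself in these cases; when the division is not exact, the floor rounds intermediate values down.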

facebook-github-bot added the CLA Signed label May 30, 2025

tushar00jain requested review from H-Huang and dzmitry-huba May 30, 2025 22:48

tushar00jain force-pushed the feature/streaming-diloco branch 2 times, most recently from cb34ff1 to 35ca865 Compare May 30, 2025 23:38

dzmitry-huba approved these changes May 30, 2025

tushar00jain force-pushed the feature/streaming-diloco branch from 35ca865 to ab00d7d Compare May 31, 2025 18:12

tushar00jain force-pushed the feature/streaming-diloco branch from ab00d7d to 0f07f2d Compare May 31, 2025 20:07

H-Huang approved these changes Jun 2, 2025

Fix errors importing triton

600864f

tushar00jain force-pushed the feature/streaming-diloco branch from 0f07f2d to 600864f Compare June 3, 2025 17:32

tushar00jain merged commit 2ac219d into pytorch:main Jun 3, 2025
8 checks passed

tushar00jain deleted the feature/streaming-diloco branch June 3, 2025 20:20
