Fix FT allreduce bug #1170
Conversation
LGTM
```diff
@@ -59,7 +58,7 @@ def get_dp_info(self, dp_degree: int, dp_rank: int) -> tuple[int, int]:

     def set_all_reduce_hook(self, model_parts: list[torch.nn.Module]) -> None:
         def all_reduce_hook(output):
-            dist.all_reduce(output, group=self.replicate_pg, op=ReduceOp.AVG)
+            self.replicate_pg.allreduce(output, opts=ReduceOp.AVG)
```
Can we add a comment discussing why we need to use this call instead of the original one? This is less intuitive.
Sure, I will add a comment. This is due to these changes (pytorch/pytorch@35c45a4#diff-61109d1cb2a0bd13fc51d678a82666295289da1ec0a1a694e73d9e8c28f51bdcR2885), so the regular `dist.all_reduce()` is no longer compatible.
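For reference, a minimal sketch of the hook with the kind of inline comment being asked for (this assumes the surrounding class from the diff above, where `self.replicate_pg` is the replica process group):

```python
def set_all_reduce_hook(self, model_parts: list[torch.nn.Module]) -> None:
    def all_reduce_hook(output):
        # After pytorch/pytorch@35c45a4, dist.all_reduce() no longer returns
        # a work object on this path, so downstream code that calls .wait()
        # on the result fails with "'NoneType' object has no attribute 'wait'".
        # Calling allreduce() on the process group directly returns a work
        # object, so use it instead of dist.all_reduce().
        self.replicate_pg.allreduce(output, opts=ReduceOp.AVG)
```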
@H-Huang can we fix this on the torchft side? We should just be able to ignore the case when there's no returned work object, IIUC.
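As a rough, hypothetical sketch of that idea (not code from this PR or from torchft internals): wherever the collective's returned work object is waited on, skip the wait when nothing was returned:

```python
import torch
import torch.distributed as dist
from torch.distributed import ReduceOp

def averaging_hook(output: torch.Tensor, replicate_pg: dist.ProcessGroup) -> None:
    # Hypothetical guard: tolerate a collective that returns no work object
    # instead of unconditionally calling .wait() on the result.
    work = dist.all_reduce(output, group=replicate_pg, op=ReduceOp.AVG)
    if work is not None:
        work.wait()
```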
I think it's better to fix this on the torchft side.
I was seeing this log / commit failure when using FT:
`should_commit=False enough_replicas=True, errored='NoneType' object has no attribute 'wait'`
It was coming from an FSDP hook which was calling `dist.all_reduce()`. The regular c10d collective now returns `None`, which causes the error above. Instead we should be using `self.replicate_pg.allreduce(output, opts=ReduceOp.AVG)`. I think we could also achieve the same thing with `manager.allreduce(output)`, so I'm not sure if the `ManagedProcessGroup` is needed. Would appreciate any thoughts @fegin @d4l3k