Conversation

@Niccolo-Ajroldi commented on Nov 3, 2025

Currently, all AllGather calls in the data-parallel Muon implementation are synchronous: after orthogonalizing a gradient and updating its corresponding parameter, each GPU must wait for every other GPU to finish processing its own parameter before moving on to the next one. We can make this faster by overlapping computation with communication and synchronizing only once, at the end of the optimization step.

The modification is very simple. Replace this:

for base_i in ...:
    # blocking collective: every rank waits here on every iteration
    dist.all_gather(...)

with:

handles = []
for base_i in ...:
    # async_op=True returns a Work handle instead of blocking
    handle = dist.all_gather(..., async_op=True)
    handles.append(handle)

# single synchronization point at the end of the optimization step
for handle in handles:
    handle.wait()
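
For reference, here is a minimal, self-contained sketch of the pattern in isolation (not the PR's actual code): each all_gather is launched with async_op=True, the returned Work handles are collected, and the step synchronizes only once at the end. The helper name overlapped_all_gather and the assumption that all ranks pass same-shaped tensors are illustrative, not taken from the repository.

    import torch
    import torch.distributed as dist

    def overlapped_all_gather(local_chunks):
        """Gather a list of same-shaped tensors from all ranks, overlapping the collectives."""
        world_size = dist.get_world_size()
        handles, outputs = [], []
        for chunk in local_chunks:
            bufs = [torch.empty_like(chunk) for _ in range(world_size)]
            # async_op=True returns a Work handle instead of blocking here,
            # so the next iteration's work is issued while NCCL moves data.
            handles.append(dist.all_gather(bufs, chunk, async_op=True))
            outputs.append(bufs)
        # Single synchronization point once all collectives have been launched.
        for handle in handles:
            handle.wait()
        return outputs

In the Muon optimizer step, the loop body would additionally orthogonalize the gradient and update the parameter owned by the current rank, as in the snippet above; only the communication pattern changes.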

Speed-up

I tested this on a 1B transformer model trained on 8xA100-80GB with DDP and observed a 20% speed-up in the optimization step when using the asynchronous version.

The speed-up will be even larger on models where the number of layers is not a multiple of the number of GPUs, since the per-rank work is then imbalanced and each blocking collective would otherwise force every GPU to wait on the busiest one.

@Niccolo-Ajroldi changed the title from "Make AllGather asynchronous in" to "Asynchronous AllGather" on Nov 3, 2025
