Asynchronous AllGather #55
Open
Currently, all AllGather calls of the data-parallel Muon implementation are synchronous. This means that after orthogonalizing a gradient and updating its corresponding parameter, each GPU must wait for every other GPU to finish processing its parameter. We can make this faster by overlapping computation and communication, and just synchronizing at the end of the optimization step.
The modification is very simple. Replace this:
with:
Speed-up
I tested this on a 1B transformer model trained on 8xA100-80GB with DDP and observed a 20% speed-up in the optimization step when using the asynchronous version.
The speed-up will be even larger on models where the number of layers is not a multiple of the number of GPUs, since in the synchronous version the GPUs that own no parameter in the final round would otherwise sit idle waiting on the collective.