DeepEP, torch.compile and Fix Megatron Training Bug (#646)
Summary: 1.15x faster Megatron training and it actually trains now.
DeepEP
DeepEP enables faster expert-parallel (EP) communication. EP communication and the pre/post-processing work surrounding it take roughly as long as the expert MLP computation itself (at least on Qwen3 30B A3B), so improvements here matter. DeepEP gives us a ~1.05x speedup over Megatron's built-in dispatcher, and the gap may grow in multi-node settings.
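For reference, upstream Megatron-LM exposes DeepEP through its flexible token dispatcher. The flags below are a sketch of the upstream interface, not necessarily how this PR wires it up, and the exact flag names may vary by Megatron version:

```shell
# Sketch of upstream Megatron-LM flags for DeepEP (names may differ by version).
# DeepEP must be installed separately from the DeepSeek repository.
torchrun --nproc-per-node 8 pretrain_gpt.py \
  --moe-token-dispatcher-type flex \
  --moe-enable-deepep \
  ... # remaining model/training arguments
```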
torch.compile
We add torch.compile to the model layers and disable some regions that are not compatible. This gives a ~1.10x speedup on top of DeepEP. I did not test max-autotune or CUDA graphs here, just basic compilation.
Megatron Training
We noticed that Megatron failed to train in the simple yes-no-maybe example. This was caused by the parameter offload: Megatron expects parameter data tensors to keep the same storage, and our offload/reload created new tensors, so Megatron lost track of them during updates. We now use Megatron's own offload API, which handles this correctly.
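A toy illustration of the failure mode (not Megatron's actual internals): the optimizer holds views into parameter storage, so rebinding param.data to a freshly allocated tensor silently breaks the aliasing, while an in-place copy preserves it.

```python
import torch

# Simplified Megatron-style setup: an optimizer-side buffer that
# aliases the parameter's storage via a view.
param = torch.nn.Parameter(torch.ones(4))
flat_buffer = param.data.view(-1)  # "optimizer" holds this view

# BROKEN offload/reload: rebinding .data allocates new storage,
# so the optimizer's view no longer aliases the live parameter.
saved = param.data.clone()
param.data = saved.clone()   # reload into a *new* tensor
param.data.add_(1.0)         # update is invisible to flat_buffer

# CORRECT: copy back in place so existing views stay valid.
param2 = torch.nn.Parameter(torch.ones(4))
view2 = param2.data.view(-1)
saved2 = param2.data.clone()
param2.data.copy_(saved2)    # in-place reload, storage unchanged
param2.data.add_(1.0)        # update is visible through view2
```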
We also remove the optimizer offload, since the optimizer is loaded from disk at the start of each job anyway.
Megatron Provider Options
We expose environment variables for controlling Megatron parallelism. We will refactor the configuration system at some point so these can be set more naturally, but for now this is the minimal control plane.
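Usage would look something like the following. The variable names here are hypothetical placeholders for illustration only; check the provider code for the actual names exposed by this PR:

```shell
# Hypothetical variable names -- see the provider code for the real ones.
export MEGATRON_TENSOR_PARALLEL_SIZE=2
export MEGATRON_PIPELINE_PARALLEL_SIZE=1
export MEGATRON_EXPERT_PARALLEL_SIZE=8
```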