
DeepEP, torch.compile and Fix Megatron Training Bug #646

Open
FurtherAI wants to merge 7 commits into main from austin/deepep_compile_and_trainability_main

Conversation

@FurtherAI
Collaborator

Summary: 1.15x faster Megatron training and it actually trains now.

DeepEP

DeepEP allows for faster expert parallel (EP) communication. EP communication and the pre/post-processing surrounding it take roughly as long as the actual expert MLP computation (on Qwen 3 30B A3B, at least), so improvements here matter. DeepEP gives us a ~1.05x speedup. The gap between DeepEP and Megatron's stock communication may grow in multi-node settings.

torch.compile

We add torch.compile to the model layers and disable some regions that are not compatible with compilation. This gives a ~1.10x speedup on top of DeepEP. I did not test max-autotune or CUDA graphs here, just basic compilation.

Megatron Training

We noticed that Megatron failed to train in the simple yes-no-maybe example. The cause was the parameter offload: Megatron expects parameter data tensors to stay constant, but our offload/reload created new tensors, so Megatron lost track of them during updates. We now use Megatron's offload API to do this properly.
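A plain-Python analogy of the bug (no Megatron or torch APIs here; the `Optimizer` class is a hypothetical stand-in): the optimizer holds direct references to parameter storage, so an offload/reload that rebinds to freshly allocated objects silently orphans those references.

```python
class Optimizer:
    """Stand-in optimizer that tracks parameters by reference, not by name."""

    def __init__(self, params):
        self.tracked = list(params)

    def step(self):
        for p in self.tracked:
            # update whatever storage we were originally handed
            p["data"] = [x - 0.1 for x in p["data"]]


param = {"data": [1.0, 2.0]}
opt = Optimizer([param])

# Buggy offload/reload: rebinding the name to a NEW object means the
# optimizer's reference now points at stale storage.
stale = param
param = {"data": list(stale["data"])}  # "reload" allocates a new object

opt.step()
# The live parameter was never updated -- training silently stalls.
untouched = param["data"] == [1.0, 2.0]
```

The fix in the PR is analogous to mutating the original storage in place (via Megatron's own offload API) instead of rebinding to a new tensor.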

We also remove the optimizer offload, since the optimizer is loaded from disk at the start of each job anyway.

Megatron Provider Options

We expose environment variables for controlling Megatron parallelism. We will refactor the configuration system at some point so these can be set naturally, but this is the minimal control plane for now.
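The shape of such a control plane might look like the following sketch. The variable names (`MEGATRON_TP_SIZE`, etc.) and the reader function are hypothetical; the actual names are defined by the provider and not listed in this PR description.

```python
import os


def read_parallelism_from_env():
    """Read Megatron parallelism sizes from env vars, defaulting to 1."""
    return {
        "tensor_model_parallel_size": int(os.environ.get("MEGATRON_TP_SIZE", "1")),
        "pipeline_model_parallel_size": int(os.environ.get("MEGATRON_PP_SIZE", "1")),
        "expert_model_parallel_size": int(os.environ.get("MEGATRON_EP_SIZE", "1")),
    }


# e.g. a launcher sets TP=2 and leaves the rest at their defaults
os.environ["MEGATRON_TP_SIZE"] = "2"
os.environ.pop("MEGATRON_PP_SIZE", None)
os.environ.pop("MEGATRON_EP_SIZE", None)
cfg = read_parallelism_from_env()
```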

@FurtherAI FurtherAI requested a review from Kovbo April 9, 2026 02:43