- 2025-07-14: X-MoE's code released
- 2025-06-26: X-MoE has been accepted at SC 2025 and nominated for Best Student Paper
X-MoE is an optimized cross-platform framework for training large-scale expert-specialized Mixture-of-Experts (MoE) models (e.g., DeepSeek-MoE style). It introduces system-level enhancements for improved end-to-end throughput and memory efficiency.
This project is built on top of DeepSpeed and Megatron-DeepSpeed.
- Flexible Training Modes: Supports both token-dropping and no-token-dropping training
- Padding-Free Design: Eliminates zero-padding to save communication volume and memory
- Communication Optimization: Reduces inter-node communication overhead on hierarchical networks
- Memory Efficiency: Hybrid TP+SP strategy to reduce activation memory
- Cross-Platform Compatibility: Support for heterogeneous GPU platforms (NVIDIA and AMD)
X-MoE introduces Padding-Free Token buffers (PFT), which eliminate zero-padding throughout the MoE computation and communication stages. Triton-based kernels handle the resulting sparse, irregular workloads.
A hierarchical, multi-stage dispatching process that eliminates redundant inter-node communication: pilot tokens and local replicas ensure that a token routed to multiple experts on the same node crosses the network only once, reducing the communication overhead of repeated tokens. It is implemented with torch.distributed primitives and Triton kernels.
A hybrid parallelism strategy that combines tensor-slicing with sequence-sharded execution for MoE blocks, reducing activation memory by a factor of the TP group size while maintaining compatibility with standard MoE routing.
For DeepSeek-style MoE training, activation memory can easily become a bottleneck; please see our paper for the analysis and use cases.
We use Megatron-DeepSpeed as our end-to-end training system. To use the Megatron backend, you need to install the platform-specific version of APEX, which may take some time. For efficient end-to-end training, we also recommend installing Flash Attention.
Please follow the commands below to install the corresponding versions of APEX and FlashAttention:
# Install Apex with CUDA extensions
git clone https://github.com/NVIDIA/apex && cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
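# (Optional) Sanity-check that APEX's CUDA extensions were actually built.
# 'amp_C' is one of the extension modules a standard --cuda_ext build installs;
# this is an illustrative check, not part of the official instructions.
python -c "import apex, amp_C; print('APEX with CUDA extensions: OK')"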
# Install FlashAttention (Optional)
pip install flash-attn --no-build-isolation

Installation on AMD platforms can be more complex, so we provide a detailed guide. We strongly recommend referring to the Frontier (MI250X) installation guide for step-by-step instructions.
Frontier Installation Guide
[Special] For Frontier Supercomputer Users: We have a pre-built environment shared on Frontier. You may refer to the installation guide above and directly use that environment.
After installing the prerequisites, you can install X-MoE with the commands below:
cd ~
git clone https://github.com/Supercomputing-System-AI-Lab/X-MoE
cd X-MoE
git submodule update --init --recursive --remote
pip install -e .
cd Megatron-DeepSpeed-X-MoE && pip install -e .

Before launching your own training, you need to convert your data into Megatron's data format. You may refer to the Megatron Data Preparation Guide.
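If you are preparing your own corpus, a typical invocation of Megatron's preprocessing tool looks like the sketch below. The input file, output prefix, and tokenizer files are placeholders, and the exact flag names (e.g., --vocab-file) can vary between Megatron-DeepSpeed versions, so check tools/preprocess_data.py in this repository for the authoritative argument list:

cd ~/X-MoE/Megatron-DeepSpeed-X-MoE
python tools/preprocess_data.py \
    --input my_corpus.json \
    --output-prefix my-corpus \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8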
If you just want to test X-MoE training, we also provide a script that prepares a sample dataset:
cd ~/X-MoE/Megatron-DeepSpeed-X-MoE/examples_xmoe/data
./prepare_data_ae.sh

We provide training examples with two launching methods: torchrun and srun. Below is the structure of the example scripts we provide:
examples_xmoe/
├── scripts/             # scripts using torchrun to launch; tested on NVIDIA A100 nodes
│   ├── X-MoE-Small-node-1.sh
│   └── ...
└── scripts-frontier/    # scripts using srun to launch; tested on Frontier (MI250X)
    ├── n8-Small-XMoE.slurm
    └── ...

Quick Start Recommendation: Use X-MoE-Small-node-1.sh or n8-Small-XMoE.slurm to launch a 10B-parameter DeepSeek-MoE-like model training task on one GPU node with multiple GPUs.
To launch with torchrun (e.g., on an NVIDIA A100 node):

cd ~/X-MoE/Megatron-DeepSpeed-X-MoE/examples_xmoe/scripts
./X-MoE-Small-node-1.sh <NUM_GPUS> <MICRO_BATCH_SIZE>

To launch with srun on Frontier:

cd ~/X-MoE/Megatron-DeepSpeed-X-MoE/examples_xmoe/scripts-frontier
./n8-Small-XMoE.slurm

Note: The first run may require an additional 10 minutes to compile kernels.
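For example, a single-node torchrun launch on 8 GPUs with a micro-batch size of 2 (illustrative values; choose what fits your hardware and memory budget) would be:

./X-MoE-Small-node-1.sh 8 2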
Expectation: After initialization, training logs are printed to the terminal as training progresses and are also saved to a log file. On one Frontier compute node (4× MI250X GPUs), the expected throughput is ~50-55 TFLOPS with X-MoE optimizations.
We integrated the X-MoE optimization API into Megatron's launch arguments. Specify the following flags to enable the basic optimizations:
--use-uneven-all-to-all \
--use-pft

These flags enable padding-free format training and the corresponding kernels.
I. Redundancy-Bypassing Dispatch: For DeepSeek-style model training in multi-node settings, specify:

--use-rbd

This enables redundancy-bypassing dispatching and may reduce all-to-all communication time.
II. Sequence Sharding for Memory Optimization: If activation memory becomes a bottleneck, specify:
--tensor-model-parallel-size <TP_SIZE>

When --enable-expert-tensor-parallelism is not specified, the flag above will automatically enable sequence sharding for MoE blocks.
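Putting it together, a minimal sketch of a launch command with all X-MoE optimizations enabled might look like the following (pretrain_gpt.py and the placeholder arguments stand in for the full Megatron-DeepSpeed configuration; the scripts under examples_xmoe/ contain complete, tested argument lists):

torchrun --nproc_per_node=<NUM_GPUS> pretrain_gpt.py \
    <your usual Megatron-DeepSpeed model/data/DeepSpeed arguments> \
    --use-uneven-all-to-all \
    --use-pft \
    --use-rbd \
    --tensor-model-parallel-size <TP_SIZE>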
Our evaluation on the Frontier supercomputer demonstrates that X-MoE enables:
- Training of models up to 545B parameters on 1024 AMD GPUs
- 10× larger than existing solutions
- Up to 1.42× higher training throughput
@misc{yuan2025xmoeenablingscalabletraining,
title={X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms},
author={Yueming Yuan and Ahan Gupta and Jianping Li and Sajal Dash and Feiyi Wang and Minjia Zhang},
year={2025},
eprint={2508.13337},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.13337},
}

