- 2025-07-14: X-MoE's code released
- 2025-06-26: X-MoE has been accepted at SC 2025 and nominated for Best Student Paper
X-MoE is an optimized cross-platform framework for training large-scale expert-specialized Mixture-of-Experts (MoE) models (e.g., DeepSeek-MoE style). It introduces system-level enhancements for improved end-to-end throughput and memory efficiency.
This project is built on top of DeepSpeed and Megatron-DeepSpeed.
- Flexible Training Modes: Supports both token-dropping and no-token-dropping training
- Padding-Free Design: Eliminates zero-padding to save communication volume and memory
- Communication Optimization: Reduces inter-node communication overhead on hierarchical networks
- Memory Efficiency: Hybrid TP+SP strategy to reduce activation memory
- Cross-Platform Compatibility: Support for heterogeneous GPU platforms (NVIDIA and AMD)
X-MoE introduces Padding-Free Token buffers (PFT), which eliminate zero-padding throughout the MoE computation and communication stages. Triton-based kernels handle the resulting sparse, irregular workloads.
A hierarchical, multi-stage dispatching process that eliminates redundant inter-node communication: pilot tokens and local replicas ensure that a token routed to multiple experts on the same node crosses the network only once, reducing the communication overhead of repeated tokens. It is implemented with torch.distributed primitives and Triton kernels.
A hybrid parallelism strategy that combines tensor-slicing with sequence-sharded execution for MoE blocks, reducing activation memory by a factor of the TP group size while maintaining compatibility with standard MoE routing.
For DeepSeek-style MoE training, activation memory can easily become a bottleneck; please see our paper for the analysis and use cases.
We use Megatron-DeepSpeed as our end-to-end training system. To use the Megatron backend, you need to install the platform-specific version of APEX, which may take some time. For efficient end-to-end training, we also recommend installing Flash Attention.
Please follow the commands below to install the corresponding versions of APEX and FlashAttention:
# Install Apex with CUDA extensions
git clone https://github.com/NVIDIA/apex && cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
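# (Optional) Sanity-check that APEX's CUDA extensions were actually built.
# 'amp_C' is one of the extension modules a standard --cuda_ext build installs;
# this is an illustrative check, not part of the official instructions.
python -c "import apex, amp_C; print('APEX with CUDA extensions: OK')"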
# Install FlashAttention (Optional)
pip install flash-attn --no-build-isolation

Installation on AMD platforms can be more complex, so we provide a detailed guide. We strongly recommend referring to the Frontier (MI250X) installation guide for step-by-step instructions.
Frontier Installation Guide
[Special] For Frontier Supercomputer Users: We have a pre-built environment shared on Frontier. You may refer to the installation guide above and directly use that environment.
After installing the prerequisites, you can install X-MoE with the commands below:
cd ~
git clone https://github.com/Supercomputing-System-AI-Lab/X-MoE
cd X-MoE
git submodule update --init --recursive --remote
pip install -e .
cd Megatron-DeepSpeed-X-MoE && pip install -e .

Before launching your own training, you need to convert your data into Megatron's data format. You may refer to the Megatron Data Preparation Guide.
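If you are preparing your own corpus, a typical invocation of Megatron's preprocessing tool looks like the sketch below. The input file, output prefix, and tokenizer files are placeholders, and the exact flag names (e.g., --vocab-file) can vary between Megatron-DeepSpeed versions, so check tools/preprocess_data.py in this repository for the authoritative argument list:

cd ~/X-MoE/Megatron-DeepSpeed-X-MoE
python tools/preprocess_data.py \
    --input my_corpus.json \
    --output-prefix my-corpus \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json \
    --merge-file gpt2-merges.txt \
    --append-eod \
    --workers 8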
If you just want to test X-MoE training, we also provide a script that prepares a sample dataset:
cd ~/X-MoE/Megatron-DeepSpeed-X-MoE/examples_xmoe/data
./prepare_data_ae.sh

We provide training examples with two launching methods: torchrun and srun. Below is the structure of the example scripts we provide:
examples_xmoe/
├── scripts/             # scripts using torchrun to launch; tested on NVIDIA A100 nodes
│   ├── X-MoE-Small-node-1.sh
│   └── ...
└── scripts-frontier/    # scripts using srun to launch; tested on Frontier (MI250X)
    ├── n8-Small-XMoE.slurm
    └── ...

Quick Start Recommendation: Use X-MoE-Small-node-1.sh or n8-Small-XMoE.slurm to launch a 10B-parameter DeepSeek-MoE-like model training task on one GPU node with multiple GPUs.
To launch with torchrun (e.g., on an NVIDIA A100 node):

cd ~/X-MoE/Megatron-DeepSpeed-X-MoE/examples_xmoe/scripts
./X-MoE-Small-node-1.sh <NUM_GPUS> <MICRO_BATCH_SIZE>

To launch with srun on Frontier:

cd ~/X-MoE/Megatron-DeepSpeed-X-MoE/examples_xmoe/scripts-frontier
./n8-Small-XMoE.slurm

Note: The first run may require an additional 10 minutes to compile kernels.
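For example, a single-node torchrun launch on 8 GPUs with a micro-batch size of 2 (illustrative values; choose what fits your hardware and memory budget) would be:

./X-MoE-Small-node-1.sh 8 2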
Expectation: After initialization, training logs are printed to the terminal as training progresses and are also saved to a log file. On one Frontier compute node (4× MI250X GPUs), the expected throughput is ~50-55 TFLOPS with X-MoE optimizations.
We integrated the X-MoE optimization API into Megatron's launch arguments. Specify the following flags to enable the basic optimizations:
--use-uneven-all-to-all \
--use-pft

These flags enable padding-free format training and the corresponding kernels.
I. Redundancy-Bypassing Dispatch: For DeepSeek-style model training in multi-node settings, specify:

--use-rbd

This enables redundancy-bypassing dispatching and may reduce all-to-all communication time.
II. Sequence Sharding for Memory Optimization: If activation memory becomes a bottleneck, specify:
--tensor-model-parallel-size <TP_SIZE>

When --enable-expert-tensor-parallelism is not specified, the flag above will automatically enable sequence sharding for MoE blocks.
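Putting it together, a minimal sketch of a launch command with all X-MoE optimizations enabled might look like the following (pretrain_gpt.py and the placeholder arguments stand in for the full Megatron-DeepSpeed configuration; the scripts under examples_xmoe/ contain complete, tested argument lists):

torchrun --nproc_per_node=<NUM_GPUS> pretrain_gpt.py \
    <your usual Megatron-DeepSpeed model/data/DeepSpeed arguments> \
    --use-uneven-all-to-all \
    --use-pft \
    --use-rbd \
    --tensor-model-parallel-size <TP_SIZE>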
Our evaluation on the Frontier supercomputer demonstrates that X-MoE enables:
- Training of models up to 545B parameters on 1024 AMD GPUs
- 10× larger than existing solutions
- Up to 1.42× higher training throughput
@misc{yuan2025xmoeenablingscalabletraining,
title={X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms},
author={Yueming Yuan and Ahan Gupta and Jianping Li and Sajal Dash and Feiyi Wang and Minjia Zhang},
year={2025},
eprint={2508.13337},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.13337},
}

