X-MoE: Cross-Platform Training Framework for Expert-Specialized MoE

X-MoE Overview

Project Page NVIDIA Support AMD Support

📢 News

  • 2025-07-14: X-MoE's code released
  • 2025-06-26: X-MoE has been accepted at SC 2025 and received a Best Student Paper nomination

🚀 About

X-MoE is an optimized cross-platform framework for training large-scale expert-specialized Mixture-of-Experts (MoE) models (e.g., DeepSeek-MoE style). It introduces system-level enhancements for improved end-to-end throughput and memory efficiency.

This project is built on top of DeepSpeed and Megatron-DeepSpeed.

✨ Features

Key Capabilities

  • Flexible Training Modes: Supports both token-dropping and no-token-dropping (dropless) training
  • Padding-Free Design: Eliminates all zero-padding to save communication and memory
  • Communication Optimization: Reduces inter-node communication overhead on hierarchical networks
  • Memory Efficiency: Hybrid TP+SP strategy to reduce activation memory
  • Cross-Platform Compatibility: Heterogeneous GPU support

Padding-Free MoE Training Pipeline (PFT)

X-MoE introduces padding-free token buffers (PFT), which eliminate zero-padding throughout the MoE computation and communication stages. Triton-based kernels handle the resulting sparse and irregular workloads.
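To make the idea concrete, here is a minimal PyTorch sketch (illustrative only; the helper names are made up and X-MoE's actual implementation uses custom Triton kernels): tokens routed to each expert are packed back-to-back in a single flat buffer, and per-expert counts replace fixed-capacity padding.

import torch

def pack_padding_free(tokens, expert_ids, num_experts):
    # tokens: [num_tokens, hidden]; expert_ids: [num_tokens] (top-1 routing for simplicity)
    order = torch.argsort(expert_ids)                           # group tokens by destination expert
    packed = tokens[order]                                      # one flat buffer, no capacity padding
    counts = torch.bincount(expert_ids, minlength=num_experts)  # uneven per-expert segment sizes
    return packed, counts, order

def unpack_padding_free(packed, order):
    out = torch.empty_like(packed)
    out[order] = packed                                         # restore the original token order
    return out

x = torch.randn(8, 4)
eid = torch.tensor([2, 0, 1, 2, 0, 1, 1, 0])
packed, counts, order = pack_padding_free(x, eid, num_experts=4)
print(counts.tolist())                                          # [3, 3, 2, 0]: no segment is padded
assert torch.equal(unpack_padding_free(packed, order), x)

In the full pipeline, these per-expert counts are what drive the uneven all-to-all exchanges, so each expert receives exactly the tokens routed to it rather than a fixed-capacity slot padded with zeros.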

Redundancy-Bypassing Dispatch (RBD)

A hierarchical, multi-stage dispatching process that eliminates redundant inter-node communication by using pilot tokens and local replicas, reducing communication overhead for repeated tokens. It is implemented with torch.distributed and Triton.
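As a conceptual sketch (assumed here for illustration, not X-MoE's actual code), the dispatch plan sends only one "pilot" copy of a token to each remote node even when several of its top-k experts live on that node; the per-expert replicas are then re-created locally at the destination.

import torch

def plan_rbd_dispatch(expert_ids, experts_per_node):
    # expert_ids: [num_tokens, top_k] routing decisions on one source rank
    node_ids = expert_ids // experts_per_node                  # node hosting each chosen expert
    inter_node_sends = []                                      # (token, node): pilot copies that cross the network
    local_replicas = []                                        # (token, node, expert): copies built at the destination
    for tok in range(node_ids.size(0)):
        seen_nodes = set()
        for k in range(node_ids.size(1)):
            node, expert = int(node_ids[tok, k]), int(expert_ids[tok, k])
            if node not in seen_nodes:
                inter_node_sends.append((tok, node))           # send the token once per destination node
                seen_nodes.add(node)
            local_replicas.append((tok, node, expert))         # all expert copies derive from the pilot
    return inter_node_sends, local_replicas

# 2 tokens, top-2 routing, 4 experts per node: token 0 picks experts 0 and 3 (same node),
# so it crosses the network once instead of twice.
sends, replicas = plan_rbd_dispatch(torch.tensor([[0, 3], [1, 5]]), experts_per_node=4)
print(len(sends), "inter-node sends for", len(replicas), "expert-level copies")   # 3 sends, 4 copies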

Sequence-Sharded MoE Blocks (SSMB)

A hybrid parallelism strategy that combines tensor-slicing with sequence-sharded execution for MoE blocks, reducing activation memory by a factor of the TP group size while maintaining compatibility with standard MoE routing.

For DeepSeek-style MoE training, activation memory can easily become a bottleneck. Please see our paper for the analysis and use cases.
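The single-process sketch below (illustrative only; in the distributed implementation the split and re-assembly would be collective operations over the TP group) shows the core idea: each rank pushes only its 1/TP slice of the sequence through the MoE block, so routing buffers and expert activations on that rank shrink by roughly the TP group size.

import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    # A deliberately small top-1 MoE block used only to illustrate sequence sharding.
    def __init__(self, hidden=16, num_experts=4):
        super().__init__()
        self.router = nn.Linear(hidden, num_experts)
        self.experts = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(num_experts)])

    def forward(self, x):                                      # x: [tokens, hidden]
        top1 = self.router(x).argmax(-1)                       # per-token expert choice
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top1 == e
            if mask.any():
                out[mask] = expert(x[mask])
        return out

def moe_sequence_sharded(moe, x, tp_size):
    # Each TP rank would hold only one of these shards, so per-rank MoE activation
    # memory scales with tokens / tp_size instead of the full local token count.
    shards = x.chunk(tp_size, dim=0)
    return torch.cat([moe(s) for s in shards], dim=0)

moe = TinyMoE()
x = torch.randn(64, 16)                                        # 64 tokens in the local sequence
y = moe_sequence_sharded(moe, x, tp_size=4)                    # each "rank" routes only 16 tokens
print(y.shape)                                                 # torch.Size([64, 16])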

🏁 Quick Start

Prerequisites

We use Megatron-DeepSpeed as our end-to-end training system. To use the Megatron backend, you need to install the platform-specific version of APEX, which may take some time. For efficient end-to-end training, we also recommend installing Flash Attention.

For NVIDIA Users

Please follow the commands below to install the corresponding versions of APEX and FlashAttention:

# Install Apex with CUDA extensions
git clone https://github.com/NVIDIA/apex && cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

# Install FlashAttention (Optional)
pip install flash-attn --no-build-isolation

For AMD Users

Installation on AMD platforms can be more complex, so we provide a detailed guide. We strongly recommend referring to the Frontier (MI250X) installation guide for step-by-step instructions.

💡 Frontier Installation Guide

[Special] For Frontier Supercomputer Users: A pre-built environment is shared on Frontier. You may follow the installation guide above and use that environment directly.

Installing X-MoE

After installing the prerequisites, you can install X-MoE with the commands below:

cd ~
git clone https://github.com/Supercomputing-System-AI-Lab/X-MoE
cd X-MoE
git submodule update --init --recursive --remote

pip install -e .
cd Megatron-DeepSpeed-X-MoE && pip install -e .

πŸƒβ€β™‚οΈ Running Training with X-MoE

📊 Data Preparation

Before launching your own training, you need to convert your data into Megatron's data format. You may refer to the Megatron Data Preparation Guide.

If you just want to test X-MoE training, we also provide a script that prepares a sample dataset:

cd ~/X-MoE/Megatron-DeepSpeed-X-MoE/examples_xmoe/data
./prepare_data_ae.sh

🚀 Launch Your Training Example

We provide training examples with two launching methods: torchrun and srun. Below is the structure of the example scripts:

examples_xmoe/
├── scripts/           # scripts using torchrun to launch; tested with NVIDIA A100 node
│   ├── X-MoE-Small-node-1.sh
│   └── ...
└── scripts-frontier/  # scripts using srun to launch; tested on Frontier (MI250X)
    ├── n8-Small-XMoE.slurm
    └── ...

Quick Start Recommendation: Use X-MoE-Small-node-1.sh or n8-Small-XMoE.slurm to launch training of a 10B-parameter DeepSeek-MoE-style model on a single node with multiple GPUs.

Option 1: Using torchrun (NVIDIA)

cd ~/X-MoE/Megatron-DeepSpeed-X-MoE/examples_xmoe/scripts
./X-MoE-Small-node-1.sh <NUM_GPUS> <MICRO_BATCH_SIZE>

Option 2: Using srun (Frontier Supercomputer)

cd ~/X-MoE/Megatron-DeepSpeed-X-MoE/examples_xmoe/scripts-frontier
./n8-Small-XMoE.slurm

Note: The first run may require an additional 10 minutes to compile kernels.

Expectation: After initialization, training logs are printed to the terminal as training progresses and are also saved to a log file. On one Frontier compute node (4× MI250X GPUs), the expected throughput with X-MoE optimizations is ~50-55 TFLOPS.

βš™οΈ How to Enable X-MoE Optimizations?

Basic Optimizations

The X-MoE optimization API is integrated into Megatron's launch arguments. Specify these flags to enable the basic optimizations:

--use-uneven-all-to-all \
--use-pft

These flags enable training in the padding-free format and the corresponding kernels.

Advanced Optimizations (Large-scale, Multi-node)

I. Redundancy-Bypassing Dispatch: For DeepSeek-style model training in multi-node settings, specify:

--use-rbd

This enables redundancy-bypassing dispatch and can reduce all-to-all communication time.

II. Sequence Sharding for Memory Optimization: If activation memory becomes a bottleneck, specify:

--tensor-model-parallel-size <TP_SIZE>

When --enable-expert-tensor-parallelism is not specified, the above flag will automatically enable sequence sharding for MoE blocks.


📊 Evaluation

Our evaluation on the Frontier supercomputer demonstrates that X-MoE enables:

  • Training of models with up to 545B parameters on 1024 AMD GPUs
  • 10× larger than the models existing solutions can train
  • Up to 1.42× higher training throughput

X-MoE Performance Results

πŸ“ Citation

@misc{yuan2025xmoeenablingscalabletraining,
      title={X-MoE: Enabling Scalable Training for Emerging Mixture-of-Experts Architectures on HPC Platforms}, 
      author={Yueming Yuan and Ahan Gupta and Jianping Li and Sajal Dash and Feiyi Wang and Minjia Zhang},
      year={2025},
      eprint={2508.13337},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.13337}, 
}
