## Description
Creating this issue as a roadmap/tracker for enabling float8 training for MoEs with token-choice routing. It covers both the core requirements and ideas for additional performance optimizations.
## Compute
- Add `torch._scaled_grouped_mm` kernel in core (already done by Natalia)
- Add differentiable scaled grouped mm with dynamic float8 rowwise quant in torchao (done in Initial prototype of differentiable _scaled_grouped_mm function #1969)
- Add custom kernels in torchao for performing per-group scaling on device, to avoid host-device sync (done in [scaled grouped mm] add triton kernels for float8 rowwise quantization with per-group/jagged scales #2064 and [scaled grouped mm] integrate triton kernels into differentiable scaled grouped mm #2077)
- Fuse padding of group sizes up to the nearest multiple of 16 into the dynamic quant kernel - this will improve perf and usability, since callers can pass in raw token groups without doing the padding logic themselves, which will make migration easier (a sketch of the padding a caller currently has to do follows this list).
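For context, here is a minimal sketch of the host-side per-group padding a caller currently has to do before the grouped GEMM. The function name and the offsets format (cumulative end offset per expert group) are assumptions for illustration, not the torchao API:

```python
import torch

def pad_token_groups_to_multiple_of_16(tokens: torch.Tensor, group_offsets: torch.Tensor):
    # tokens: (total_tokens, dim), already permuted into expert order.
    # group_offsets: cumulative end offset of each expert's token group.
    # Pads each group with zero rows so its size is a multiple of 16. Note that
    # .tolist() forces a host-device sync, which is part of what fusing this
    # logic into the dynamic quant kernel would avoid.
    padded_groups, padded_sizes = [], []
    start = 0
    for end in group_offsets.tolist():
        group = tokens[start:end]
        pad_rows = (-group.shape[0]) % 16
        if pad_rows > 0:
            group = torch.cat([group, group.new_zeros(pad_rows, group.shape[1])])
        padded_groups.append(group)
        padded_sizes.append(group.shape[0])
        start = end
    new_offsets = torch.tensor(padded_sizes, device=tokens.device).cumsum(0)
    return torch.cat(padded_groups), new_offsets
```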
## Communication
I looked at traces and validated that "all to all dispatch -> grouped gemm -> all to all combine" are sequentially dependent, so in theory faster/lower-precision comms should improve performance. There is some overlap with the shared expert computation, but the overlap is not 100%, so there is room for optimization. This will be especially important if/when the all-to-all spans multiple nodes, where inter-node network bandwidth is lower than intra-node NVLink bandwidth.
This is also inspired by the DeepSeekV3 paper where, if I understand correctly, they do the a2a dispatch in fp8 but keep the a2a combine in bf16, as they found the combine step was more sensitive to low precision during training. A rough sketch of an fp8 dispatch at the collective level is included after the list below.
- Add on-device `all_to_all_v` kernels compatible with:
  - float8 tensors with tensorwise scales (easiest).
  - float8 tensors with rowwise scales (harder).
- When permuting token groups to be in the same order as experts (prior to the scaled grouped mm), reorder scales accordingly.
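As a strawman, this is roughly what a low-precision dispatch could look like at the PyTorch collective level (before any custom on-device kernels): exchange the float8 payload and its rowwise scales with two `all_to_all_single` calls. Viewing the float8 tensor as uint8 for the collective, and the function/argument names, are assumptions for illustration only:

```python
import torch
import torch.distributed as dist

def fp8_a2a_dispatch(tokens_fp8: torch.Tensor, scales: torch.Tensor,
                     input_splits: list[int], output_splits: list[int], group=None):
    # tokens_fp8: (num_local_tokens, dim) in torch.float8_e4m3fn
    # scales:     (num_local_tokens, 1) rowwise dequant scales in fp32
    # Payload is viewed as uint8 in case the backend does not accept float8 dtypes.
    out_tokens = tokens_fp8.new_empty(sum(output_splits), tokens_fp8.shape[1])
    dist.all_to_all_single(out_tokens.view(torch.uint8), tokens_fp8.view(torch.uint8),
                           output_splits, input_splits, group=group)
    # Rowwise scales must travel with (and stay aligned to) their token rows.
    out_scales = scales.new_empty(sum(output_splits), scales.shape[1])
    dist.all_to_all_single(out_scales, scales, output_splits, input_splits, group=group)
    return out_tokens, out_scales
```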
## Torchao UX
- `JaggedFloat8Tensor` (name TBD) with an op override for `torch.aten._grouped_mm` => runs the differentiable scaled grouped mm (a rough sketch follows this list).
- One-line conversion API, either integrated into `convert_to_float8_training` or a standalone one (TBD). Swaps the `nn.Parameter` data tensors of the expert weights with `JaggedFloat8Tensor`s.
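A rough sketch of the subclass + conversion UX, assuming a wrapper-subclass design similar to torchao's existing float8 tensors. `_scaled_grouped_mm` below is a stub standing in for the differentiable scaled grouped mm from #1969, the op name follows the issue text, and `convert_moe_to_jagged_float8` is a hypothetical name for the conversion API:

```python
import torch
from torch.utils._pytree import tree_map

def _scaled_grouped_mm(*args, **kwargs):
    raise NotImplementedError("placeholder for torchao's differentiable scaled grouped mm")

class JaggedFloat8Tensor(torch.Tensor):
    # Wraps the high-precision expert weight; grouped GEMMs get rerouted to the
    # differentiable scaled grouped mm, everything else falls back to the hp tensor.

    @staticmethod
    def __new__(cls, hp_weight: torch.Tensor):
        return torch.Tensor._make_wrapper_subclass(
            cls, hp_weight.shape, dtype=hp_weight.dtype,
            device=hp_weight.device, requires_grad=hp_weight.requires_grad)

    def __init__(self, hp_weight: torch.Tensor):
        self._hp_weight = hp_weight

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        unwrap = lambda t: t._hp_weight if isinstance(t, cls) else t
        if func is torch.ops.aten._grouped_mm.default:
            # Route grouped GEMMs through the differentiable scaled grouped mm.
            return _scaled_grouped_mm(*tree_map(unwrap, args), **tree_map(unwrap, kwargs))
        return func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs))

def convert_moe_to_jagged_float8(model: torch.nn.Module,
                                 filter_fn=lambda fqn: "experts" in fqn):
    # Hypothetical one-line conversion: swap matching expert weight params in place.
    for module_fqn, module in model.named_modules():
        for name, param in list(module.named_parameters(recurse=False)):
            if filter_fn(f"{module_fqn}.{name}"):
                setattr(module, name, torch.nn.Parameter(
                    JaggedFloat8Tensor(param.data), requires_grad=param.requires_grad))
    return model
```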
## Compile support
- Differentiable `_scaled_grouped_mm` can compile with `fullgraph=True`
- E2E compilation of each TransformerBlock in torchtitan after MoE conversion via the tensor subclass approach (sketch below)
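A sketch of the intended compile flow, mirroring how torchtitan compiles each TransformerBlock individually; the `layers` container name is an assumption here:

```python
import torch
import torch.nn as nn

def compile_transformer_blocks(model: nn.Module) -> None:
    # Compile each TransformerBlock individually (torchtitan-style). With the MoE
    # layers converted to the tensor subclass, each block should trace without
    # graph breaks, i.e. fullgraph=True should hold.
    for name, block in model.layers.named_children():
        model.layers.register_module(name, torch.compile(block, fullgraph=True))
```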
## Distributed support
- Composability with FSDP2 (will likely need something like this for the new tensor subclass; a rough sketch follows this list)
- Composability with TP (will likely need something like these sharding primitives for the new tensor subclass)
- Composability with tp2ep
- Composability with dp2ep
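A rough sketch of the FSDP2 composability goal, assuming torchtitan-style per-block sharding of the converted model; the `fully_shard` import path varies across PyTorch versions and the `layers` attribute is assumed:

```python
import torch
from torch.distributed.fsdp import fully_shard  # FSDP2; path may differ by PyTorch version

def apply_fsdp2_after_conversion(model: torch.nn.Module) -> torch.nn.Module:
    # Expert weights are already swapped for the new subclass; FSDP2 then needs the
    # subclass to implement the all-gather extension hooks (as float8 tensors do
    # today) so sharding/unsharding round-trips correctly.
    for block in model.layers.children():
        fully_shard(block)
    fully_shard(model)
    return model
```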