Description
Propose to make RCEIL the default for MXFP scale derivation.
RCEIL optimizes the MXFP E8 scale factor with respect to the ratio block_amax / dtype_max. An example formula is: X_E8 = to_e8(ceil(log2(block_amax / dtype_max))). The real implementation uses bit manipulation. The NVIDIA Blackwell architecture has the instruction cvt.rp.satfinite.ue8m0x2.f32, and TransformerEngine and CUTLASS have already adopted RCEIL.
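As a rough illustration of the bit-manipulation approach, the sketch below reads the float32 bits of the ratio and bumps the exponent when any mantissa bit is set. The function name `rceil_e8m0_bits` is hypothetical, and this is neither the torchao kernel nor an emulation of the Blackwell instruction. It assumes a positive, finite, normal float32 ratio and a bias of 127 for the E8M0 encoding.

```python
import struct

E8M0_NAN = 255  # assumption: 255 is reserved for NaN, so saturate at 254


def rceil_e8m0_bits(block_amax: float, dtype_max: float) -> int:
    """Return the biased E8M0 exponent of block_amax / dtype_max, rounded up.

    Assumes the ratio is a positive, finite, normal float32 value;
    zero, subnormals, inf and NaN get no special handling in this sketch.
    """
    ratio = block_amax / dtype_max
    # Reinterpret the ratio as its raw IEEE-754 float32 bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", ratio))[0]
    exponent = (bits >> 23) & 0xFF  # biased exponent (bias 127, same as E8M0)
    mantissa = bits & 0x7FFFFF      # 23 fraction bits
    # A nonzero mantissa means the ratio lies strictly above 2**(exponent - 127),
    # so rounding toward +inf moves to the next power of two.
    if mantissa != 0:
        exponent += 1
    # Saturate below the NaN encoding.
    return min(exponent, E8M0_NAN - 1)
```

For an exact power of two, such as `rceil_e8m0_bits(224.0, 448.0)`, the mantissa is zero and the exponent is returned unchanged.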
In contrast, the OCP spec's ceil formula is X_E8 = to_e8(ceil(log2(block_amax)) - floor(log2(dtype_max))).
With block_amax = 150 and dtype_max = float8_e4m3.max = 448, RCEIL gives X_E8 = to_e8(ceil(log2(150 / 448))) = 126, while OCP ceil gives X_E8 = to_e8(ceil(log2(150)) - floor(log2(448))) = 127.
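The two formulas can be compared directly with a short script. This is a minimal sketch: `to_e8` is assumed here to add the E8M0 bias of 127 and clamp to [0, 254], which matches the values 126 and 127 quoted above. The helper names `rceil_scale` and `ocp_ceil_scale` are hypothetical and are not torchao APIs.

```python
import math

E8M0_BIAS = 127       # assumption: E8M0 stores exponent + 127
FP8_E4M3_MAX = 448.0  # float8_e4m3 max, as stated in the description


def to_e8(unbiased_exp: int) -> int:
    """Bias the exponent and clamp it to the finite E8M0 range (assumed [0, 254])."""
    return max(0, min(254, unbiased_exp + E8M0_BIAS))


def rceil_scale(block_amax: float, dtype_max: float = FP8_E4M3_MAX) -> int:
    """RCEIL: take the log of the ratio first, then round up."""
    return to_e8(math.ceil(math.log2(block_amax / dtype_max)))


def ocp_ceil_scale(block_amax: float, dtype_max: float = FP8_E4M3_MAX) -> int:
    """OCP ceil: round each log separately, then subtract."""
    return to_e8(math.ceil(math.log2(block_amax)) - math.floor(math.log2(dtype_max)))


if __name__ == "__main__":
    amax = 150.0
    for name, fn in (("RCEIL", rceil_scale), ("OCP ceil", ocp_ceil_scale)):
        x_e8 = fn(amax)
        scaled = amax / 2.0 ** (x_e8 - E8M0_BIAS)
        print(f"{name}: X_E8={x_e8}, amax/scale={scaled}")
```

With block_amax = 150, the RCEIL scale of 2^-1 maps the block maximum to 300, which still fits under 448, while the OCP ceil scale of 2^0 leaves it at 150 and uses less of the e4m3 range.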
RCEIL was used to generate the cast-only slide for GTC 2025, which showed that direct-cast MXFP8 can achieve iso-accuracy with BF16. We have also verified model training and accuracy across various model sizes and token counts to confirm that MXFP8 with RCEIL achieves iso-accuracy with BF16.
The benefits of using RCEIL as the default are:
- Users get hardware acceleration for MXFP scale derivation on the Blackwell architecture.
- Users get cast-only MXFP8 out of the box with torchao, with iso-accuracy to BF16 for deployment.
- Users get MXFP8 training and downstream-task accuracy on par with BF16.