
Introduce arith.scaling_extf and arith.scaling_truncf #141965


Open · wants to merge 33 commits into main

Conversation

@umangyadav (Contributor) commented May 29, 2025:

This PR adds arith.scaling_truncf and arith.scaling_extf operations, which perform block quantization and dequantization following the OCP MXFP spec listed here: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf

The OCP MXFP spec comes with a reference implementation here: https://github.com/microsoft/microxcaling/tree/main

The interesting piece of reference code is the _quantize_mx method: https://github.com/microsoft/microxcaling/blob/7bc41952de394f5cc5e782baf132e7c7542eb4e4/mx/mx_ops.py#L173

Both arith.scaling_truncf and arith.scaling_extf are designed as elementwise operations. Please see their descriptions in the ArithOps.td file for more details.

A few things to note about arith.scaling_truncf (a scalar model is sketched after this list):

  1. The OCP spec flushes denorms to zero.
  2. It normalizes the shared scale exponent by emax (the exponent of the largest normal number in the resulting quantized type).
  3. It clamps the normalized shared exponent.
  4. NaNs are propagated.
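
To make these notes concrete, here is a minimal scalar model in C++ (illustrative only, not the actual lowering; the helper name and the clamp bounds are assumptions):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Scalar model of the notes above. `scaleExp` is the biased f8E8M0FNU shared
// scale (bias 127); `emax` is the exponent of the largest normal number in
// the resulting quantized type. The final truncf into the quantized type is
// left out; only the scale handling is modeled.
float applyScalingTruncfModel(float input, uint8_t scaleExp, int emax) {
  if (std::isnan(input))
    return input;                                 // note 4: NaNs propagate
  if (std::fpclassify(input) == FP_SUBNORMAL)
    input = 0.0f;                                 // note 1: flush denorms
  int unbiased = static_cast<int>(scaleExp) - 127;
  int normalized = unbiased - emax;               // note 2: normalize by emax
  normalized = std::clamp(normalized, -127, 127); // note 3: clamp (assumed bounds)
  return std::ldexp(input, -normalized);          // divide by 2^normalized
}
```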

CC: @krzysz00 @dhernandez0 @bjacob @pashu123 @MaheshRavishankar @tgymnich

@krzysz00 (Contributor) left a comment:

Some notes

return rewriter.notifyMatchFailure(
op, "scaling truncf is not using scale operand of type f8E8M0FNU");
}
auto scaleTy = scaleOperand.getType();
Contributor:

Type

} else if (inputETy.getIntOrFloatBitWidth() > 32) {
inputOperand = b.create<arith::TruncFOp>(f32Ty, inputOperand);
}
inputTy = inputOperand.getType();
Contributor:

We could update these to f32Type in the if statements above, but it doesn't matter

Value c127 = createConst(op->getLoc(), i32Ty, 127, rewriter);
Value cNeg127 = createConst(op->getLoc(), i32Ty, -127, rewriter);
Value scaleI8 = b.create<arith::BitcastOp>(i8Ty, scaleOperand);
Value scaleI32 = b.create<arith::ExtSIOp>(i32Ty, scaleI8);
Contributor:

This should be an extui. But also, there's no need to go to i32 here.

@umangyadav (Contributor Author):

I first need to calculate the unbiased scale value. I can do that while staying in i8.

But then I also need to subtract emax (the exponent of the largest normal number in the resulting quantized dtype). That subtraction could underflow or overflow, which needs to be checked and clamped later on; therefore I require i32.
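
For illustration, a small standalone C++ example of the range issue being described (the emax value is just an example):

```cpp
#include <cstdio>

// The biased E8M0 scale spans 0..255, so its unbiased value spans -127..127
// (255 encodes NaN). After subtracting emax the result can fall below -128,
// which no longer fits a signed 8-bit value; hence the widening to i32.
int main() {
  int biased = 0;                   // smallest E8M0 encoding
  int emax = 8;                     // e.g. emax of f8E4M3
  int unbiased = biased - 127;      // -127: still fits in i8
  int normalized = unbiased - emax; // -135: underflows i8, exact in i32
  std::printf("unbiased=%d normalized=%d\n", unbiased, normalized);
  return 0;
}
```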

@umangyadav (Contributor Author) commented May 29, 2025:

> This should be an extui.

Thanks. Good catch.

Contributor:

Ok, so, my bigger complaint is that you can simplify the generated code substantially if you just switch on what kind of type you're extending to.

That is, f32 requires nothing: that's already a ±127 situation.

Types shorter than f32 will need the subtraction.

... Also, I'm going to re-read the code, but I'm not convinced this should be subtracting the max normalized exponent. Are we sure it isn't "clamp to the exponent range of the type"?

Contributor:

... Ah, we're subtracting the max exponent of the result type, which can't lead to overflow.

This could be substantially simplified if we just use usub_sat (which we'd need an MLIR arith op for, but that's fairly trivial).
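
For reference, a scalar sketch of the saturating subtraction being proposed (no arith.usub_sat op exists at the time of this review; llvm.usub.sat is the LLVM-level analogue):

```cpp
#include <cstdint>

// Saturating unsigned subtraction: the result floors at zero instead of
// wrapping, which is why the exponent would not need to be unbiased first.
uint8_t usubSat(uint8_t a, uint8_t b) {
  return a > b ? static_cast<uint8_t>(a - b) : static_cast<uint8_t>(0);
}
```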

Contributor:

... But also, the code you linked is for quantization.

I think it's reasonable to assume that someone implementing quantization will already have done the scale-biasing thing, so we don't need to do it here.

Unless we have evidence that the hardware implementations perform the subtraction described here? (We'll probably want to go find the AMD behavior.)

Contributor:

... and if you're doing usub_sat, you don't need to unbias the exponent.

But also, I'd make sure this is something that other implementors of scaling_truncf implement, so we don't get conflicting lowerings.

const llvm::fltSemantics &resultFltSemantics =
llvm::cast<FloatType>(resultETy).getFloatSemantics();
int maxExponent = APFloat::semanticsMaxExponent(resultFltSemantics);
Value cMaxNormalExponent =
Contributor:

Skip all this if we're in f32 or higher?

@umangyadav (Contributor Author):

Rewrote using f32.

Value cmpCond = b.create<arith::CmpIOp>(arith::CmpIPredicate::eq, cI8Zero,
inputExponentU8);
Value inputTyZero = createFloatConst(op.getLoc(), inputTy, 0, rewriter);
Value flushedInput =
Contributor:

This all seems overcomplicated?

This could just be extending the scale to f32?

@umangyadav (Contributor Author):

Rewrote using f32. It does simplify things a bit. Thanks.
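
For context, a small C++ sketch (the helper name is made up) of what extending an f8E8M0FNU scale to f32 yields, which is what the simplified lowering relies on:

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// f8E8M0FNU is a biased-exponent byte (bias 127) with no sign or mantissa
// bits; 0xFF encodes NaN. Extending it to f32 therefore yields 2^(e - 127),
// the power of two that the lowering divides the input by.
float decodeE8M0(uint8_t e) {
  if (e == 0xFF)
    return std::numeric_limits<float>::quiet_NaN();
  return std::ldexp(1.0f, static_cast<int>(e) - 127);
}
```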

Let's say the input originally has shape <dim1 x dim2 x dim3 ... x dimN>; then, given blockSize, it can be reshaped to <dim1 x dim2 x ... x (dimN/blockSize) x blockSize>.
Scales are calculated along the block axis, so the scale will have shape <dim1 x dim2 x dim3 ... x (dimN/blockSize) x 1>.
Before calling into `arith.scaling_extf`, scales must be broadcast appropriately to the same shape as the input, making `arith.scaling_extf` an elementwise op.
In the above example, scales should be broadcast to shape <dim1 x dim2 x dim3 x ... x (dimN/blockSize) x blockSize>.
Contributor:

As I understand from the description, it doesn't need to be broadcast; you could use a non-broadcasted tensor of shape <dim1 x dim2 x dim3 x ... x (dimN/blockSize) x blockSize>?

If that's the case, I don't think it's useful to explain all of these details; broadcasting is just a use-case. If I understood it correctly.

Contributor:

I think the description needs to be updated: this arith op is set up to do things elementwise because arith ops in general are elementwise, and the broadcast-scale thing is a special case that gets pattern-matched in a future ArithToAMDGPU pass.

@umangyadav (Contributor Author):

I tried to rewrite the documentation. Please check again and let me know if it is clearer now.
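
To illustrate the shapes described in the documentation excerpt above, here is a small C++ sketch (the helper is hypothetical, not part of the PR):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// For an input of shape <dim1 x ... x dimN> and a given blockSize, scales are
// produced at shape <dim1 x ... x (dimN/blockSize) x 1> (one per block) and
// must then be broadcast over the trailing axis to
// <dim1 x ... x (dimN/blockSize) x blockSize> so that arith.scaling_extf
// stays elementwise.
std::vector<int64_t> scaleShape(std::vector<int64_t> inputShape,
                                int64_t blockSize) {
  assert(!inputShape.empty() && inputShape.back() % blockSize == 0);
  inputShape.back() /= blockSize; // block the last axis
  inputShape.push_back(1);        // one scale per block
  return inputShape;
}
```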

op, "scaling extf is not using scale operand of type f8E8M0FNU");
}
Type resultTy = op.getType();
// extf on scale will essentially create f32 number that is 2^scale and will
Contributor:

why f32? can't resultTy be any float type?

Contributor:

should we check if resultTy >= Float8E8M0FNU and >= inputType?

Contributor:

In principle, scaled truncation from f32 to f32 is a really weird way to spell division, but we might want to verify it away.

@umangyadav (Contributor Author) commented May 30, 2025:

> why f32? can't resultTy be any float type?

Changed the comment to better reflect what it's doing.

> should we check if resultTy >= Float8E8M0FNU and >= inputType?

As part of verification, it checks that the output dtype has a larger width than the input:
https://github.com/umangyadav/llvm-project/blob/d1543414578abf95a495b4eb6fe9b6201de8e9f6/mlir/lib/Dialect/Arith/IR/ArithOps.cpp#L1460

Value result = b.create<arith::DivFOp>(flushedInput, scaleF32);
// propagate rounding mode and fast math attributes
Value resultCast = b.create<arith::TruncFOp>(
resultTy, result, op.getRoundingmodeAttr(), op.getFastmathAttr());
Contributor:

There are other arith ops; shouldn't we propagate to those as well? Also for ScalingExtFOpConverter.

Contributor:

should we check resultTy <= f32?

@umangyadav (Contributor Author):

> should we check resultTy <= f32?

Verify() checks that the output width is smaller than the input width.

https://github.com/umangyadav/llvm-project/blob/d1543414578abf95a495b4eb6fe9b6201de8e9f6/mlir/lib/Dialect/Arith/IR/ArithOps.cpp#L1587

> There are other arith ops; shouldn't we propagate to those as well? Also for ScalingExtFOpConverter.

No, the other arith.truncf ops are mainly for the scale's dtype conversion, which operates only on the exponent and isn't really affected by rounding mode or fast math.

Contributor:

Yes, verify checks that the output width is smaller than the input width. But I understand the output of this function is always f32. Then, I wonder if somebody can do input, scale -> f128, result -> f64. Then it's true that output width < input width, and we are still trying to truncate "result", which is f32, into f64. Not sure if I misunderstood something?

@umangyadav (Contributor Author):

In practice, Float64/80/128 dtypes are not expected. I think it is safe to assume f32 is the largest dtype that can appear on the input.
The Verify() check is a strict check, so output_bit_width < input_bit_width.
So this would never really be truncating to a wider resultTy in practice.

> But I understand the output of this function is always f32

No, why do you think so? The output dtype will be whatever the user has specified.

@dhernandez0 (Contributor) commented Jun 2, 2025:

> No, why do you think so? The output dtype will be whatever the user has specified.

I mean the result of the function before truncation. result.dtype = f32, right?

> In practice, Float64/80/128 dtypes are not expected. I think it is safe to assume f32 is the largest dtype that can appear on the input.

I think the arith dialect is not supposed to be hardware-specific, so even though it's not expected for us, I'd prefer to enforce or check the assumption somehow. But it seems ok to me anyway; whatever you decide.

@pashu123 (Member) left a comment:

Some minor nits! LGTM. I'll wait for @krzysz00.

PatternRewriter &rewriter) {
auto attr = rewriter.getFloatAttr(getElementTypeOrSelf(type), value);
if (auto shapedTy = dyn_cast<ShapedType>(type)) {
return rewriter.create<arith::ConstantOp>(
Member:

We can update the attr here: attr = DenseElementsAttr::get(shapedTy, attr). It will return the right thing. (Both are fine to me.)
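
A sketch of the suggested variant (assuming the surrounding helper and headers from this PR; the helper name mirrors the excerpt above):

```cpp
// Splat the scalar attribute across the shaped type up front so that a single
// arith::ConstantOp creation path covers both the scalar and shaped cases.
static Value createFloatConst(Location loc, Type type, double value,
                              PatternRewriter &rewriter) {
  Attribute attr = rewriter.getFloatAttr(getElementTypeOrSelf(type), value);
  if (auto shapedTy = dyn_cast<ShapedType>(type))
    attr = DenseElementsAttr::get(shapedTy, attr);
  return rewriter.create<arith::ConstantOp>(loc, cast<TypedAttr>(attr));
}
```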

// emax is calculated as the exponent of the largest normal value in the quantized type.
scale.normalize = arith.divf(scale.extf, emax)
scale.clamped = clamp(scale.normalize) // clamp underflows
input.flushed = flush_denorms(input)
Contributor:

There are some type conversions for input and scale that are not explained here. Not sure if we want all those details here?

@umangyadav (Contributor Author):

IMO, that would be more detail than necessary.


@dhernandez0 (Contributor) left a comment:

LGTM

@krzysz00 (Contributor) commented Jun 3, 2025:

Note: I'd rather we not land this just yet, because I'm still waiting to find out whether potential hardware-specific lowerings of arith.scaling_truncf will perform the exponent subtraction that this code does.

I have a suspicion that the answer is "no": that adjustment is part of the scale computation process, not the scale application process, and so the semantics of scaling_truncf shouldn't include it.

@krzysz00 (Contributor) left a comment:

Hold for semantics questions, and @llvm/pr-subscribers-mlir-nvgpu for input on Nvidia semantics while I wait on AMD answers.
