[LAYOUTS] Generic stmatrix lowering #6609

lezcano · 2025-04-25T14:01:36Z

We use divideLeft to lower a generic local_store using stmatrix
whenever possible.

We implement ColumnAction as a helper that allows us to permute
the bases and values and remove broadcasting in a generic way.
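The idea of permuting bases and values together, while dropping broadcast (zero) bases, can be illustrated with a small standalone sketch. `BasisAction` below is a hypothetical stand-in written for illustration; it is not Triton's `ColumnAction` and makes no claim about its API. It assumes one flat list of register bases and a register count of 2^(number of bases).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only (not Triton's ColumnAction).
// perm[i] names which OLD basis becomes NEW basis i. Old bases that are not
// listed (for example zero bases, i.e. broadcasting) are dropped.
struct BasisAction {
  std::vector<int> perm;

  // Reorder, and possibly drop, the basis vectors.
  std::vector<uint32_t> applyToBases(const std::vector<uint32_t> &bases) const {
    std::vector<uint32_t> out;
    for (int oldIdx : perm) {
      assert(oldIdx >= 0 && oldIdx < static_cast<int>(bases.size()));
      out.push_back(bases[oldIdx]);
    }
    return out;
  }

  // Reorder the per-register values to match: bit i of the new register
  // index corresponds to bit perm[i] of the old register index.
  std::vector<int> applyToValues(const std::vector<int> &values) const {
    std::vector<int> out(std::size_t{1} << perm.size());
    for (std::size_t newIdx = 0; newIdx < out.size(); ++newIdx) {
      std::size_t oldIdx = 0;
      for (std::size_t i = 0; i < perm.size(); ++i)
        if ((newIdx >> i) & 1)
          oldIdx |= std::size_t{1} << perm[i];
      assert(oldIdx < values.size());
      out[newIdx] = values[oldIdx];
    }
    return out;
  }
};
```

With bases `{4, 0, 1}` and `perm = {2, 0}`, the zero basis at position 1 is dropped and the remaining two are swapped, so the same action can be applied consistently to the layout and to the register values it describes.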

The current codegen for stmatrix has some limitations, so we just
return early in those cases. We'll fix those in a future PR.

peterbell10 · 2025-04-25T15:38:28Z

third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/MemoryOpToLLVM.cpp

+  auto regBase = applyLinearLayout(loc, rewriter, quot,
+                                   {{kReg, b.lshr(laneId, b.i32_val(3))},
+                                    {kLane, b.and_(laneId, b.i32_val(0x7))},
+                                    {kWarp, warpId}})[0]


What's going on here, seems like some abuse is going on?

abuse? This is just implementing the address map that ldmatrix asks for:

In particular, the lower three bits of the lane should map to the columns, while the top two bits should map to the 4 different matrices (given by the first 2 bases of the reps of quot). I'll write a comment.

But you're setting the register index based on the lane id so clearly you're abusing the labels kReg and kLane to mean something else. That's pretty confusing.

I'm a bit concerned that this new feature that is supposed to simplify all our lowerings is resulting in (for me) unreadable code.

Okay, I understand now. When using .x4 you pass the pointer corresponding to the row for register T0:r[i] in T(0 + 8*i):p so the lane really is giving the register index. Makes sense. I think I would restate your comment though as:

// Here we implement the stmatrix.x4 addressing. In particular, the row pointers
// for each submatrix r in thread t are communicated by the stmatrix call in
// laneId = (t // 8) + (8 * r), so r = laneId // 8 and t = laneId % 8.
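The lane split discussed in this thread (top two bits select one of the 4 submatrices, low three bits select the index within it) mirrors the `lshr(laneId, 3)` and `and_(laneId, 0x7)` in the diff. A minimal standalone sketch, with `X4Address` and `decodeLane` as hypothetical names introduced here:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: which address a lane supplies in the .x4 form.
struct X4Address {
  uint32_t submatrix;        // top 2 bits of the lane id (0..3)
  uint32_t indexInSubmatrix; // low 3 bits of the lane id (0..7)
};

// Split a lane id the same way the diff does with lshr / and_.
inline X4Address decodeLane(uint32_t laneId) {
  assert(laneId < 32 && "a warp has 32 lanes");
  return {laneId >> 3, laneId & 0x7};
}
```

For example, lane 13 decodes to submatrix 1, index 5, so lanes 8-15 all address the second submatrix.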

I rewrote it in a third way, tell me WDYT.

lezcano · 2025-04-25T15:47:10Z

fwiw, before merging, I'll implement something so that we only lower via this path if the lowering would have no bank conflicts. Just to make sure we don't over-use this function. I also still need to run benchmarks.

third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/MemoryOpToLLVM.cpp

Jokeren

Looks good to me. The only comment I have is that permuteInDimToFront can be very useful in other places where we need register permutation, so it's probably better to put it in LayoutUtils.h

Jokeren · 2025-04-26T03:12:50Z

Also I think stmatrix use in convertlayout op lowering hasn't been replaced yet. Could be done in another PR though

peterbell10 · 2025-04-28T21:31:28Z

lib/Tools/LayoutUtils.cpp

+}
+
+std::optional<ColumnAction> actionDivideLeft(const LinearLayout &A,
+                                             const LinearLayout &B) {


IIUC this is more like findRegPermutationThatDividesLeft?

sure, I was just looking for a shorter name as this function is going to be used quite a bit. I was also thinking of packing this + the division and action on the regs into a helper function. Might do that tomorrow.

chose a shorter but representative name

peterbell10 · 2025-04-28T21:36:56Z

third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/MemoryOpToLLVM.cpp

+  // first submatrix, threads 8-15 for the second submatrix, etc. In general we
+  // map:
+  // - The lowest 3 bits of the thread id to the columns of each submatrix
+  // - The top 2 bits to the submatrix number (which is indexed by the next 2


I don't find this very helpful because it relates thread id to columns of the matrix, but not to register and lane id which is what you're giving to the linear layout. So this comment is correct, but doesn't explain the code at all.

The mapping of the 2 bits into the registers is explained in the parentheses on the line you commented on.
The only thing that's missing is noting that each column i starts with thread t[4*i], and since the quotient has already removed the 4 threads, on this layout the map is i -> t[i].
The second part is a bit redundant once you have intuition for what divideLeft does tho.

Would you want me to add that?

Ah so kLane refers to the lane id with two bases removed such that it's not really a lane id any more. This is why you refer to them by different names now in the comment, but not in the code.

Perhaps it would be less confusing to always act on the "reps" layout? idk

In this case that'd be equivalent to the current state, generating the same number of ops.

Now, the better codegen would be to simply create a new layout composed of:

- 3 zero bases and the first vec non-zero bases of `quot[kReg]` (2 in this case)
- The first 3 non-zero bases of `quot[kLane]`
- All the warp bases

We can call these dimensions kVec, kCol, kWarp respectively

This would make sure that you can just pass

applyLinearLayout(loc, rewriter, quot, {{kVec, laneId}, {kCol, laneId}, {kWarp, warpId}})[0]

This linear layout has only 5 ones in the columns of kReg and kLane, so it would generate a pretty low op count (note that now there are more ones in the kReg dim as we didn't trim them. LLVM may be able to optimise these, but who knows)
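To see why the number of ones drives the op count: applying a linear layout to an index amounts to XOR-ing together the basis vectors selected by the set bits of that index. The sketch below is a plain C++ illustration of that idea, not Triton's `applyLinearLayout`. The basis values are made up, the warp term is omitted, and `applyBases` and `laneOffset` are hypothetical names. It also assumes that index bits beyond the number of bases are ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR together the basis vectors selected by the set bits of `index`.
// Bits of `index` beyond bases.size() are ignored (an assumption of this sketch).
inline uint32_t applyBases(const std::vector<uint32_t> &bases, uint32_t index) {
  uint32_t out = 0;
  for (std::size_t i = 0; i < bases.size(); ++i)
    if ((index >> i) & 1)
      out ^= bases[i];
  return out;
}

// Feed the same laneId to both dimensions, as in the proposed call:
// vecBases starts with 3 zero bases, so it only reacts to the top 2 lane
// bits; colBases has 3 bases, so it only reacts to the low 3 lane bits.
inline uint32_t laneOffset(const std::vector<uint32_t> &vecBases,
                           const std::vector<uint32_t> &colBases,
                           uint32_t laneId) {
  return applyBases(vecBases, laneId) ^ applyBases(colBases, laneId);
}
```

With hypothetical bases `{0, 0, 0, 64, 128}` and `{8, 16, 32}`, only five bases are non-zero, so at most five XOR terms contribute for any lane.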

I'm leaving things as-is for now tho. The main thing to note is that, after dividing by a layout, the labels on the resulting layout do not refer to the hardware coords (nor do the offsets on the output refer to the actual offsets); they refer to equivalence classes.

In CS terms, this is like having a float ptr, where incrementing the index by 1 with operator[] moves you 4 bytes. A division is then like casting it to float4 (grouping a tile of 4 together). The result of the division tells you how to move around in the float4 world, where you have collapsed 4 floats into one element. In particular, in this world, moving by 1 moves you 16 bytes.
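The float / float4 analogy can be written down directly. This is only a sketch of the analogy, with `Float4`, `byteOffset` and `tileToElement` as hypothetical names; it assumes a 4-byte `float`, which the `static_assert` checks.

```cpp
#include <cstddef>

// A tile of 4 floats collapsed into one element.
struct Float4 {
  float x, y, z, w;
};

static_assert(sizeof(float) == 4, "sketch assumes 4-byte floats");
static_assert(sizeof(Float4) == 16, "sketch assumes a packed tile of 4 floats");

// Byte offset reached by indexing `index` elements of type T.
template <typename T>
constexpr std::size_t byteOffset(std::size_t index) {
  return index * sizeof(T);
}

// Convert an index in the float4 (quotient) world back to a float index by
// shifting left by log2 of the tile size.
constexpr std::size_t kTileLog2 = 2;
constexpr std::size_t tileToElement(std::size_t tileIdx) {
  return tileIdx << kTileLog2;
}
```

Moving by 1 in the `float` world is 4 bytes, moving by 1 in the `Float4` world is 16 bytes, and going back from a tile index to an element index is a left shift by the tile's log2 size.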

This is what we are doing but with a more complex tile, where moving 1 over a thread moves us 8 offsets (i.e. one full tile, as per the line regBase = b.shl(regBase, b.i32_val(tile.getTotalOutDimSizeLog2()));)

nvm, I implemented it. Now the codegen should be better (on our end at least) and hopefully the explanation for why we create the layout should be clearer. WDYT

lezcano · 2025-04-29T10:45:19Z

third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/MemoryOpToLLVM.cpp

-  if (!sharedLayout)
+
+  // Inter block stmatrix is not supported
+  if (cvt.hasInDim(kBlock))


cc @peterbell10
There was a test in test_tensor_descriptor.py that was not passing with 2 CTAs. With this line, it now passes. This line says that we just bail out if there's any inter-CTA business going on. It seems to work just fine as long as you are addressing things within your own CTA, even without any map or anything.

lezcano · 2025-04-29T14:24:03Z

This is ready for review:

- It's missing benchmarking, will do tomorrow
- I'll do the convertlayout port in a different PR, it should be easy.
- The generalisation of stmatrix to non bf16/f16, other vectorizations, and transpose, I'll do in a different PR

peterbell10

My remaining comments are non-blocking, we can chat offline.

lezcano requested review from Jokeren and ptillet as code owners April 25, 2025 14:01

lezcano force-pushed the stmatrix branch from 78180b7 to f3c82ab Compare April 25, 2025 14:04

lezcano changed the title ~~[LAYOUTS] Lower stmatrix generically~~ [LAYOUTS] Generic stmatrix lowering Apr 25, 2025

peterbell10 reviewed Apr 25, 2025

Jokeren reviewed Apr 26, 2025

lezcano requested review from antiagainst and zhanglx13 as code owners April 28, 2025 15:54

peterbell10 reviewed Apr 28, 2025

lezcano added 7 commits April 29, 2025 10:57

[LAYOUTS] Lower stmatrix generically

3749c50

We use `divideLeft` to lower a generic `local_store` using `stmatrix` whenever possible. The current codegen for `stmatrix` has some limitations, so we just return early in those cases. We'll fix those in a future PR.

add lit tests

87c4d1c

Exit on Ampere

b55b9cf

comment

be27a95

Implement and test ColumnAction.

0b47bc5

Generalise lowering by accepting permuted register layouts Simplify the lowering by using ColumnAction.

remove prints

068c6cc

Relax the 2CTA condition

9072bb0

lezcano force-pushed the stmatrix branch from f9ecc63 to 9072bb0 Compare April 29, 2025 10:43

lezcano commented Apr 29, 2025

lezcano added 3 commits April 29, 2025 12:33

fix

6e59def

Improve and clarify addressing

ed42f27

Fix comments and rename function

5711624

fix

d7d2525

peterbell10 approved these changes Apr 29, 2025
