[Blackwell] Refactor/slightly generalize warp specialization #6597
Conversation
…ry for scales This ensures scales produced by TMA are eligible for transfer to tensor memory in later lowering. git-pr-chain: csullivan/support_desc_load_tmem_copy
… for tl.dot_scaled This enables automatic warp specialization for block scaled workloads. git-pr-chain: csullivan/support_block_scales_in_warp_spec
…n_warp_spec' into mogball/fmha
Looks good!
passes.ttgpuir.add_warp_specialize(pm, opt.num_stages)
passes.ttgpuir.add_pipeline(pm, opt.num_stages, dump_enabled)
passes.ttgpuir.add_combine_tensor_select_and_if(pm)
# added: promote the dot LHS operand into tensor memory (tmem)
nvidia.passes.ttnvgpuir.add_promote_lhs_to_tmem(pm)
I needed this same change for getting FA to run with WarpSpec on Blackwell.
@@ -100,7 +100,8 @@ DenseMap<Operation *, int> deserializeLatencies(Operation *op);
 Value createScalarAlloc(ImplicitLocOpBuilder &rewriter, Type type,
                         unsigned numBuffers);
 // Create an allocation and init the mbarriers.
-Value createBarrierAlloc(scf::ForOp forOp, int numBarriers);
+Value createBarrierAlloc(scf::ForOp forOp, int numBarriers,
+                         int arriveCount = 1);
Is this part of the refactoring, or is it addressing a separate issue?
This is part of the refactor. Load groups can have multiple consumers, so a group's barriers may need an arrive count greater than one.
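A minimal sketch of how the new parameter would be used, assuming a hypothetical per-group consumer count:

// Hypothetical sketch: when a load group's buffers feed more than one
// consumer partition, the barriers are initialized so the producer only
// reuses a buffer after every consumer has arrived on it.
int numConsumers = 2; // hypothetical: e.g. two partitions read this group
Value barriers =
    createBarrierAlloc(forOp, numBuffers, /*arriveCount=*/numConsumers);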
@@ -297,6 +298,8 @@ mlir::triton::getDefinitionAndDistance(scf::ForOp forOp, Value value) {
       return {nullptr, 0};
     ++distance;
     value = forOp.getYieldedValues()[arg.getArgNumber() - 1];
+    if (!seen.insert(value).second)
+      return {nullptr, 0};
   }
This also doesn't feel like refactoring :]
Some of the refactoring exposed a bug :P
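To spell out the failure mode (my reading of the fix above):

// Sketch of the cycle the `seen` set guards against: two iter_args that
// yield each other never resolve to a definition outside the loop.
//
//   scf.for ... iter_args(%a = %x, %b = %y) {
//     ...
//     scf.yield %b, %a   // %a <- %b and %b <- %a every iteration
//   }
//
// Walking %a's yielded value gives %b, walking %b gives %a, and so on;
// seen.insert(value) returns false on the revisit, and the walk now bails
// out with {nullptr, 0} instead of looping forever.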
Nice!
This is basically a rewrite of LoadMMASpecialization that cleanly separates the warp specialization of loads and MMAs into discrete steps. Loads and MMAs can now be specialized independently of each other, and an arbitrary number of load groups and MMAs is supported. For now, all loads and MMAs are still placed in the same partitions.
In addition, this PR separates the actual partition assignment from the multibuffering and loop-lowering step, as in software pipelining (SWP). This should make it easier to tweak partitioning strategies.
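A rough sketch of the staged flow this describes (function and type names here are illustrative, not the actual pass entry points):

// Hypothetical outline of the two-step structure described above.
void runWarpSpecialization(scf::ForOp forOp) {
  // Step 1: decide which partition each load group and MMA belongs to.
  // No IR is rewritten yet, so partitioning strategies can be swapped
  // out independently of the lowering below.
  PartitionSchedule schedule = assignPartitions(forOp); // hypothetical
  // Step 2: multibuffer and lower the loop according to the schedule,
  // mirroring how SWP separates scheduling from rewriting.
  lowerLoadsAndMMAs(forOp, schedule); // hypothetical
}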