Grouped GEMM with ck_tile #434

Open
matthiasdiener wants to merge 58 commits into dev from ck-grouped-gemm
Conversation

matthiasdiener (Contributor) commented on Jan 28, 2026

Description

See https://github.com/ROCm/frameworks-internal/issues/15185 and https://github.com/ROCm/frameworks-internal/issues/13792 for context.

Primus-Turbo implementation: https://github.com/AMD-AGI/Primus-Turbo/blob/5bcd13785ef380fec0eec0911b7d6db5e606143e/csrc/kernels/grouped_gemm

TODOs:

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Implement ck_tile-based grouped GEMM, similar to the Cutlass implementation
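For reviewers unfamiliar with the operation: a grouped GEMM runs one independent GEMM per group, where each group may have its own shapes. A minimal NumPy sketch of the semantics (illustrative only, not the ck_tile API):

```python
import numpy as np

def grouped_gemm_ref(A_list, B_list):
    """Reference semantics: one independent GEMM per group.

    Each group may have its own (M, N, K); only K must match
    between A[i] and B[i] within a group.
    """
    return [A @ B for A, B in zip(A_list, B_list)]

# Two groups with different M/N/K.
rng = np.random.default_rng(0)
A_list = [rng.standard_normal((4, 8)), rng.standard_normal((3, 5))]
B_list = [rng.standard_normal((8, 6)), rng.standard_normal((5, 7))]
D_list = grouped_gemm_ref(A_list, B_list)
print([d.shape for d in D_list])  # [(4, 6), (3, 7)]
```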

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@matthiasdiener matthiasdiener self-assigned this Jan 28, 2026
@matthiasdiener matthiasdiener changed the title from "[WIP] proof-of-concept: grouped GEMM with ck_tile" to "[WIP] Grouped GEMM with ck_tile" on Jan 29, 2026
@matthiasdiener matthiasdiener marked this pull request as ready for review February 17, 2026 22:58
)
if IS_HIP_EXTENSION:
from transformer_engine.pytorch.utils import is_mi200, is_mi308
from transformer_engine.pytorch.utils import is_mi200, is_mi308, is_mi300_class
Collaborator:

The is_mi300_class method is not needed; it is just the gfx 9.4 family.

Contributor Author:

Removed in 7910038

@@ -0,0 +1,276 @@
/* Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. */
Collaborator:

Add proper copyright header

Contributor Author:

Thanks, done in f680d6a

@@ -0,0 +1,11 @@
/* Copyright (c) 2026, Advanced Micro Devices, Inc. All rights reserved. */
Collaborator:

Put proper copyright header

Contributor Author:

Thanks, done in f680d6a

size_t workspace_bytes,
hipStream_t stream) {

// FIXME: This could be a templated lambda function in C++20.
Collaborator:

As an alternative, dispatch_grouped can be incorporated into ck_tile_grouped_gemm using nested TRANSFORMER_ENGINE_SWITCH_CONDITION.
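For context, the pattern the reviewer suggests: a macro that exposes a runtime bool as a compile-time constant so both branches get instantiated. A self-contained sketch with a simplified stand-in macro (the real TRANSFORMER_ENGINE_SWITCH_CONDITION may differ in detail; all type names here are illustrative):

```cpp
#include <type_traits>

// Simplified stand-in for TRANSFORMER_ENGINE_SWITCH_CONDITION (hypothetical
// macro body): expose a runtime bool as a compile-time constant to the
// enclosed code, instantiating both branches.
#define SWITCH_CONDITION(cond, kConst, ...) \
  if (cond) {                               \
    constexpr bool kConst = true;           \
    __VA_ARGS__                             \
  } else {                                  \
    constexpr bool kConst = false;          \
    __VA_ARGS__                             \
  }

struct RowMajor {};
struct ColMajor {};

// Stand-in for a layout-specialized kernel instantiation.
template <typename ALayout, typename BLayout>
int layout_id() {
  return (std::is_same<ALayout, ColMajor>::value ? 2 : 0) +
         (std::is_same<BLayout, ColMajor>::value ? 1 : 0);
}

// Nested use, as suggested for folding dispatch_grouped into
// ck_tile_grouped_gemm: runtime (transA, transB) select compile-time layouts.
int dispatch(bool transA, bool transB) {
  int result = -1;
  SWITCH_CONDITION(transA, kTransA, {
    using ALayout = std::conditional_t<kTransA, ColMajor, RowMajor>;
    SWITCH_CONDITION(transB, kTransB, {
      using BLayout = std::conditional_t<kTransB, ColMajor, RowMajor>;
      result = layout_id<ALayout, BLayout>();
    });
  });
  return result;
}
```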

Contributor Author:

What do you think of 6d85088?

Contributor Author:

I misread your initial comment, c5d83a4 merges dispatch_grouped and ck_tile_grouped_gemm.
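For reference, the pattern behind the FIXME in the diff above: before C++20, a generic lambda takes a value tag and recovers its type with decltype; a C++20 templated lambda could name the type directly. A standalone sketch with hypothetical tile-config tags:

```cpp
// Hypothetical tile-config tags standing in for the real ck_tile configs.
struct TileCfg_256x256x64 { static constexpr int tile_n = 256; };
struct TileCfg_256x128x64 { static constexpr int tile_n = 128; };

int select_tile_n(int N) {
  // Pre-C++20 pattern, as used in the PR code: generic lambda + decltype(tag).
  auto run = [](auto tile_tag) {
    using TileCfg = decltype(tile_tag);
    return TileCfg::tile_n;
  };
  // In C++20 the value tag becomes unnecessary; a templated lambda can name
  // the type directly:  []<typename TileCfg>() { return TileCfg::tile_n; }
  if (N % 256 == 0) return run(TileCfg_256x256x64{});
  return run(TileCfg_256x128x64{});
}
```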

if (!transA_use && !transB_use) { CALL(RowMajor, RowMajor, false, false); }
if (!transA_use && transB_use) { CALL(RowMajor, ColMajor, false, true ); }
if ( transA_use && !transB_use) { CALL(ColMajor, RowMajor, true, false); }
/* transA_use && transB_use */ { CALL(ColMajor, ColMajor, true, true ); }
Collaborator:

NV upstream does not support TT; do we support TT?

Contributor Author:

We do, yes.

}

template <typename T, typename CLayout, ck_tile::memory_operation_enum MemOp>
static inline bool dispatch_grouped(bool transA_use,
Collaborator:

Why is a separate function needed?

Contributor Author (matthiasdiener, Feb 26, 2026):

Not strictly needed, but merging dispatch_grouped and run_grouped_impl makes the resulting function very complex, and this complexity will likely increase when we add FP8 support.

Here is how it would look:

Details:
template <typename T, typename CLayout, ck_tile::memory_operation_enum MemOp>
static inline bool dispatch_grouped(bool transA_use,
                                    bool transB_use,
                                    const transformer_engine::Tensor* const* A_use,
                                    const transformer_engine::Tensor* const* B_use,
                                    transformer_engine::Tensor* const* D,
                                    int group_num,
                                    void* workspace,
                                    size_t workspace_bytes,
                                    hipStream_t stream) {

  int64_t ref_d0 = 0, ref_d1 = 0;
  if (!get_flat_2d_dims(*D[0], ref_d0, ref_d1)) {
    NVTE_ERROR("ck_tile_grouped_gemm: expected rank>=2 for D[0]");
    return false;
  }
  const ck_tile::index_t N = static_cast<ck_tile::index_t>(ref_d1);

  auto run_with_tilecfg = [&](auto tile_tag) -> bool {
    using TileCfgSel = decltype(tile_tag);
    TRANSFORMER_ENGINE_SWITCH_CONDITION(transA_use, kTransA, {
      using ALayout = std::conditional_t<kTransA, ColMajor, RowMajor>;

      TRANSFORMER_ENGINE_SWITCH_CONDITION(transB_use, kTransB, {
        using BLayout = std::conditional_t<kTransB, ColMajor, RowMajor>;

        using Kernel = typename Runner<T, T, T, ALayout, BLayout, CLayout, TileCfgSel, MemOp>::Kernel;

        const size_t needed = Kernel::GetWorkSpaceSize(group_num);
        if (!workspace || workspace_bytes < needed) {
          NVTE_ERROR("ck_tile_grouped_gemm: insufficient workspace. Needed bytes=", needed);
          return false;
        }

        thread_local std::vector<ck_tile::GroupedGemmHostArgs<0>> descs;
        descs.clear();
        descs.reserve(group_num);

        for (int i = 0; i < group_num; ++i) {
          const auto& a = data_view(*A_use[i]);
          const auto& b = data_view(*B_use[i]);
          const auto& d = data_view(*D[i]);

          int64_t Ad0 = 0, Ad1 = 0, Bd0 = 0, Bd1 = 0, Dd0 = 0, Dd1 = 0;
          if (!get_flat_2d_dims(*A_use[i], Ad0, Ad1) ||
              !get_flat_2d_dims(*B_use[i], Bd0, Bd1) ||
              !get_flat_2d_dims(*D[i],     Dd0, Dd1)) {
            NVTE_ERROR("ck_tile_grouped_gemm: expected all groups to be rank>=2 (2D or higher).");
            return false;
          }

          const int64_t M  = transA_use ? Ad1 : Ad0;
          const int64_t K  = transA_use ? Ad0 : Ad1;
          const int64_t N  = transB_use ? Bd0 : Bd1;
          const int64_t Kb = transB_use ? Bd1 : Bd0;

          if (Kb != K) {
            NVTE_ERROR("ck_tile_grouped_gemm: K mismatch between A and B in group ", i);
            return false;
          }

          if (Dd0 != M || Dd1 != N) {
            NVTE_ERROR("ck_tile_grouped_gemm: D shape mismatch in group ", i);
            return false;
          }

          // Leading dimensions under the flattened-contiguous interpretation
          const ck_tile::index_t stride_A = Ad1;
          const ck_tile::index_t stride_B = Bd1;
          const ck_tile::index_t stride_E = Dd1;

          descs.emplace_back(
              a.dptr,
              b.dptr,
              std::array<const void*, 0>{},
              d.dptr,
              1,
              M,
              N,
              K,
              stride_A,
              stride_B,
              std::array<ck_tile::index_t, 0>{},
              stride_E);
        }

        const dim3 grids = Kernel::GridSize(descs);
        auto kargs = Kernel::MakeKargs(descs);
        if (!Kernel::IsSupportedArgument(kargs)) {
          NVTE_ERROR("ck_tile_grouped_gemm: CK_Tile kernel arguments not supported for this config.");
          return false;
        }

        HIP_CHECK_ERROR(hipMemcpyAsync(workspace,
                                      kargs.data(),
                                      kargs.size() * sizeof(typename decltype(kargs)::value_type),
                                      hipMemcpyHostToDevice,
                                      stream));

        const ck_tile::stream_config s{stream};
        const dim3 blocks = Kernel::BlockSize();

        ck_tile::launch_kernel(
            s,
            ck_tile::make_kernel<1>(
                Kernel{}, grids, blocks, 0,
                ck_tile::cast_pointer_to_constant_address_space(workspace),
                group_num));
        return true;
      });
    });
  };

  // Select tile config like Primus-Turbo for FP16/BF16:
  //   N%256 -> 256x256x64
  //   N%128 -> 256x128x64
  //   else  -> 256x128x64 padding
  // NOTE: We assume N is uniform across groups.
  if ((N % 256) == 0) {
    return run_with_tilecfg(TileCfg_256x256x64{});
  } else if ((N % 128) == 0) {
    return run_with_tilecfg(TileCfg_256x128x64{});
  } else {
    return run_with_tilecfg(TileCfg_256x128x64_padding{});
  }
}

Which one do you prefer?

Contributor Author:

I found a way to merge dispatch_grouped and ck_tile_grouped_gemm instead. Implemented in c5d83a4
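As an aside, the Primus-Turbo-style tile selection in the snippet above reduces to a simple divisibility rule; a Python sketch (config names mirror the tags quoted above, for illustration only):

```python
def select_tile_cfg(N: int) -> str:
    """Mirror of the N-based tile-config rule in the quoted code:
    N % 256 == 0 -> 256x256x64, N % 128 == 0 -> 256x128x64,
    otherwise the padded 256x128x64 variant.
    Assumes N is uniform across groups, as the quoted code notes."""
    if N % 256 == 0:
        return "TileCfg_256x256x64"
    if N % 128 == 0:
        return "TileCfg_256x128x64"
    return "TileCfg_256x128x64_padding"

print(select_tile_cfg(512))  # TileCfg_256x256x64
print(select_tile_cfg(384))  # TileCfg_256x128x64
print(select_tile_cfg(100))  # TileCfg_256x128x64_padding
```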

#else
const int current_device = transformer_engine::cuda::current_device();
const bool is_hopper = (transformer_engine::cuda::sm_arch(current_device) == 90);
const int current_device = transformer_engine::cuda::current_device();
Collaborator:

Please restore indent

Contributor Author:

Done in 98e0c66

const int current_device = transformer_engine::cuda::current_device();
const bool is_hopper = (transformer_engine::cuda::sm_arch(current_device) == 90);
#endif
const bool use_cutlass = transformer_engine::getenv<bool>("NVTE_USE_CUTLASS_GROUPED_GEMM", false);
Collaborator:

I wonder, should we use a different env var name on ROCm? Or it should be well documented what CUTLASS means on ROCm.

Collaborator:

Previously Matthias had a different env var. I left this comment to suggest using the same env var as NV upstream, since I recall CK is meant to be a drop-in replacement for Cutlass?

Maybe we can explain this in README?

Contributor Author:

I added a paragraph in the README in 7b1dbfa, what do you think?

Added information about CK_Tile-based grouped GEMM implementation and how to enable it.