Skip to content

Split grouped conv fwd instances#3449

Merged
wj-laskowski merged 1 commit intostreamhpc/grouped-conv-fwd-wmmafrom
streamhpc/grouped-conv-fwd-wmma-split-conv-fwd-instances
Dec 17, 2025
Merged

Split grouped conv fwd instances#3449
wj-laskowski merged 1 commit intostreamhpc/grouped-conv-fwd-wmmafrom
streamhpc/grouped-conv-fwd-wmma-split-conv-fwd-instances

Conversation

@wj-laskowski
Copy link
Contributor

Proposed changes

Please describe the motivation behind the pull request, whether it enables a new feature or fixes a bug. If there are associated pull requests or issues, please link them to the pull request.

Checklist

Please put an x into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask.

  • I have added tests relevant to the introduced functionality, and the unit tests are passing locally
  • I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, IF the test takes more than 30 seconds to run.
  • I have added inline documentation which enables the maintainers with understanding the motivation
  • I have removed the stale documentation which is no longer relevant after this pull request
  • (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request
  • I have run clang-format on all changed files
  • Any dependent changes have been merged

Discussion

If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered

@wj-laskowski wj-laskowski changed the title Streamhpc/grouped conv fwd wmma split conv fwd instances Split grouped conv fwd instances Dec 17, 2025
@krithalith krithalith force-pushed the streamhpc/grouped-conv-fwd-wmma branch from a69b386 to 147d0f6 Compare December 17, 2025 13:14
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-split-conv-fwd-instances branch from 5082933 to 3f20be1 Compare December 17, 2025 13:29
@krithalith krithalith self-requested a review December 17, 2025 14:09
Copy link
Contributor

@krithalith krithalith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, glad we got rid of the sharding system.

@krithalith krithalith force-pushed the streamhpc/grouped-conv-fwd-wmma branch from 147d0f6 to 1983e16 Compare December 17, 2025 14:31
@wj-laskowski wj-laskowski force-pushed the streamhpc/grouped-conv-fwd-wmma-split-conv-fwd-instances branch from a3094f9 to f304498 Compare December 17, 2025 14:38
@wj-laskowski wj-laskowski marked this pull request as ready for review December 17, 2025 14:38
@wj-laskowski wj-laskowski merged commit 4fca16d into streamhpc/grouped-conv-fwd-wmma Dec 17, 2025
5 of 6 checks passed
@wj-laskowski wj-laskowski deleted the streamhpc/grouped-conv-fwd-wmma-split-conv-fwd-instances branch December 17, 2025 14:38
afagaj pushed a commit that referenced this pull request Dec 18, 2025
…r RDNA3/4 (#2964)

* Fixed typos for padded instances

* Added tests for fp16, KM_KN and KM_NK

* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the  missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build erros

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* removed unnecessary ck parts from compilation

* initial gemm_add_multiply instance implementations

* fixed profiler help message for gemm_add_multiply

* improved multiply_add profiler layout help

* fixed template arguments for test instances

* added test for gemm_add_multiply

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* switched to splitK interface

* log print added to splitk benchmarks

* revert main cmake comments

* newline change reverted

* added add_fastgelu instances

* revert unintended change in xdl add_fastgelu

* created gemm_add_add_fastgelu instances

* created fastegelu instances

* added tests for all splitk fastgelus

* Added tests.

* multiply_add instances created

* updates to add_multiply splitk instances

* splitk xdl test fixes

* added wmma multiply_multiply instances

* fixed ONLY_XDL_AND_WMMA_KERNELS tag

* Added gemm_add examples for wmma v1 and v3

* fixed / workarounded i8 instances

* Modified the v3 code to added one fp16 bxdl instance.

* added bf16 xdl instance.

* adding gemm_add wmma_cshuffle and other support


(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* add instances into camkelists


(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* This is work in progress, edited the template parameters in order to build

(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype


(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* added datatype and use clang-format-12


(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* Fixing build errors

* Added instances for v3

* Adding instances and executables

* Code update of template parameters modified.

* Renamed file.

* Added tests.

* resolved error tests.

* Fixing build errors

* Updated comments

* removed the changes as per the MR review comment.

* Updated tests.

* fp8 instances - not tested

* Restored the Cmake file that was reverted by mistake during rebase.

* fixed wmma_op test

* Updated comments.

* Updated the template parameter description

* fixed rdna4 instances

* fixed back compatibility on gfx11

* cleanups

* fix ckProfiler

* one more cmake fix

* added fp8 instances

* Updated tests to ad BF16 instances as per review comment

* Added include file and cleaned up(as per review comment)

* Updated and optimized the example code for all types.

* Fixed clang format

* Resolve "Implement `device_gemm_bilinear` for RDNA4"

* test generalization to handle FP16 shuffle better

* added missing changes

* Added bf16 wmma instance for add_relu

* Added f16 wmma instance and corrected bf16 instance errors.

* Added instances to Cmake

* Modified the template parameters to make the instances work.

* Fixed typo in profiler

* Added v3 instances for gemm_add_relu

* addressed core review comments

* Added test for gemm_add_relu wmma instance

* Cleaned up the code.

* Added examples for gemm_add_relu

* Fixing typo to resolve build errors.

* Fixes applied to fix  the precision loss.

* fix billinear test after merge

* Removed the old wmma instances.

* Added wrapper and renamed the wmma_v3 instances

* Updated copyrights and added wrappers.

* Fixes applied according to review comments

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Robin Voetter <robin@streamhpc.com>

* Removed the old wmma instances.

* Updated wrapper for the v3 instances

* removed the old wmma examples

* Renamed the v3 instances

* Deleted the  gtest file added by mistake.

* Updated thge profiler with wrapper

* Fixed test errors.

* Fixed the review comments

* Fixed the if condition MACROS.

* REVERTED THE PROFILER CHANGES

* Revert "REVERTED THE PROFILER CHANGES"

This reverts commit 21cb985.

* Revert "Fixed test errors."

This reverts commit 13efcc6.

* Revert "Updated thge profiler with wrapper"

This reverts commit 536f866.

* Added missing wrapper instances

* Updated copyrights.

* Fixed typo.

* Fixed copyrights.

* Updated copyrights.

* updated copyrights.

* comments on the atomics workaround

* fixed cmake comment

* Fix bug from merge

* clang-format-18

* Fix compilation error

* multi_abd wmma support:

 - Add multiple A and B support to multiple D implementation (gridwise level)
 - Add multi_abd GEMM (device level)
 - Add instances (xdl parity)
 - Add tests (both xdl and wmma)
 - Add examples
 - Add ckProfiler support (both xdl and wmma)

* Fix bug in device print function

* Fix unused template parameter

* Add support for fwd conv in gridwise implementation. Identical to run function for bwd data.

* Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future.

* Make relevant profilers print the number of valid instances to aid testing.

* Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now.  Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working.

* Reset output buffer after each run in profile_grouped_conv_fwd_impl().

* Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds.

* Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation.

* Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types.

* Add int8 instances for 2D vanilla grouped conv fwd all layouts.

* Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd

* Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts.

* Implement multi-AB support for grouped conv fwd and add example.

* Add 1D instances

* Add D layout tests to IsSupportedArgument()

* Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary.

* Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists.

* Add some more tests for vanilla grouped conv fwd

* Add 2D bias clamp instances and tests

* Add 3D bias clamp instances and tests

* Add 2D and 3D clamp instances and tests

* Unify problem sizes across vanilla and clamp flavor tests

* Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check.

* Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations.

* Remove wmma fp8 and bf8 instances when not targetting gfx12

* Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build

* Remove old years for newly created files.

* No need to time kernels for now.

* Fixup comments

* Pass struct args to Gridwise Run() function by reference.

* Don't use workspace memory in the case where A needs explicit transposition but B does not.

* Move calculation of rotating memory buffer sizes to Argument member functions.

* After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts.

* Unify xdl and wmma example code for grouped conv fwd scaleadd ab

* Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same.

* Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3.

* Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge.

* Small post-merge fixup, everything seems to work.

* Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming.

* Extend scaleadd_ab instance lists

* Extend merged groups instance lists, including adaptations of xdl "2x" instances.

* Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile.

* Extend "mem" instance lists.

* Extend regular instance lists.

* Fixup comments and ignored kernel arg name

* Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test.

* Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code.

* Re-enable sharding for wmma cshufflev3 instances

* Post merge fix to vanilla test

* Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general.

* Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop.

* Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument.

* Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed.

* Re-enable old wmma instances

* Re-enable Linqun's Xdl Wmma instances

* Small post-merge fixes

* Fix copyright headers

* Remove commented code snippet in gridwise

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance.

* Adding tuned instace list for groupoed conv fwd (#3288)

Following flavors are updated with tuned instance list:
  - grouped_conv2d_fwd
  - grouped_conv2d_fwd_bias_clamp
  - grouped_conv2d_fwd_clamp
  - grouped_conv3d_fwd
  - grouped_conv3d_fwd_bias_clamp
  - grouped_conv3d_fwd_clamp
  - grouped_conv3d_fwd_scaleadd_ab

Re-factored instance selection:
  - removed all the unnecessary instance tuples (comp/mem/16x16/generic)
  - removed all unnecessary layouts and data types

* Do not use std::remove_cvref_t, does not exist in C++17, use custom one.

* Splitting grouped conv fwd instances (#3449)

* Disable unnecessary and failing tests related to experimental CK builder

* Disable unnecessary ck builder experimental tests fully

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: Wojciech Laskowski <77888887+wj-laskowski@users.noreply.github.com>
afagaj pushed a commit that referenced this pull request Dec 19, 2025
* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the  missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build erros

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* removed unnecessary ck parts from compilation

* initial gemm_add_multiply instance implementations

* fixed profiler help message for gemm_add_multiply

* improved multiply_add profiler layout help

* fixed template arguments for test instances

* added test for gemm_add_multiply

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* switched to splitK interface

* log print added to splitk benchmarks

* revert main cmake comments

* newline change reverted

* added add_fastgelu instances

* revert unintended change in xdl add_fastgelu

* created gemm_add_add_fastgelu instances

* created fastegelu instances

* added tests for all splitk fastgelus

* Added tests.

* multiply_add instances created

* updates to add_multiply splitk instances

* splitk xdl test fixes

* added wmma multiply_multiply instances

* fixed ONLY_XDL_AND_WMMA_KERNELS tag

* Added gemm_add examples for wmma v1 and v3

* fixed / workarounded i8 instances

* Modified the v3 code to added one fp16 bxdl instance.

* added bf16 xdl instance.

* adding gemm_add wmma_cshuffle and other support


(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* add instances into camkelists


(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* This is work in progress, edited the template parameters in order to build

(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype


(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* added datatype and use clang-format-12


(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* Fixing build errors

* Added instances for v3

* Adding instances and executables

* Code update of template parameters modified.

* Renamed file.

* Added tests.

* resolved error tests.

* Fixing build errors

* Updated comments

* removed the changes as per the MR review comment.

* Updated tests.

* fp8 instances - not tested

* Restored the Cmake file that was reverted by mistake during rebase.

* fixed wmma_op test

* Updated comments.

* Updated the template parameter description

* fixed rdna4 instances

* fixed back compatibility on gfx11

* cleanups

* fix ckProfiler

* one more cmake fix

* added fp8 instances

* Updated tests to ad BF16 instances as per review comment

* Added include file and cleaned up(as per review comment)

* Updated and optimized the example code for all types.

* Fixed clang format

* Resolve "Implement `device_gemm_bilinear` for RDNA4"

* test generalization to handle FP16 shuffle better

* added missing changes

* Added bf16 wmma instance for add_relu

* Added f16 wmma instance and corrected bf16 instance errors.

* Added instances to Cmake

* Modified the template parameters to make the instances work.

* Fixed typo in profiler

* Added v3 instances for gemm_add_relu

* addressed core review comments

* Added test for gemm_add_relu wmma instance

* Cleaned up the code.

* Added examples for gemm_add_relu

* Fixing typo to resolve build errors.

* Fixes applied to fix  the precision loss.

* fix billinear test after merge

* Removed the old wmma instances.

* Added wrapper and renamed the wmma_v3 instances

* Updated copyrights and added wrappers.

* Fixes applied according to review comments

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Robin Voetter <robin@streamhpc.com>

* Removed the old wmma instances.

* Updated wrapper for the v3 instances

* removed the old wmma examples

* Renamed the v3 instances

* Deleted the  gtest file added by mistake.

* Updated thge profiler with wrapper

* Fixed test errors.

* Fixed the review comments

* Fixed the if condition MACROS.

* REVERTED THE PROFILER CHANGES

* Revert "REVERTED THE PROFILER CHANGES"

This reverts commit 21cb985.

* Revert "Fixed test errors."

This reverts commit 13efcc6.

* Revert "Updated thge profiler with wrapper"

This reverts commit 536f866.

* Added missing wrapper instances

* Updated copyrights.

* Fixed typo.

* Fixed copyrights.

* Updated copyrights.

* updated copyrights.

* comments on the atomics workaround

* fixed cmake comment

* Fix bug from merge

* clang-format-18

* Fix compilation error

* multi_abd wmma support:

 - Add multiple A and B support to multiple D implementation (gridwise level)
 - Add multi_abd GEMM (device level)
 - Add instances (xdl parity)
 - Add tests (both xdl and wmma)
 - Add examples
 - Add ckProfiler support (both xdl and wmma)

* Fix bug in device print function

* Fix unused template parameter

* Add support for fwd conv in gridwise implementation. Identical to run function for bwd data.

* Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future.

* Make relevant profilers print the number of valid instances to aid testing.

* Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now.  Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working.

* Reset output buffer after each run in profile_grouped_conv_fwd_impl().

* Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds.

* Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation.

* Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types.

* Add int8 instances for 2D vanilla grouped conv fwd all layouts.

* Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd

* Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts.

* Implement multi-AB support for grouped conv fwd and add example.

* Add 1D instances

* Add D layout tests to IsSupportedArgument()

* Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary.

* Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists.

* Add some more tests for vanilla grouped conv fwd

* Add 2D bias clamp instances and tests

* Add 3D bias clamp instances and tests

* Add 2D and 3D clamp instances and tests

* Unify problem sizes across vanilla and clamp flavor tests

* Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check.

* Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations.

* Remove wmma fp8 and bf8 instances when not targetting gfx12

* Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build

* Remove old years for newly created files.

* No need to time kernels for now.

* Fixup comments

* Pass struct args to Gridwise Run() function by reference.

* Don't use workspace memory in the case where A needs explicit transposition but B does not.

* Move calculation of rotating memory buffer sizes to Argument member functions.

* After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts.

* Unify xdl and wmma example code for grouped conv fwd scaleadd ab

* Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same.

* Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3.

* Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge.

* Small post-merge fixup, everything seems to work.

* Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming.

* Extend scaleadd_ab instance lists

* Extend merged groups instance lists, including adaptations of xdl "2x" instances.

* Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile.

* Extend "mem" instance lists.

* Extend regular instance lists.

* Fixup comments and ignored kernel arg name

* Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test.

* Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code.

* Re-enable sharding for wmma cshufflev3 instances

* Post merge fix to vanilla test

* Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general.

* Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop.

* Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument.

* Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed.

* Re-enable old wmma instances

* Re-enable Linqun's Xdl Wmma instances

* Small post-merge fixes

* Fix copyright headers

* Remove commented code snippet in gridwise

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance.

* Adding tuned instace list for groupoed conv fwd (#3288)

Following flavors are updated with tuned instance list:
  - grouped_conv2d_fwd
  - grouped_conv2d_fwd_bias_clamp
  - grouped_conv2d_fwd_clamp
  - grouped_conv3d_fwd
  - grouped_conv3d_fwd_bias_clamp
  - grouped_conv3d_fwd_clamp
  - grouped_conv3d_fwd_scaleadd_ab

Re-factored instance selection:
  - removed all the unnecessary instance tuples (comp/mem/16x16/generic)
  - removed all unnecessary layouts and data types

* Do not use std::remove_cvref_t, does not exist in C++17, use custom one.

* Splitting grouped conv fwd instances (#3449)

* Disable unnecessary and failing tests related to experimental CK builder

* Disable unnecessary ck builder experimental tests fully

* Added large tensor support for grouped conv fwd wmma

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
wj-laskowski added a commit that referenced this pull request Dec 19, 2025
…3450)

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the  missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build erros

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* removed unnecessary ck parts from compilation

* initial gemm_add_multiply instance implementations

* fixed profiler help message for gemm_add_multiply

* improved multiply_add profiler layout help

* fixed template arguments for test instances

* added test for gemm_add_multiply

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* switched to splitK interface

* log print added to splitk benchmarks

* revert main cmake comments

* newline change reverted

* added add_fastgelu instances

* revert unintended change in xdl add_fastgelu

* created gemm_add_add_fastgelu instances

* created fastegelu instances

* added tests for all splitk fastgelus

* Added tests.

* multiply_add instances created

* updates to add_multiply splitk instances

* splitk xdl test fixes

* added wmma multiply_multiply instances

* fixed ONLY_XDL_AND_WMMA_KERNELS tag

* Added gemm_add examples for wmma v1 and v3

* fixed / workarounded i8 instances

* Modified the v3 code to added one fp16 bxdl instance.

* added bf16 xdl instance.

* adding gemm_add wmma_cshuffle and other support


(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* add instances into camkelists


(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* This is work in progress, edited the template parameters in order to build

(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype


(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* added datatype and use clang-format-12


(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* Fixing build errors

* Added instances for v3

* Adding instances and executables

* Code update of template parameters modified.

* Renamed file.

* Added tests.

* resolved error tests.

* Fixing build errors

* Updated comments

* removed the changes as per the MR review comment.

* Updated tests.

* fp8 instances - not tested

* Restored the Cmake file that was reverted by mistake during rebase.

* fixed wmma_op test

* Updated comments.

* Updated the template parameter description

* fixed rdna4 instances

* fixed back compatibility on gfx11

* cleanups

* fix ckProfiler

* one more cmake fix

* added fp8 instances

* Updated tests to ad BF16 instances as per review comment

* Added include file and cleaned up(as per review comment)

* Updated and optimized the example code for all types.

* Fixed clang format

* Resolve "Implement `device_gemm_bilinear` for RDNA4"

* test generalization to handle FP16 shuffle better

* added missing changes

* Added bf16 wmma instance for add_relu

* Added f16 wmma instance and corrected bf16 instance errors.

* Added instances to Cmake

* Modified the template parameters to make the instances work.

* Fixed typo in profiler

* Added v3 instances for gemm_add_relu

* addressed core review comments

* Added test for gemm_add_relu wmma instance

* Cleaned up the code.

* Added examples for gemm_add_relu

* Fixing typo to resolve build errors.

* Fixes applied to fix  the precision loss.

* fix billinear test after merge

* Removed the old wmma instances.

* Added wrapper and renamed the wmma_v3 instances

* Updated copyrights and added wrappers.

* Fixes applied according to review comments

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Robin Voetter <robin@streamhpc.com>

* Removed the old wmma instances.

* Updated wrapper for the v3 instances

* removed the old wmma examples

* Renamed the v3 instances

* Deleted the  gtest file added by mistake.

* Updated thge profiler with wrapper

* Fixed test errors.

* Fixed the review comments

* Fixed the if condition MACROS.

* REVERTED THE PROFILER CHANGES

* Revert "REVERTED THE PROFILER CHANGES"

This reverts commit 21cb985.

* Revert "Fixed test errors."

This reverts commit 13efcc6.

* Revert "Updated thge profiler with wrapper"

This reverts commit 536f866.

* Added missing wrapper instances

* Updated copyrights.

* Fixed typo.

* Fixed copyrights.

* Updated copyrights.

* updated copyrights.

* comments on the atomics workaround

* fixed cmake comment

* Fix bug from merge

* clang-format-18

* Fix compilation error

* multi_abd wmma support:

 - Add multiple A and B support to multiple D implementation (gridwise level)
 - Add multi_abd GEMM (device level)
 - Add instances (xdl parity)
 - Add tests (both xdl and wmma)
 - Add examples
 - Add ckProfiler support (both xdl and wmma)

* Fix bug in device print function

* Fix unused template parameter

* Add support for fwd conv in gridwise implementation. Identical to run function for bwd data.

* Initial device implementation for grouped conv fwd multiABD wmma cshuffleV3. Functional but needs some fixups and extra features in the future.

* Make relevant profilers print the number of valid instances to aid testing.

* Add instances for all vanilla 2D and 3D flavors for f16 and bf16, only one instance per instance list to save compile time for now.  Also added incomplete set of comp instances and bias_clamp for f16 2D, just to make sure the multiple-D aspects of the device implementation are working.

* Reset output buffer after each run in profile_grouped_conv_fwd_impl().

* Disable sharding for the new instances for now, has tendency to lead to linker errors on repeat builds.

* Add CTranspose optimization for NCHW cases just like in xdl cshuffle non-v3 device implementation.

* Add instances for all 8-bit 3D vanilla grouped conv fwd types, including mixed types but with the exception of deprecated f16 comp fp8. Adapt test so we can test 8-bit and mixed types.

* Add int8 instances for 2D vanilla grouped conv fwd all layouts.

* Implement merged groups in device impl and add instances for merged groups 3D vanilla conv fwd

* Add merged groups instances for all 2D vanilla grouped conv fwd types and layouts.

* Implement multi-AB support for grouped conv fwd and add example.

* Add 1D instances

* Add D layout tests to IsSupportedArgument()

* Add comp and mem instances for all vanilla 2D grouped conv fwd types. Skipping "x2" and "part2" instance lists, can be added later without special names if necessary.

* Add comp and mem instances for vanilla 3D grouped conv fwd. Skipped 2x and part2 instances, can be added later in the same instance lists.

* Add some more tests for vanilla grouped conv fwd

* Add 2D bias clamp instances and tests

* Add 3D bias clamp instances and tests

* Add 2D and 3D clamp instances and tests

* Unify problem sizes across vanilla and clamp flavor tests

* Clean up device implementation: remove old todos, remove unnecessary comments and print statements, tweak description, wrap all prints in env check.

* Implement rotating memory and flush cache. Requires ad-hoc buffer size calculations.

* Remove wmma fp8 and bf8 instances when not targetting gfx12

* Add newer instances to DEVICE_INSTANCES so the main ckProfiler can build

* Remove old years for newly created files.

* No need to time kernels for now.

* Fixup comments

* Pass struct args to Gridwise Run() function by reference.

* Don't use workspace memory in the case where A needs explicit transposition but B does not.

* Move calculation of rotating memory buffer sizes to Argument member functions.

* After the convolution to gemm transformation, the resulting 2D tensor descriptors are not necessarily RowMajor or ColumnMajor, so things should not rely on this distinction. Therefore, pass all RowMajor to the Gridwise and use a special version of CheckValidity that does not rely on 2D tensor layouts.

* Unify xdl and wmma example code for grouped conv fwd scaleadd ab

* Go back to passing RCR 2D tensor layouts to gridwise gemm, and use CRC for the CTranspose case. Also remove the special convolution version of checkValidity(). It seems like no matter what 2D tensor layouts you pass to the gridwise gemm, and no matter if you are using extraMN, and no matter if you are using the convolution version of checkvalidity, the results of all tests are the same.

* Add wmma scaleadd ab instances to the device factory and add a completely new scaleadd_ab gtest test for wmma cshufflev3 and xdl. Currently there is no profiler for scaleadd_ab so I made my own inside the test. Furthermore for XDL only the (NDHWGC, GKZYXC, NDHWGK) layout combination existed in the instance factory so that is the only one I added for wmma cshufflev3 and the gtest test as well. Another layout is tested in example 62, for xdl and wmma cshufflev3.

* Add support for V3 pipeline (tested). To be able to support num_loop < 3 we need the fixes from the batched gemm gemm MR which was already merged upstream, so just need to rebase or merge.

* Small post-merge fixup, everything seems to work.

* Do not build or run Xdl operations with Wmma backend for now. Will be reverted before upstreaming.

* Extend scaleadd_ab instance lists

* Extend merged groups instance lists, including adaptations of xdl "2x" instances.

* Extend "comp" instance lists, including "2x" and "part2" instances. 2x instances disabled for now since they do not compile.

* Extend "mem" instance lists.

* Extend regular instance lists.

* Fixup comments and ignored kernel arg name

* Properly use the splitN offsets for D tensors in the gridwise Run() function. Was necessary to pass the bias_clamp_large_cases test.

* Make sure all strides in ComputePtrOffset are at least value initialized to avoid undefined strides. Not convinced this struct is properly initialized in other code / future code.

* Re-enable sharding for wmma cshufflev3 instances

* Post merge fix to vanilla test

* Optionally allow num_k_loop <= PrefetchStages in gridwise CheckValidity. Use this for grouped conv fwd but not in general.

* Remove spurious ck_tile changes that were presumably introduced somewhere in the repeated merging from develop.

* Post-merge fixes. Make sure the new gridwise gemm wmma v3 common Run function can be used. Remove splitK, and forceThreadTileTransfer for now. Also add CShuffle epilogue argument.

* Disable FP8 / BF8 testing on CDNA1/2, it doesn't work anymore and needs to be either fixed or removed.

* Re-enable old wmma instances

* Re-enable Linqun's Xdl Wmma instances

* Small post-merge fixes

* Fix copyright headers

* Remove commented code snippet in gridwise

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Limit the explicit cast added in threadwise_tensor_slice_transfer_v7r3 to only be used for f8, just in case it hurts performance.

* Adding tuned instace list for groupoed conv fwd (#3288)

Following flavors are updated with tuned instance list:
  - grouped_conv2d_fwd
  - grouped_conv2d_fwd_bias_clamp
  - grouped_conv2d_fwd_clamp
  - grouped_conv3d_fwd
  - grouped_conv3d_fwd_bias_clamp
  - grouped_conv3d_fwd_clamp
  - grouped_conv3d_fwd_scaleadd_ab

Re-factored instance selection:
  - removed all the unnecessary instance tuples (comp/mem/16x16/generic)
  - removed all unnecessary layouts and data types

* Do not use std::remove_cvref_t, does not exist in C++17, use custom one.

* Splitting grouped conv fwd instances (#3449)

* Disable unnecessary and failing tests related to experimental CK builder

* Disable unnecessary ck builder experimental tests fully

* Adding extra flavors for grouped conv fwd

As titled. Following variants are added:
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_scale

* fix cmake error

* Fix failing int8 test for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants