Contributor

@AlexAUT AlexAUT commented Oct 3, 2025

Adds support to the AMD pipeliner to compose PaddedSharedEncoding for ttg.async_copy_global_to_local on gfx950.

As described in #7929, padding alone cannot avoid bank conflicts on GFX9: due to the hardware design, padding can only be added at warp boundaries (64 threads × 16 bytes = 1024 bytes). In addition to padding, we therefore also reorder rows via the linearComponent of the PaddedSharedEncoding.

Rows are reordered so that 16 consecutive logical rows are placed in shared memory at a stride of 1024 bytes. For instance, if each row is 256 bytes, the layout looks like:
[[row0], [row16], [row32], [row48], /*1024bytes*/ [row1], [row17], [row33], [row49], /*2048bytes*/ [row2], [row18] ...]
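
The reordering above can be reproduced with a small index calculation. The following is only a minimal sketch and not the pipeliner's implementation: the helper names `physical_slot` and `reordered_rows` are hypothetical, it models a single tile of 16 × (1024 / row size) rows, and it ignores the padding itself because the padding amount is not stated here.

```python
def physical_slot(row, row_bytes=256, block_bytes=1024, group=16):
    """Return the shared-memory slot (in units of rows) for a logical row.

    Hypothetical helper: models one tile only and ignores padding bytes.
    """
    rows_per_block = block_bytes // row_bytes  # 4 rows fit in 1024 bytes
    tile_rows = group * rows_per_block         # 64 rows in the modeled tile
    if not 0 <= row < tile_rows:
        raise ValueError("row is outside the modeled tile")
    # Consecutive logical rows land in consecutive 1024-byte blocks;
    # rows that are `group` apart share a block.
    return (row % group) * rows_per_block + row // group


def reordered_rows(row_bytes=256, block_bytes=1024, group=16):
    """List the logical row stored in each physical slot, in memory order."""
    tile_rows = group * (block_bytes // row_bytes)
    slots = [None] * tile_rows
    for row in range(tile_rows):
        slots[physical_slot(row, row_bytes, block_bytes, group)] = row
    return slots


if __name__ == "__main__":
    print(reordered_rows()[:10])
    # Byte offsets (without padding) of logical rows 0, 1 and 2.
    print([physical_slot(r) * 256 for r in range(3)])
```

With 256-byte rows, the first slots hold rows 0, 16, 32, 48, 1, 17, 33, 49, 2, 18, matching the layout above, and logical rows 0, 1 and 2 start 1024 bytes apart before any padding is applied.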

The aim of those layouts is to reduce register pressure and instruction count compared to swizzled layouts at the expense of a slightly increased LDS memory footprint.
This PR covers the case where the dtype is 16 bits wide and the tensor size is >= 16KB.

Collaborator

@antiagainst antiagainst left a comment

Nice!

@antiagainst antiagainst marked this pull request as ready for review October 3, 2025 17:59
@antiagainst antiagainst requested a review from zhanglx13 as a code owner October 3, 2025 17:59
Contributor Author

AlexAUT commented Oct 6, 2025

Thank you for the quick review, I think I addressed all comments.

The mfma32 case currently produces some bank conflicts because of a refactoring done before opening the PR. I am not sure whether we want to wait for the fix; I will have one a bit later today.

Contributor Author

AlexAUT commented Oct 6, 2025

37a9e80 fixes the bank conflicts for mfma32.

@antiagainst antiagainst merged commit ac0bb72 into triton-lang:main Oct 7, 2025
9 checks passed
2 participants