[AMD][BACKEND] Use PaddedLayout with AsyncCopy on gfx950 when pipelining #8365
…ing for the smem layout dev Bank conflict free layouts for mfma32 and mfma16 with and without transpose Bank conflict free kWidth=4 Fix debug prints Put padded composition to separate function
This reverts commit 415e24c.
Nice!
Thank you for the quick review, I think I addressed all comments. The |
37a9e80 fixes the bank conflicts for |
Adds support to the AMD pipeliner to compose `PaddedSharedEncoding` for `ttg.async_copy_global_to_local` on `gfx950`.
As described in #7929, padding alone cannot avoid bank conflicts on GFX9: due to the hardware design, padding can only be added at warp boundaries (64 threads × 16 bytes = 1024 bytes). So, in addition to padding, we also reorder rows via the `linearComponent` of the `PaddedSharedEncoding`.
Rows are reordered so that 16 consecutive logical rows are placed in shared memory at a stride of 1024 bytes. For instance, if each row is 256 bytes, the layout would look like:
[[row0], [row16], [row32], [row48], /*1024bytes*/ [row1], [row17], [row33], [row49], /*2048bytes*/ [row2], [row18] ...]
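To make the reordering concrete, here is a minimal Python sketch that reproduces the row order shown above. It is not the code from this PR: the helper names `reordered_rows` and `physical_slot` are hypothetical, the sketch ignores the padding bytes themselves, and it assumes a single group of `16 * (1024 / row_bytes)` rows (how larger tiles repeat is not modeled here).

```python
# Illustrative sketch only; names and the single-group assumption are not from the PR.
WARP_SIZE = 64          # threads per warp on GFX9 (from the description)
BYTES_PER_THREAD = 16   # bytes copied per thread (from the description)
BLOCK_BYTES = WARP_SIZE * BYTES_PER_THREAD  # 1024 bytes between padding points


def reordered_rows(row_bytes, group=16):
    """Return logical row indices in the order they appear in shared memory.

    Each 1024-byte block holds rows_per_block rows; block k holds logical
    rows k, k + group, k + 2 * group, ... so that logical rows 0..group-1
    start at consecutive 1024-byte boundaries (padding not counted).
    """
    rows_per_block = BLOCK_BYTES // row_bytes
    order = []
    for block in range(group):
        for j in range(rows_per_block):
            order.append(block + j * group)
    return order


def physical_slot(row, row_bytes, group=16):
    """Inverse mapping: the row-sized slot a logical row lands in."""
    rows_per_block = BLOCK_BYTES // row_bytes
    return (row % group) * rows_per_block + row // group


if __name__ == "__main__":
    print(reordered_rows(256)[:10])
    # Unpadded byte offset of the first few logical rows.
    print([physical_slot(r, 256) * 256 for r in range(4)])
```

With 256-byte rows, four rows fit in each 1024-byte block, so logical rows 0 through 15 start at unpadded offsets 0, 1024, 2048, and so on, which is the stride described above.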
The aim of those layouts is to reduce register pressure and instruction count compared to swizzled layouts at the expense of a slightly increased LDS memory footprint.
This PR includes support for the case where `dtype==16bits` and the tensor size is >= 16KB.