Use order from A matrix when determining DPAS layout #2834
ce3f3c9 to efc8e2a
The latest changes resolve performance regressions with …
1b5a55a to 976c1a1
I don't understand why we look only at the ordering attribute (layout) of matrix A to determine the DPAS layout. I thought we should look at the layouts of both matrix A and matrix B before deciding.
…2956) Required for #2834. There are two reasons to do this. First, it properly tags the layouts with their memory order very early in the TTGIR pipeline. Second, it moves our TTGIR pipeline closer to upstream. I am splitting the change to separate any regressions or undesired behavior caused by this change from those caused by changing the DPAS layouts in #2834. cc #2354
976c1a1 to 0ecff19
When we determined that oneDNN used a different layout for …
0ecff19 to 7eee7e3
One of the learnings from mapping oneDNN kernels to Triton layouts is that the `warpsPerCTA` attribute for the DPAS layout should be modified to match the A matrix configuration. If A is a row-major matrix, `warpsPerCTA` should bias towards narrow blocks, presumably because narrow blocks better match the shape of the input matrix. If A is column major, `warpsPerCTA` should bias towards a tall matrix, because the fast-changing dimension of A is now the columns. For the GEMM case, this change has the effect of interspersing DPAS instructions and shuffles, which I believe reduces memory pressure and results in better performance for AxBT. AxB is unchanged, presumably because the loads are quite efficient with minimal shuffles needed.

To allow the A matrix layout to affect the choice of DPAS layout, I needed the A matrix layout to properly convey the A matrix order. This info was recently introduced via AxisInfoAnalysis run during the Coalesce pass. Following the upstream pipeline convention, I moved Coalesce to the top of the TTGIR optimization pipeline, which properly tags all blocked layouts with the correct order. We then use the order of the A matrix to determine the row and column dimensions when mapping warps to tiles.
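To make the idea concrete, here is a minimal Python sketch of an order-aware warp distribution. It is not the code in this PR: the function name `warps_per_cta`, its parameters, and the greedy doubling strategy are all hypothetical, and it assumes `num_warps` is a power of two and that `a_order` lists A's dimensions fastest-varying first (so `(1, 0)` is row major and `(0, 1)` is column major). The sketch simply grows the warp grid along A's fastest-varying dimension first, which yields a tall warp grid for column-major A and a wide grid of narrow per-warp blocks for row-major A.

```python
def warps_per_cta(num_warps, shape, tile, a_order):
    """Hypothetical sketch: split num_warps across the two output dimensions.

    num_warps: total warps, assumed to be a power of two.
    shape:     (M, N) size of the result block.
    tile:      (m, n) size of one DPAS tile (illustrative values only).
    a_order:   A's dimensions, fastest-varying first; (1, 0) = row major.
    """
    # Number of DPAS tiles needed along each dimension (ceiling division).
    tiles = [max(1, -(-shape[d] // tile[d])) for d in range(2)]
    fast, slow = a_order[0], a_order[1]

    warps = [1, 1]
    while warps[0] * warps[1] < num_warps:
        if warps[fast] < tiles[fast]:
            # Prefer A's fastest-varying dimension while tiles remain there.
            warps[fast] *= 2
        elif warps[slow] < tiles[slow]:
            warps[slow] *= 2
        else:
            # Both dimensions are saturated; park the leftover warps on `fast`.
            warps[fast] *= 2
    return warps


if __name__ == "__main__":
    print(warps_per_cta(8, (256, 256), (8, 16), (1, 0)))  # row-major A
    print(warps_per_cta(8, (256, 256), (8, 16), (0, 1)))  # column-major A
```

The real heuristic may weigh the two dimensions less aggressively; the point of the sketch is only that swapping `a_order` swaps which dimension receives warps first.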
I plan on modifying the tile sizes for AxB.T (and the A.T matrices) but wanted to split this change out as it is relatively compact but does modify the pass pipeline.
GEMM Performance with this change:
Performance from main: