Fix Triton GEMM templates with k=1 #158650

PaulZhang12 · 2025-07-18T16:09:12Z

Stack from ghstack (oldest at bottom):

-> Fix Triton GEMM templates with k=1 #158650

Thanks to @davidberard98 for much of the analysis here. For GEMMs of K=1, the hints, tl.multiple_of and tl.max_contiguous apply completely, as the indices to the loads are only dependent on offs_m and offs_n. For shapes like (97x1), (1x97), this results in misaligned address errors, due to the fact that for all BLOCK_M and BLOCK_N sizes, the last tile is not a contiguous load. With K > 1 case, the hint is not as strict given the dependency on the k indices for the load as well. In the K=1 case, only offs_m and offs_n are used and broadcasted to the index shape.

One can say these hints are "wrong", but in various cases in the hints being wrong, such as with the shape 9999x4, 4x9999, there is a substantial performance improvement with the hint.

For nice shapes with K=1, where M, N are a multiple 8 to where these hints are fine and there is no misaligned address, there is no performance regression observed on H100:

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben

[ghstack-poisoned]

pytorch-bot · 2025-07-18T16:09:16Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158650

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

⏳ 79 Pending, 2 Unrelated Failures

As of commit 1fdaf57 with merge base 036eb1f ():

BROKEN TRUNK - The following job failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / cuda12.8-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu) (gh) (trunk failure)
MISSING REGRESSION TEST

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

pull / linux-jammy-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge, unstable) (gh) (#158876)
sccache: error: couldn't connect to server

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: e0f6f47 Pull Request resolved: #158650

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]

ghstack-source-id: ac8bb42 Pull Request resolved: #158650

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]

ghstack-source-id: 6abd628 Pull Request resolved: #158650

Thanks to davidberard98 for much of the analysis here. For GEMMs of K=1, the hints, `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices to the loads are only dependent on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, due to the fact that for all BLOCK_M and BLOCK_N sizes, the last tile is not a contiguous load. With K > 1 case, the hint is not as strict given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcasted to the index shape. For nice shapes with K=1, where M, N are a multiple 8 to where these hints are fine and there is no misaligned address, there is no performance regression observed on H100: <img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" /> cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]

ghstack-source-id: b07e6e7 Pull Request resolved: #158650

Thanks to davidberard98 for much of the analysis here. For GEMMs of K=1, the hints, `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices to the loads are only dependent on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, due to the fact that for all BLOCK_M and BLOCK_N sizes, the last tile is not a contiguous load. With K > 1 case, the hint is not as strict given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcasted to the index shape. One can say these hints are "wrong", but in various cases in the hints being wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint. For nice shapes with K=1, where M, N are a multiple 8 to where these hints are fine and there is no misaligned address, there is no performance regression observed on H100: <img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" /> cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]

ghstack-source-id: 475784c Pull Request resolved: #158650

eellison · 2025-07-22T17:10:38Z

torch/_inductor/kernel/mm.py


    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    rn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
-    if ((stride_am == 1 and stride_ak == M) or (stride_am == K and stride_ak == 1)) and M >= BLOCK_M:


Would you mind clarifying For GEMMs of K=1, the hints, tl.multiple_of and tl.max_contiguous apply completely, as the indices to the loads are only dependent on offs_m and offs_n - if they apply completely, why are we skipping the check? or maybe i'm misunderstanding.

For K != 1, the inductor-generated code in the k loop looks like this:

a_mask = offs_k[None, :] < (K - k_idx * BLOCK_K) b_mask = offs_k[:, None] < (K - k_idx * BLOCK_K) a_k_idx_vals = offs_k[None, :] + (k_idx * BLOCK_K) b_k_idx_vals = offs_k[:, None] + (k_idx * BLOCK_K) idx_m = offs_a_m[:, None] idx_n = a_k_idx_vals xindex = idx_n + 3*idx_m a = tl.load(A + (xindex), mask=a_mask, other=0.0) idx_m = b_k_idx_vals idx_n = offs_b_n[None, :] xindex = idx_n + 97*idx_m b = tl.load(B + (xindex), mask=b_mask, other=0.0)

So if we're loading a block of data [BLOCK_K, BLOCK_N], then although we assume contiguity/divisibility of indices[:, 0] (due to the hints), we can't assume divisibility of indices[:, 1] due to the addition of the idx_m * 97, where 97 has no divisibility properties - therefore no vectorization is performed.

But for K == 1, the inductor generated code looks like this

xindex = idx_m + idx_n a = tl.load(A + ((tl.broadcast_to(idx_m, xindex.shape)).broadcast_to(xindex.shape)), mask=a_mask, other=0.0) idx_m = b_k_idx_vals idx_n = offs_b_n[None, :] xindex = idx_n + 97*idx_m b = tl.load(B + ((tl.broadcast_to(idx_n, xindex.shape)).broadcast_to(xindex.shape)), mask=b_mask, other=0.0)

Here, we explicitly remove the indexing in the k dimension (as we should, since there's no point in loading a BLOCK_K x BLOCK_M and masking out (BLOCK_K-1) rows, if K = 1). As a result, there's nothing hinting that vectorization is not possible, so Triton vectorizes the loads.

Note: the real problem is that the divisibility and contiguity hints are wrong. But they do provide perf improvement (I don't fully understand why) and usually don't cause incorrect behavior or crashes.

So this PR is helpful by preventing many of the crashes while preserving perf on the cases where the hints help.

PaulZhang12 · 2025-07-22T23:05:07Z

@pytorchbot merge

pytorchmergebot · 2025-07-22T23:06:57Z

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

ZainRizvi · 2025-07-23T17:40:08Z

@pytorchbot revert -c ghfirst -m "Sorry but this is breaking internally, see D78805560 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts"

pytorchmergebot · 2025-07-23T17:41:58Z

@pytorchbot successfully started a revert job. Check the current status here.
Questions? Feedback? Please reach out to the PyTorch DevX Team

This reverts commit 9df0f56. Reverted #158650 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking internally, see D78805560 for details. To validate your fixes internally, you can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts ([comment](#158650 (comment)))

pytorchmergebot · 2025-07-23T17:42:15Z

@PaulZhang12 your PR has been successfully reverted.

ZainRizvi · 2025-07-23T18:43:49Z

@pytorchbot merge -f "Turns out the internal signal was flaky"

pytorchmergebot · 2025-07-23T18:45:32Z

Merge started

Your change will be merged immediately since you used the force (-f) flag, bypassing any CI checks (ETA: 1-5 minutes). Please use -f as last resort and instead consider -i/--ignore-current to continue the merge ignoring current failures. This will allow currently pending tests to finish and report signal before the merge.

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging

Check the merge workflow status
here

Fix Triton GEMM templates with k=1

694886c

[ghstack-poisoned]

pytorch-bot bot added ciflow/inductor module: inductor labels Jul 18, 2025

PaulZhang12 added a commit that referenced this pull request Jul 18, 2025

Fix Triton GEMM templates with k=1

e36b778

ghstack-source-id: e0f6f47 Pull Request resolved: #158650

PaulZhang12 added the topic: not user facing topic category label Jul 18, 2025

Update on "Fix Triton GEMM templates with k=1"

ee64b8b

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]

PaulZhang12 added a commit that referenced this pull request Jul 18, 2025

Fix Triton GEMM templates with k=1

bb28b2d

ghstack-source-id: ac8bb42 Pull Request resolved: #158650

Update on "Fix Triton GEMM templates with k=1"

6ecb4eb

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy chenyang78 kadeng muchulee8 amjames chauhang aakhundov coconutruben [ghstack-poisoned]

PaulZhang12 added a commit that referenced this pull request Jul 18, 2025

Fix Triton GEMM templates with k=1

43d64e9

ghstack-source-id: 6abd628 Pull Request resolved: #158650

PaulZhang12 added a commit that referenced this pull request Jul 18, 2025

Fix Triton GEMM templates with k=1

81c833c

ghstack-source-id: b07e6e7 Pull Request resolved: #158650

PaulZhang12 added 2 commits July 21, 2025 06:22

PaulZhang12 added a commit that referenced this pull request Jul 21, 2025

Fix Triton GEMM templates with k=1

8936ec3

ghstack-source-id: 475784c Pull Request resolved: #158650

PaulZhang12 requested a review from eellison July 21, 2025 17:18

eellison reviewed Jul 22, 2025

View reviewed changes

eellison requested a review from davidberard98 July 22, 2025 17:10

davidberard98 approved these changes Jul 22, 2025

View reviewed changes

pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Jul 22, 2025

pytorchmergebot added the merging label Jul 22, 2025

pytorchmergebot added the Merged label Jul 23, 2025

pytorchmergebot closed this in 9df0f56 Jul 23, 2025

pytorchmergebot removed the merging label Jul 23, 2025

pytorchmergebot added Reverted ci-no-td Do not run TD on this PR labels Jul 23, 2025

pytorchmergebot reopened this Jul 23, 2025

pytorchmergebot added the merging label Jul 23, 2025

pytorchmergebot closed this in fe8f556 Jul 23, 2025

pytorchmergebot removed the merging label Jul 23, 2025

github-actions bot deleted the gh/PaulZhang12/18/head branch August 23, 2025 02:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix Triton GEMM templates with k=1 #158650

Fix Triton GEMM templates with k=1 #158650

Uh oh!

PaulZhang12 commented Jul 18, 2025 •

edited

Loading

Uh oh!

pytorch-bot bot commented Jul 18, 2025 •

edited

Loading

Uh oh!

eellison Jul 22, 2025

Uh oh!

davidberard98 Jul 22, 2025 •

edited

Loading

Uh oh!

davidberard98 Jul 22, 2025

Uh oh!

PaulZhang12 commented Jul 22, 2025

Uh oh!

pytorchmergebot commented Jul 22, 2025

Uh oh!

ZainRizvi commented Jul 23, 2025

Uh oh!

pytorchmergebot commented Jul 23, 2025

Uh oh!

pytorchmergebot commented Jul 23, 2025

Uh oh!

ZainRizvi commented Jul 23, 2025

Uh oh!

pytorchmergebot commented Jul 23, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Fix Triton GEMM templates with k=1 #158650

Fix Triton GEMM templates with k=1 #158650

Uh oh!

Conversation

PaulZhang12 commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pytorch-bot bot commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/158650

⏳ 79 Pending, 2 Unrelated Failures

Uh oh!

eellison Jul 22, 2025

Choose a reason for hiding this comment

Uh oh!

davidberard98 Jul 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

davidberard98 Jul 22, 2025

Choose a reason for hiding this comment

Uh oh!

PaulZhang12 commented Jul 22, 2025

Uh oh!

pytorchmergebot commented Jul 22, 2025

Merge started

Uh oh!

ZainRizvi commented Jul 23, 2025

Uh oh!

pytorchmergebot commented Jul 23, 2025

Uh oh!

pytorchmergebot commented Jul 23, 2025

Uh oh!

ZainRizvi commented Jul 23, 2025

Uh oh!

pytorchmergebot commented Jul 23, 2025

Merge started

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

PaulZhang12 commented Jul 18, 2025 •

edited

Loading

pytorch-bot bot commented Jul 18, 2025 •

edited

Loading

davidberard98 Jul 22, 2025 •

edited

Loading