[ET-VK] Efficient tiled int8 matmul #9766

SS-JIA · 2025-03-31T16:09:17Z

Stack from ghstack (oldest at bottom):

Context

Introduce an optimized tiled implementation of the int8 weight-quantized linear operation.

This implementation takes advantage of the following principles to squeeze out performance:

* Compute an output tile with each thread, rather than a single output element. This allows better reuse of loaded input tensor data.
* Compute the output tile by iteratively loading tiles of the input matrices, caching them in registers, and then performing the `fma` accumulations to obtain a partial output. Splitting data loading and computation into distinct steps lets the GPU hide latency more effectively, i.e. switch to a warp that is ready to compute while the current warp is waiting on a data load.
* Use a work group size of `{N, 1, 1}`. All threads in a work group then load the same row of the input matrix and consecutive columns of the weight matrix. The input row stays hot in the cache, and accesses to the weight matrix can be coalesced because the previous diff un-transposed the weight matrix.
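The tiling scheme described in the list above can be sketched on the CPU as follows. This is a minimal illustration in plain Python, not the shader from this PR: the function names, the tile sizes `TM`, `TN`, `TK`, the `K x N` weight layout, and the per-output-channel `scales` are all assumptions made for the example. Each call to `compute_tile` stands in for the work of one GPU thread, with a "load tile" step kept separate from the accumulation step.

```python
def linear_q8_reference(x, w_q, scales):
    """Naive baseline: one output element at a time.

    x: M x K floats, w_q: K x N int8 values, scales: N floats
    (per-output-channel scales are an assumption of this sketch).
    """
    M, K, N = len(x), len(w_q), len(w_q[0])
    return [
        [scales[n] * sum(x[m][k] * w_q[k][n] for k in range(K)) for n in range(N)]
        for m in range(M)
    ]


def compute_tile(x, w_q, m0, n0, TM, TN, TK):
    """Work of one simulated thread: a TM x TN output tile at (m0, n0)."""
    M, K, N = len(x), len(w_q), len(w_q[0])
    acc = [[0.0] * TN for _ in range(TM)]  # accumulators ("registers")
    for k0 in range(0, K, TK):
        kt = min(TK, K - k0)
        # Step 1: load a tile of each input into local storage.
        # Out-of-bounds positions are padded with zeros.
        x_tile = [
            [x[m0 + i][k0 + k] if m0 + i < M else 0.0 for k in range(kt)]
            for i in range(TM)
        ]
        w_tile = [
            [float(w_q[k0 + k][n0 + j]) if n0 + j < N else 0.0 for j in range(TN)]
            for k in range(kt)
        ]
        # Step 2: multiply-accumulate using only the cached tiles.
        # Each loaded x value is reused TN times, each w value TM times.
        for i in range(TM):
            for j in range(TN):
                for k in range(kt):
                    acc[i][j] += x_tile[i][k] * w_tile[k][j]
    return acc


def linear_q8_tiled(x, w_q, scales, TM=4, TN=4, TK=4):
    """Tiled version: every (m0, n0) pair is one simulated thread."""
    M, N = len(x), len(w_q[0])
    out = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, TM):
        # Threads that share m0 read the same input rows and
        # neighbouring weight columns, mirroring an {N, 1, 1} work group.
        for n0 in range(0, N, TN):
            acc = compute_tile(x, w_q, m0, n0, TM, TN, TK)
            for i in range(min(TM, M - m0)):
                for j in range(min(TN, N - n0)):
                    out[m0 + i][n0 + j] = acc[i][j] * scales[n0 + j]
    return out
```

Calling `linear_q8_tiled(x, w_q, scales)` should give the same result as `linear_q8_reference(x, w_q, scales)`. The point of the sketch is the access pattern: with a `TM x TN` tile, each loaded input value feeds `TN` multiply-accumulates and each loaded weight value feeds `TM`, instead of one each in the naive loop.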

Differential Revision: D72066587

pytorch-bot · 2025-03-31T16:09:20Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/9766

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job

As of commit 7a20e01 with merge base 2aa7748:

NEW FAILURES - The following jobs have failed:

* Check Labels / Check labels (gh)
  RuntimeError: Error checking labels: PR does not have required labels
* pull / unittest-arm / linux-job (gh)
  RuntimeError: Command docker exec -t 21649e01b30049b155aa3d5d54154631814a36ad5922fb8af584e942ae3aad6a /exec failed with exit code 1

CANCELLED JOB - The following job was cancelled. Please retry:

pull / test-static-llama-qnn-linux / linux-job (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

ghstack-source-id: 275129678
Pull Request resolved: #9766

facebook-github-bot · 2025-03-31T16:09:37Z

This pull request was exported from Phabricator. Differential Revision: D72066587

github-actions · 2025-03-31T16:09:48Z

This PR needs a `release notes:` label

If your changes are user facing and intended to be a part of release notes, please use a label starting with `release notes:`.

If not, please add the `topic: not user facing` label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Pull Request resolved: #9766
ghstack-source-id: 275180032

facebook-github-bot · 2025-03-31T19:18:44Z

This pull request was exported from Phabricator. Differential Revision: D72066587

SS-JIA mentioned this pull request Mar 31, 2025

[ET-VK] Store weights transposed for int8 linear #9765

Merged

facebook-github-bot added the CLA Signed label Mar 31, 2025

facebook-github-bot added the fb-exported label Mar 31, 2025

trivedivivek approved these changes Apr 1, 2025

facebook-github-bot merged commit 219e746 into gh/SS-JIA/205/base Apr 1, 2025
80 of 84 checks passed

facebook-github-bot deleted the gh/SS-JIA/205/head branch April 1, 2025 16:14

facebook-github-bot temporarily deployed to cherry-pick-bot April 1, 2025 16:14 — with GitHub Actions Inactive

pytorchbot mentioned this pull request Apr 1, 2025

[ET-VK] Efficient tiled int8 matmul #9804

Merged
