Optimize FBGEMM Triton MX4 Quantize #2838
This pull request was exported from Phabricator. Differential Revision: D59688150
Summary: Pull Request resolved: pytorch#2838

We apply the same technique used for dequantize in D59661776 to MX4 quantization: fancy indexing lets the Triton kernel write both exponents and values into a single output tensor. This means we allocate only one output and make no extra copies, giving a sizeable 40% performance boost.

Before this change:

```
INFO:root:input size: 1073741824 group size: 32
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 MX4 quantized time per iter: 7563us
input_size=1073741824 MX4 dequantized time per iter: 2756us
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 MX4 triton quantized time per iter: 5110us
input_size=1073741824 MX4 triton dequantized time per iter: 2417us
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 FP8 quantized time per iter: 6274us
input_size=1073741824 FP8 dequantized time per iter: 4223us
```

After this change:

```
INFO:root:input size: 1073741824 group size: 32
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 MX4 quantized time per iter: 7560us
input_size=1073741824 MX4 dequantized time per iter: 2758us
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 MX4 triton quantized time per iter: 3138us
input_size=1073741824 MX4 triton dequantized time per iter: 2418us
INFO:root:Start to benchmark ...
INFO:root:Start to benchmark ...
input_size=1073741824 FP8 quantized time per iter: 6274us
input_size=1073741824 FP8 dequantized time per iter: 4226us
```

Differential Revision: D59688150
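For illustration, here is a minimal sketch of the single-output-tensor indexing described above. This is not the FBGEMM kernel: the output layout (GROUP_SIZE // 2 packed value bytes followed by one shared-exponent byte per group), the exponent bias, and the linear int4 code standing in for the real MX4 e2m1 encoding are all assumptions.

```python
import triton
import triton.language as tl


@triton.jit
def mx4_quantize_sketch(in_ptr, out_ptr, num_groups, GROUP_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    if pid >= num_groups:
        return
    PACKED = GROUP_SIZE // 2  # two 4-bit values per output byte
    half = tl.arange(0, PACKED)
    base = pid * GROUP_SIZE
    # Load even/odd elements separately so each pair packs into one byte.
    # Assumes the input length is an exact multiple of GROUP_SIZE.
    lo = tl.load(in_ptr + base + 2 * half)
    hi = tl.load(in_ptr + base + 2 * half + 1)

    # Shared power-of-two exponent for the group.
    amax = tl.maximum(tl.max(tl.abs(lo), axis=0), tl.max(tl.abs(hi), axis=0))
    shared_exp = tl.floor(tl.log2(amax + 1e-30))
    scale = tl.exp2(shared_exp)

    # Placeholder 4-bit quantization: round and clamp to [-8, 7].
    q_lo = tl.maximum(tl.minimum(tl.floor(lo / scale + 0.5), 7.0), -8.0).to(tl.int32)
    q_hi = tl.maximum(tl.minimum(tl.floor(hi / scale + 0.5), 7.0), -8.0).to(tl.int32)
    packed = ((q_lo & 0xF) | ((q_hi & 0xF) << 4)).to(tl.uint8)

    # Fancy indexing: the value bytes and the exponent byte land in ONE
    # output tensor, so no second allocation or post-hoc copy is needed.
    out_base = pid * (PACKED + 1)
    tl.store(out_ptr + out_base + half, packed)
    tl.store(out_ptr + out_base + PACKED, (shared_exp + 127.0).to(tl.uint8))
```

With grid=(num_groups,) and a single torch.empty(num_groups * (GROUP_SIZE // 2 + 1), dtype=torch.uint8) allocation, nothing has to be concatenated or copied on the Python side afterward, which is where the savings come from.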
Force-pushed from d3a7577 to 4d1f5ed
Summary: X-link: facebookresearch/FBGEMM#20 Pull Request resolved: pytorch#2816

As noted in [this doc](https://docs.google.com/document/d/156Du0hBRH6umG_i-OrYC574XhpQMUU5SJYG0RTS2tTg/edit#heading=h.akfcp7xpg8cr), using a ceiling round for the scale calculation does a better job of not truncating mantissa bits. This diff switches Triton's floor rounding to ceil rounding. Note that mx4_test currently doesn't pass, since the CUDA kernel now behaves differently from Triton; once we rebase this diff onto a similar change to the CUDA kernel, outputs should match exactly again.

Differential Revision: D59527463
Reviewed By: jianyuh
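As a toy illustration only (plain Python rather than the kernel, and the function name is made up), the rounding switch looks like this:

```python
import math

def group_scale_exponent(amax: float, use_ceil: bool = True) -> int:
    """Power-of-two exponent used to scale a group before 4-bit encoding."""
    e = math.log2(amax)
    # Floor rounding can make the shared scale too small, so the largest
    # values in the group saturate and lose mantissa bits; ceil avoids that.
    return math.ceil(e) if use_ceil else math.floor(e)

print(group_scale_exponent(6.5, use_ceil=False))  # 2 -> scale 4.0
print(group_scale_exponent(6.5, use_ceil=True))   # 3 -> scale 8.0
```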
Force-pushed from 4d1f5ed to 1503c35
Summary: Pull Request resolved: pytorch#2836

Rather than trying to reshape inputs into 2D matrices with each thread operating on one row, this refactor uses 1D inputs and has each thread operate on an offset into the array. The main benefit is that it avoids ragged tensors when an input can't be divided into evenly sized rows, which should make us compatible with more shapes.

Differential Revision: D59653809
Reviewed By: sryap
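A minimal sketch of that 1D-offset pattern (illustrative names, and the body just copies where the real kernel would quantize): each program handles a flat slice of the input and masks off the tail, so no 2D reshape is ever needed.

```python
import triton
import triton.language as tl


@triton.jit
def flat_offset_kernel(in_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements  # tail block masks off out-of-range elements
    x = tl.load(in_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x, mask=mask)
```

Launched with grid = (triton.cdiv(n_elements, BLOCK),), the ragged tail is handled by the mask rather than by padding the input into even rows.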
Summary: We previously had to use Python to unravel values from exponents and feed them to Triton as two separate tensors, which added a lot of overhead in the form of large copies. This diff does a bunch of fancy indexing to operate directly on a tensor with mixed elements and exponents. The result is that Triton dequantize is now slightly faster than the CUDA kernel. My hope is that this allows us to standardize on a single implementation.

Differential Revision: D59661776
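A matching dequantize sketch, under the same assumed layout as the quantize sketch above; the sign-extended linear int4 decode is a placeholder for the real e2m1 decode:

```python
import triton
import triton.language as tl


@triton.jit
def mx4_dequantize_sketch(in_ptr, out_ptr, num_groups, GROUP_SIZE: tl.constexpr):
    pid = tl.program_id(0)
    if pid >= num_groups:
        return
    PACKED = GROUP_SIZE // 2
    half = tl.arange(0, PACKED)
    in_base = pid * (PACKED + 1)
    # Read the packed value bytes and the trailing exponent byte of each
    # group straight out of one mixed uint8 tensor, via index arithmetic,
    # instead of pre-splitting the tensor in Python.
    packed = tl.load(in_ptr + in_base + half).to(tl.int32)
    exp = tl.load(in_ptr + in_base + PACKED).to(tl.float32) - 127.0
    scale = tl.exp2(exp)

    # Unpack two signed 4-bit codes per byte (placeholder linear code).
    lo = packed & 0xF
    lo = tl.where(lo > 7, lo - 16, lo)  # sign-extend low nibble
    hi = (packed >> 4) & 0xF
    hi = tl.where(hi > 7, hi - 16, hi)  # sign-extend high nibble

    out_base = pid * GROUP_SIZE
    tl.store(out_ptr + out_base + 2 * half, lo.to(tl.float32) * scale)
    tl.store(out_ptr + out_base + 2 * half + 1, hi.to(tl.float32) * scale)
```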
Force-pushed from 1503c35 to fd15d1f
Force-pushed from fd15d1f to fe95d06
This pull request has been merged in e28a151.