Enable preshuffled mixed dtype Cutlass Gemm #3722


Closed · jwfromm wants to merge 1 commit

Conversation

@jwfromm (Contributor) commented on Feb 21, 2025:

Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Differential Revision: D69955197
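
For background (this is not FBGEMM's public API, just a plain-PyTorch reference of the math the kernels implement): a mixed-dtype FP8 x INT4 GEMM multiplies FP8 activations by int4 weights that are dequantized on the fly with per-group scales. A minimal sketch, assuming row-wise int4 quantization and a hypothetical group size of 128:

```python
import torch

def reference_f8i4_gemm(
    x_fp8: torch.Tensor,    # [M, K] activations, torch.float8_e4m3fn
    w_int4: torch.Tensor,   # [N, K] int4 values stored in an int8 tensor
    w_scale: torch.Tensor,  # [N, K // group_size] per-group scales
    group_size: int = 128,
) -> torch.Tensor:
    # Dequantize: each contiguous group of `group_size` weights along K
    # shares a single scale factor.
    w = w_int4.to(torch.float32)
    w = (w.view(w.shape[0], -1, group_size) * w_scale.unsqueeze(-1)).view_as(w)
    # The real kernels fuse dequantization into the GEMM and preshuffle
    # the weights for tensor-core-friendly loads; this is only the
    # numeric reference.
    return x_fp8.to(torch.float32) @ w.t()
```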

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

@netlify netlify bot commented on Feb 21, 2025:

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: b225a2a
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67cb5e4a023de300088a3c4e
😎 Deploy Preview: https://deploy-preview-3722--pytorch-fbgemm-docs.netlify.app

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 21, 2025
Summary:

WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 21, 2025
Summary:

WIP to enable new optimized preshuffled fp8xint4 gemm.

While the example compiles and runs, it runs into a variety of problems. The outputs are either completely incorrect, contain NaNs, or the kernel hits an Illegal Memory Access. I'm not yet sure why.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

@jwfromm (Contributor, Author) commented on Feb 21, 2025:

@IwakuraRein Despite this compiling and running, I'm getting incorrect outputs and very poor performance (even slower than the legacy f8i4 without packing or shuffling). Can you take a look and see if I'm doing something obviously wrong?

Ignore the files besides f8i4_shuffled.cu and mixed_dtype_utils.cu, as the others just fix CUTLASS v3.8 compatibility.

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 22, 2025
Summary:

WIP to enable new optimized preshuffled fp8xint4 gemm.

While the example compiles and runs, it runs into a variety of problems. The outputs are either completely incorrect, contain NaNs, or the kernel hits an Illegal Memory Access. I'm not yet sure why.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197


@IwakuraRein commented:
@jwfromm Are there negative values in the scale factors? That might be the reason for the accuracy drop after enabling the lookup table, and it can be easily fixed by applying this change to external/cutlass/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp in your fork.
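
A quick way to test this hypothesis before patching CUTLASS is to inspect the scale tensor on the Python side. A minimal sketch, assuming `scales` is whatever tensor is being passed to the kernel:

```python
import torch

def assert_scales_positive(scales: torch.Tensor) -> None:
    # The lookup-table dequantization path is suspected above to break
    # on negative scales, so flag any violations before digging into
    # the kernel itself.
    bad = int((scales < 0).sum().item())
    if bad:
        raise ValueError(f"found {bad} negative scale value(s)")
```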

@jwfromm (Contributor, Author) commented on Feb 25, 2025:

@IwakuraRein The scales are all positive and I'm running with the latest CUTLASS head commit (as of yesterday). The link you posted doesn't seem to include any changes to sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp; did you mean to paste a different one?

@IwakuraRein commented:
@jwfromm Sorry, I meant the changes to include/cutlass/detail/collective/mixed_input_utils.hpp in that link. But since your scales are all positive and you're running with the latest CUTLASS, I guess this is not the issue.

@IwakuraRein commented on Mar 3, 2025:

fbgemm_gpu/experimental/gen_ai/bench/quantize_ops.py:1145:

```diff
- scales = scales.view(x.shape[0], -1)
+ scales = scales.view(x.shape[0], -1).t().contiguous()
```

fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/mixed_dtype_utils.cu:59:

```diff
- StrideB stride_B;
+ StrideB stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B);
```

These should fix the bugs.
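
For context on why these one-line changes matter: the first transposes the per-group weight scales from a row-major to a group-major layout, which is presumably the layout the preshuffled kernel expects, and the second initializes stride_B via CUTLASS's packed-stride helper instead of leaving it uninitialized, which would account for the illegal memory accesses. A minimal sketch of the scale-layout difference in plain PyTorch (sizes are illustrative, not FBGEMM's actual configuration):

```python
import torch

# Illustrative sizes: N weight rows, K reduction dim, one scale per
# group of 128 elements along K.
N, K, group_size = 4, 256, 128
scales = torch.rand(N * (K // group_size))

row_major = scales.view(N, -1)                     # [N, K // group_size]
group_major = scales.view(N, -1).t().contiguous()  # [K // group_size, N]

# Same values, different memory order: after the transpose, the scales
# of all N rows for a given group sit contiguously in memory.
print(row_major.shape, group_major.shape)  # [4, 2] and [2, 4]
```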

jwfromm force-pushed the export-D69955197 branch from bbca782 to 15a5738 on March 5, 2025 at 01:48
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 5, 2025
Summary:

Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 5, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Differential Revision: D69955197
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 5, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197
jwfromm pushed a commit to jwfromm/FBGEMM that referenced this pull request Mar 5, 2025
Summary:
Pull Request resolved: pytorch#3722

WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 6, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197
jiawenliu64 pushed a commit to jiawenliu64/FBGEMM that referenced this pull request Mar 6, 2025
Summary:
Pull Request resolved: pytorch#3722

WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197
jwfromm force-pushed the export-D69955197 branch from 15a5738 to 3e2ddfa on March 6, 2025 at 18:27
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 6, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197
jwfromm force-pushed the export-D69955197 branch from 3e2ddfa to 8407869 on March 7, 2025 at 17:32
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 7, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

jwfromm pushed a commit to jwfromm/FBGEMM that referenced this pull request Mar 7, 2025
Summary:
Pull Request resolved: pytorch#3722

WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request has been merged in 27724d9.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
Summary:
Pull Request resolved: facebookresearch/FBGEMM#846

X-link: pytorch#3722

Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197

fbshipit-source-id: 151b5dd96728b82b4bf5a4c3967310d055d56094