Enable preshuffled mixed dtype Cutlass Gemm #3722


Closed · jwfromm wants to merge 1 commit

Conversation

@jwfromm (Contributor) commented on Feb 21, 2025:

Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Differential Revision: D69955197
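
For background (this is not FBGEMM's public API, just a plain-PyTorch reference of the math the kernels implement): a mixed-dtype FP8 x INT4 GEMM multiplies FP8 activations by int4 weights that are dequantized on the fly with per-group scales. A minimal sketch, assuming row-wise int4 quantization and a hypothetical group size of 128:

```python
import torch

def reference_f8i4_gemm(
    x_fp8: torch.Tensor,    # [M, K] activations, torch.float8_e4m3fn
    w_int4: torch.Tensor,   # [N, K] int4 values stored in an int8 tensor
    w_scale: torch.Tensor,  # [N, K // group_size] per-group scales
    group_size: int = 128,
) -> torch.Tensor:
    # Dequantize: each contiguous group of `group_size` weights along K
    # shares a single scale factor.
    w = w_int4.to(torch.float32)
    w = (w.view(w.shape[0], -1, group_size) * w_scale.unsqueeze(-1)).view_as(w)
    # The real kernels fuse dequantization into the GEMM and preshuffle
    # the weights for tensor-core-friendly loads; this is only the
    # numeric reference.
    return x_fp8.to(torch.float32) @ w.t()
```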

@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

@netlify netlify bot commented on Feb 21, 2025:

Deploy Preview for pytorch-fbgemm-docs ready!

🔨 Latest commit: b225a2a
🔍 Latest deploy log: https://app.netlify.com/sites/pytorch-fbgemm-docs/deploys/67cb5e4a023de300088a3c4e
😎 Deploy Preview: https://deploy-preview-3722--pytorch-fbgemm-docs.netlify.app

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 21, 2025
Summary:

WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 21, 2025
Summary:

WIP to enable new optimized preshuffled fp8xint4 gemm.

While the example compiles and runs, it runs into a variety of problems. The outputs are either completely incorrect, contain NaNs, or the kernel hits an Illegal Memory Access. I'm not yet sure why.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

@jwfromm (Contributor, Author) commented on Feb 21, 2025:

@IwakuraRein Despite this compiling and running, I'm getting incorrect outputs and very poor performance (even slower than the legacy f8i4 without packing or shuffling). Can you take a look and see if I'm doing something obviously wrong?

Ignore the files besides f8i4_shuffled.cu and mixed_dtype_utils.cu, as the others just fix CUTLASS v3.8 compatibility.

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Feb 22, 2025
Summary:

WIP to enable new optimized preshuffled fp8xint4 gemm.

While the example compiles and runs, it runs into a variety of problems. The outputs are either completely incorrect, contain NaNs, or the kernel hits an Illegal Memory Access. I'm not yet sure why.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197


@IwakuraRein commented:
@jwfromm Are there negative values in the scale factors? That might be the reason for the accuracy drop after enabling the lookup table, and it can be easily fixed by applying this change to external/cutlass/include/cutlass/gemm/collective/sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp in your fork.
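
A quick way to test this hypothesis before patching CUTLASS is to inspect the scale tensor on the Python side. A minimal sketch, assuming `scales` is whatever tensor is being passed to the kernel:

```python
import torch

def assert_scales_positive(scales: torch.Tensor) -> None:
    # The lookup-table dequantization path is suspected above to break
    # on negative scales, so flag any violations before digging into
    # the kernel itself.
    bad = int((scales < 0).sum().item())
    if bad:
        raise ValueError(f"found {bad} negative scale value(s)")
```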

@jwfromm (Contributor, Author) commented on Feb 25, 2025:

@IwakuraRein The scales are all positive and I'm running with the latest CUTLASS head commit (as of yesterday). The link you posted doesn't seem to include any changes to sm90_mma_tma_gmma_rs_warpspecialized_mixed_input.hpp; did you mean to paste a different one?

@IwakuraRein commented:
@jwfromm Sorry, I meant the changes to include/cutlass/detail/collective/mixed_input_utils.hpp in that link. But since your scales are all positive and you're running with the latest CUTLASS, I guess this is not the issue.

@IwakuraRein commented on Mar 3, 2025:

fbgemm_gpu/experimental/gen_ai/bench/quantize_ops.py:1145:

```diff
- scales = scales.view(x.shape[0], -1)
+ scales = scales.view(x.shape[0], -1).t().contiguous()
```

fbgemm_gpu/experimental/gen_ai/src/quantize/cutlass_extensions/mixed_dtype_utils.cu:59:

```diff
- StrideB stride_B;
+ StrideB stride_B = cutlass::make_cute_packed_stride(StrideB{}, shape_B);
```

These should fix the bugs.
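
For context on why these one-line changes matter: the first transposes the per-group weight scales from a row-major to a group-major layout, which is presumably the layout the preshuffled kernel expects, and the second initializes stride_B via CUTLASS's packed-stride helper instead of leaving it uninitialized, which would account for the illegal memory accesses. A minimal sketch of the scale-layout difference in plain PyTorch (sizes are illustrative, not FBGEMM's actual configuration):

```python
import torch

# Illustrative sizes: N weight rows, K reduction dim, one scale per
# group of 128 elements along K.
N, K, group_size = 4, 256, 128
scales = torch.rand(N * (K // group_size))

row_major = scales.view(N, -1)                     # [N, K // group_size]
group_major = scales.view(N, -1).t().contiguous()  # [K // group_size, N]

# Same values, different memory order: after the transpose, the scales
# of all N rows for a given group sit contiguously in memory.
print(row_major.shape, group_major.shape)  # [4, 2] and [2, 4]
```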

jwfromm force-pushed the export-D69955197 branch from bbca782 to 15a5738 on March 5, 2025 at 01:48
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 5, 2025
Summary:

Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 5, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Differential Revision: D69955197
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 5, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197
jwfromm pushed a commit to jwfromm/FBGEMM that referenced this pull request Mar 5, 2025
Summary:
Pull Request resolved: pytorch#3722

WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 6, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197
jiawenliu64 pushed a commit to jiawenliu64/FBGEMM that referenced this pull request Mar 6, 2025
Summary:
Pull Request resolved: pytorch#3722

WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197
jwfromm force-pushed the export-D69955197 branch from 15a5738 to 3e2ddfa on March 6, 2025 at 18:27
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 6, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197
jwfromm force-pushed the export-D69955197 branch from 3e2ddfa to 8407869 on March 7, 2025 at 17:32
jwfromm added a commit to jwfromm/FBGEMM that referenced this pull request Mar 7, 2025
Summary:
X-link: facebookresearch/FBGEMM#846


Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request was exported from Phabricator. Differential Revision: D69955197

jwfromm pushed a commit to jwfromm/FBGEMM that referenced this pull request Mar 7, 2025
Summary:
Pull Request resolved: pytorch#3722

WIP to enable new optimized preshuffled fp8xint4 gemm.

Differential Revision: D69955197
@facebook-github-bot (Contributor) commented:
This pull request has been merged in 27724d9.

q10 pushed a commit to q10/FBGEMM that referenced this pull request Apr 10, 2025
Summary:
Pull Request resolved: facebookresearch/FBGEMM#846

X-link: pytorch#3722

Enable new preshuffled FP8 x I4 kernels. These are the most performant mixed dtype kernels to date and dramatically outperform prior approaches including those in FBGEMM, marlin, and Machete.

Reviewed By: jiawenliu64

Differential Revision: D69955197

fbshipit-source-id: 151b5dd96728b82b4bf5a4c3967310d055d56094