
[scaled grouped mm] integrate triton kernels into differentiable scaled grouped mm #2077


Merged
9 commits merged into main on Apr 22, 2025

Conversation

@danielvegamyhre (Contributor) commented on Apr 18, 2025

Prior PR in stack: #2064

Summary

Performance

TL;DR: the Triton kernels give a ~1.25x-37x speedup over the CPU-loop baseline (for most shapes the speedup fell between 2x and 6x).
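
For context, the "CPU loop" baseline below corresponds to computing the per-group dynamic fp8 scales with a host-side loop over groups, which the fused Triton kernels replace. A simplified sketch of that looped pattern, purely for illustration (the group layout, scale formula, and helper name here are assumptions, not the exact torchao code):

```python
import torch

# Illustrative sketch of the looped baseline pattern (assumed, simplified).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def per_group_rowwise_fp8_quantize_loop(A: torch.Tensor, offs: torch.Tensor):
    """Quantize each group's rows of A to fp8 with rowwise dynamic scales,
    one group at a time from the host (the pattern the Triton kernels fuse)."""
    quantized, scales = [], []
    start = 0
    for end in offs.tolist():  # host-side loop: one chain of small kernels per group
        group = A[start:end]
        amax = group.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12)
        scale = (FP8_MAX / amax).to(torch.float32)
        quantized.append((group * scale).to(torch.float8_e4m3fn))
        scales.append(scale)
        start = end
    return torch.cat(quantized), torch.cat(scales)
```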

CPU loop:

A_shape        B_shape           high_precision_dtype      time_us
-------------  ----------------  ----------------------  ---------
(256, 4096)    (4, 4096, 4096)   torch.bfloat16            4324.35
(256, 4096)    (8, 4096, 4096)   torch.bfloat16            8197.2
(256, 4096)    (16, 4096, 4096)  torch.bfloat16           15830.9
(4096, 4096)   (4, 4096, 4096)   torch.bfloat16            5211.11
(4096, 4096)   (8, 4096, 4096)   torch.bfloat16            9003.3
(4096, 4096)   (16, 4096, 4096)  torch.bfloat16           16720.7
(65536, 4096)  (4, 4096, 4096)   torch.bfloat16           31257
(65536, 4096)  (8, 4096, 4096)   torch.bfloat16           34253.2
(65536, 4096)  (16, 4096, 4096)  torch.bfloat16           40141.3

Triton kernels:

A_shape        B_shape           high_precision_dtype      time_us
-------------  ----------------  ----------------------  ---------
(256, 4096)    (4, 4096, 4096)   torch.bfloat16            835.657
(256, 4096)    (8, 4096, 4096)   torch.bfloat16            835.657
(256, 4096)    (16, 4096, 4096)  torch.bfloat16            832.382
(4096, 4096)   (4, 4096, 4096)   torch.bfloat16            830.429
(4096, 4096)   (8, 4096, 4096)   torch.bfloat16           4666.3
(4096, 4096)   (16, 4096, 4096)  torch.bfloat16           7502.13
(65536, 4096)  (4, 4096, 4096)   torch.bfloat16            840.335
(65536, 4096)  (8, 4096, 4096)   torch.bfloat16          26359.4
(65536, 4096)  (16, 4096, 4096)  torch.bfloat16          27910.4
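
For reference, a minimal sketch of how a comparison like the one above could be driven (this is not the PR's actual bench script; `loop_impl` and `triton_impl` are hypothetical stand-ins for the two implementations, and the even row split per group via end offsets is an assumption):

```python
import torch
from triton.testing import do_bench  # times a callable with CUDA events, returns ms

def bench_us(fn) -> float:
    """Median runtime of fn() in microseconds."""
    return do_bench(fn) * 1e3

def run_case(loop_impl, triton_impl, M, K, G, N, dtype=torch.bfloat16):
    # Shapes mirror the tables above: A is (M, K), B is (G, K, N).
    A = torch.randn(M, K, device="cuda", dtype=dtype)
    B = torch.randn(G, K, N, device="cuda", dtype=dtype)
    # End offset of each group's rows in A (even split across G groups here).
    offs = torch.arange(M // G, M + 1, M // G, device="cuda", dtype=torch.int32)

    t_loop = bench_us(lambda: loop_impl(A, B, offs))
    t_triton = bench_us(lambda: triton_impl(A, B, offs))
    print(f"A=({M},{K}) B=({G},{K},{N}) loop={t_loop:.1f}us "
          f"triton={t_triton:.1f}us speedup={t_loop / t_triton:.2f}x")
```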

pytorch-bot (bot) commented on Apr 18, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2077

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (6 Unrelated Failures)

As of commit 6821e44 with merge base 9af2a45:

FLAKY - The following job failed but was likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label on Apr 18, 2025
@danielvegamyhre added the topic: improvement and topic: performance labels and removed the CLA Signed label on Apr 18, 2025
@facebook-github-bot added the CLA Signed label on Apr 18, 2025
@danielvegamyhre requested a review from drisspg on April 18, 2025 at 16:20
lint

update docstrings

add bench script

add bench script

bench against compile

comment

clean up

fix masks

lint

integrate triton kernels into scaled grouped mm

lint
@danielvegamyhre (Contributor, Author) commented:

Dr. CI confirmed the test failures are unrelated to this change (I manually confirmed as well; they are QAT-related).

@danielvegamyhre merged commit b8206d7 into main on Apr 22, 2025
12 of 18 checks passed
Labels: CLA Signed, topic: improvement, topic: performance
3 participants