Add INT8 SDPA path for CPU #1372
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1372
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures
As of commit 7ed497a with merge base 25034e5.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@drisspg @jerryzh168 @jgong5 @leslie-fang-intel Please help review the POC, thanks!
Will do some more review on this later today, but I think we might want to make a subfolder in prototype for "inductor_patterns", since this is a pretty particular workflow, and we should add a good README explaining how it can be used and its limitations.
Do I understand correctly that this is using torch.compile to change the numerics of the model by hooking a quantization pass into inductor? If yes, can this live in prototype for now? I'd have concerns about using torch.compile passes to change numerics as the official API; among the challenges, it breaks the assumption that a compiler does not meaningfully change numerics.
I believe the numerics change happens in the pt2e quant API; this should only do fusion.
```python
def _sfdp_init_int8():
    for key, register_replacement_kwargs in _gen_sfdp_patterns_int8():
        register_replacement(**register_replacement_kwargs)
    config.joint_custom_pre_pass = patterns.apply
```
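For context, here is a minimal sketch of what each `register_replacement` call wires up, assuming the `torch._inductor.pattern_matcher` API; the search/replace functions below are illustrative stand-ins, not the PR's actual patterns:

```python
import torch
from torch._inductor.pattern_matcher import (
    PatternMatcherPass,
    fwd_only,
    register_replacement,
)

patterns = PatternMatcherPass()

def _search(q, k, v):
    # Illustrative stand-in for the unfused (quantized) SDPA subgraph to match.
    return torch.softmax(q @ k.transpose(-2, -1), dim=-1) @ v

def _replace(q, k, v):
    # Illustrative stand-in for the fused custom op that replaces it.
    return torch.ops.torchao.scaled_dot_product_int8(q, k, v)

# Each kwargs dict yielded by _gen_sfdp_patterns_int8() would carry
# entries along these lines (commented out since the inputs are fake):
# register_replacement(
#     search_fn=_search,
#     replace_fn=_replace,
#     example_inputs=[torch.randn(2, 4, 8, 8)] * 3,
#     trace_fn=fwd_only,
#     pass_dicts=patterns,
# )
```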
Is this the official API to add new fusion passes to inductor? What if we have multiple fusion passes that we need to add? I.e., we probably want to move all Intel quant passes to torchao in the future as well.
According to https://github.com/pytorch/pytorch/blob/main/torch/_inductor/config.py#L165, I suppose only one fusion pass can be assigned; the same holds for the other customized passes. It would be better to expand this to a list of passes. Maybe we need more comments from @Chillee @eellison.
Can we report an issue to track it? If multiple libraries register the `joint_custom_pre_pass`, it will silently fail to work.
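Until that is addressed upstream, one workaround is to fold several passes into the single slot yourself. This is a hypothetical sketch, not an official API; it assumes each pass is a callable over the FX graph, as `joint_custom_pre_pass` expects:

```python
import torch
from torch._inductor import config

def compose_passes(*passes):
    # Run each registered pass in order over the same FX graph.
    def combined(graph: torch.fx.Graph):
        for p in passes:
            p(graph)
    return combined

# joint_custom_pre_pass holds only one callable, so chain them explicitly
# (int8_sdpa_pass / other_pass are hypothetical names):
# config.joint_custom_pre_pass = compose_passes(int8_sdpa_pass, other_pass)
```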
Reported an issue: pytorch/pytorch#151876
There are currently some numeric issues with the kernel, which I will fix. Ideally, for a quantized dtype, the atol is expected to be 1.5 (refer to https://github.com/pytorch/pytorch/blob/main/test/quantization/core/test_quantized_op.py#L5167), and the rtol is the default value.
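As a concrete illustration of that tolerance choice (a hedged sketch; the helper name is hypothetical), the check would look like:

```python
import torch

def check_int8_sdpa(actual: torch.Tensor, expected: torch.Tensor) -> None:
    # For quantized dtypes, allow up to 1.5 units of absolute difference
    # (roughly one quantization step plus rounding); 1.3e-6 is the default
    # float32 rtol, passed explicitly since assert_close requires both
    # tolerances when either is given.
    torch.testing.assert_close(actual, expected, atol=1.5, rtol=1.3e-6)
```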
torchao/csrc/cpu/sdpa.cpp (Outdated)
```cpp
// dropout_p, is_causal, attn_mask, scale,
// q_zp, q_scale,
// k_zp, k_scale,
// v_zp, v_scale,
```
There's a decent amount of commented-out code here.
Thanks, the unused code has been removed.
@jerryzh168 or @drisspg, could you help make sure the PR summary includes a high-level description of the flow and explains why an inductor pass is used to swap in the final kernel? This isn't expected; it somewhat makes sense to me, but it would be good to explain the plan in a way that's easily discoverable.
cc @jerryzh168. The atol linked above is, AFAIK, for comparisons between high precision and low precision, which is numerics-changing. If we're saying that this doesn't change numerics, I'd expect atol/rtol to be near 0.
@vkuzo are you referring to
```python
@@ -71,6 +72,56 @@ def _(
     return _in_feats.new_empty((BS, OC))


def scaled_dot_product_int8(
```
Please add `for_cpu` to the op name, since we will likely add some GPU ops as well.
This is supposed to be used by all the backends, and each backend could register its own implementation. For example, for the CPU path: https://github.com/pytorch/ao/pull/1372/files#diff-eaf2387d03cf16395487f5f4162420a8e84bb89e9f1221a01474d2f87ff449ddR2080-R2082.
Or do you think the API for CPU and GPU could be different?
I think it's fine as is. We are planning to add a variant for FAv3-like fp8 attention, which would be slightly different I imagine, but it will be in prototype.
cc @jbschlosser
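For reference, a hedged usage sketch of the op under discussion. The keyword names mirror the commented parameter list in the kernel source above and are assumptions, as are the zero-point/scale values; running this requires a torchao build containing this PR:

```python
import torch
import torchao  # registers the torchao custom ops

# q/k/v are uint8 tensors as produced by the pt2e quantization flow.
q = torch.randint(0, 256, (1, 16, 64, 64), dtype=torch.uint8)
k = torch.randint(0, 256, (1, 16, 64, 64), dtype=torch.uint8)
v = torch.randint(0, 256, (1, 16, 64, 64), dtype=torch.uint8)

# Argument order follows the commented list: dropout_p, is_causal,
# attn_mask, scale, then per-tensor (zero_point, scale) for q/k/v.
out = torch.ops.torchao.scaled_dot_product_int8(
    q, k, v,
    dropout_p=0.0, is_causal=False, attn_mask=None, scale=None,
    q_zp=127, q_scale=0.01,
    k_zp=127, k_scale=0.01,
    v_zp=127, v_scale=0.01,
)
```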
### Description
During the support of INT8 SDPA (pytorch/ao#1372), we found that `at::vec::vec_reduce_all<int32_t>` would go into the slow scalar path when doing sum and max. So here, we support the two reduce-related ops `reduce_add` and `reduce_max` for `vec512` and `vec256`, using the Sequence instructions.

### Details
- Support vectorized `reduce_add` and `reduce_max` for dtypes `int32` and `float32`, using the Sequence instructions;
- Implement the scalar version for the fallback path in vec base;
- Add the operator `reduce` in vec base, in order to simplify the code.

Pull Request resolved: #144065
Approved by: https://github.com/mingfeima
In lowering, support the parameter `out_dtype` for `dequant_per_tensor` and `dequant_per_channel`. Fix the following runtime error found in pytorch/ao#1372:

```
  File "/home/liaoxuan/pytorch_ao/torch/_inductor/lowering.py", line 452, in wrapped
    out = decomp_fn(*args, **kwargs)
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
LoweringException: TypeError: quantized_decomposed_dequantize_per_tensor_default() got an unexpected keyword argument 'out_dtype'
  target: quantized_decomposed.dequantize_per_tensor.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg0_1', layout=FixedLayout('cpu', torch.uint8, size=[1, 7, 7, 9], stride=[441, 63, 9, 1]))
  ))
  args[1]: 0.01
  args[2]: 100
  args[3]: 0
  args[4]: 255
  args[5]: torch.uint8
  kwargs: {'out_dtype': torch.bfloat16}
```

Pull Request resolved: #143845
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
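A minimal repro of the fixed behavior, built directly from the args in the error trace above; under `torch.compile` this exercises the new lowering (the shape and quantization parameters are taken from the trace):

```python
import torch
import torch.ao.quantization.fx._decomposed  # noqa: F401, registers the quantized_decomposed ops

@torch.compile
def dequant(x):
    # Same args as in the trace: scale=0.01, zero_point=100,
    # quant_min=0, quant_max=255, dtype=uint8, out_dtype=bfloat16.
    return torch.ops.quantized_decomposed.dequantize_per_tensor(
        x, 0.01, 100, 0, 255, torch.uint8, out_dtype=torch.bfloat16
    )

x = torch.randint(0, 256, (1, 7, 7, 9), dtype=torch.uint8)
print(dequant(x).dtype)  # torch.bfloat16
```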
@Valentine233 do you have any perf benchmarks? Naively I would have imagined that SDPA isn't as crucial to implement on CPU, because the caches are larger and hardware-managed, but perhaps my intuition about CPU SDPA is lacking, so anything you could share would be super helpful.
We also encounter cache problems on CPU for long sequence lengths, so the fused SDPA does have perf benefits. I will share the results in the description as soon as they are ready.
I think so, since the
Yeah, since the new
Looks like there is a build issue? `ao/torchao/csrc/cpu/int8_sdpa.cpp:1:9: error: #pragma once in main file [-Werror,-Wpragma-once-outside-header]`
Will need to revert this one due to internal build errors; please land again.
Revert "Add INT8 SDPA path for CPU (pytorch#1372)"

This reverts commit 34421b1.
For the integration of INT8 SDPA into TorchAO, we design a feasible path by registering a customized PyTorch pass and adding the pattern matcher and kernel in TorchAO.
Steps:
1. Add the custom op `torchao.ops.scaled_dot_product_int8`, for CPU.
2. Add the pattern matcher, which fuses the quantized SDPA pattern into `torchao.ops.scaled_dot_product_int8`.
3. Register the customized pass via `torch._inductor.config.post_grad_custom_pre_pass` (see the end-to-end sketch below).
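Putting the steps together, a hedged end-to-end sketch of the intended flow. The pass-registration helper `_sfdp_init_int8` comes from this PR; the pt2e quantizer setup is the standard x86 inductor one, and the details here are illustrative rather than the PR's exact test script:

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
from torch.ao.quantization.quantizer.x86_inductor_quantizer import (
    X86InductorQuantizer,
    get_default_x86_inductor_quantization_config,
)

def quantize_and_compile(model, example_inputs):
    # 1. PT2E quantization: this is where the numerics actually change,
    #    not in the inductor pass.
    exported = torch.export.export_for_training(model, example_inputs).module()
    quantizer = X86InductorQuantizer()
    quantizer.set_global(get_default_x86_inductor_quantization_config())
    prepared = prepare_pt2e(exported, quantizer)
    prepared(*example_inputs)  # calibration
    converted = convert_pt2e(prepared)

    # 2-3. Register the INT8 SDPA fusion pass, then compile; inductor's
    # pattern matcher swaps the quantized SDPA subgraph for the fused
    # torchao.ops.scaled_dot_product_int8 kernel.
    # _sfdp_init_int8()  # pass-registration helper from this PR
    return torch.compile(converted)
```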
Perf:
The validation is run for int8-bf16 on a GNR machine; the script is similar to the UT.