Extend vec backend with BF16 SVE intrinsics #143666
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143666
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit 0e90e42 with merge base 783f045.
NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
To add the ciflow label, the workflows must first be approved; this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
swolchok left a comment:
Not particularly familiar with SVE, but I'm fairly sure I found a bug; see the inline comments.
auto bf16_vec1 = svzip1_bf16(zero, a);
auto bf16_vec2 = svzip2_bf16(zero, a);
I would naively expect reinterpreting as u16 and doing a widening left shift to be better, but I have no particular hardware performance data I'm referencing.
That is a possible alternative implementation; I've left it as is for now since it's the simplest way I've come across.
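
[Editor's note: for reference, the reinterpret-and-shift alternative suggested above might look roughly like the following untested sketch. The helper name bf16_to_f32_via_shift is invented; it assumes the standard ACLE SVE BF16 intrinsics.]

#include <arm_sve.h>

// Untested sketch: reinterpret the bf16 lanes as u16, zero-extend each half
// to u32, and shift the bits into the top half of every 32-bit lane. Since
// a bf16 value is exactly the top 16 bits of the corresponding f32, the
// shifted pattern is already a valid f32 encoding; no floating-point convert
// instruction is involved.
static inline void bf16_to_f32_via_shift(svbfloat16_t a,
                                         svfloat32_t* lo,
                                         svfloat32_t* hi) {
  svuint16_t bits = svreinterpret_u16_bf16(a);
  svuint32_t lo32 = svunpklo_u32(bits);   // zero-extend the low-half lanes
  svuint32_t hi32 = svunpkhi_u32(bits);   // zero-extend the high-half lanes
  svbool_t pg = svptrue_b32();
  *lo = svreinterpret_f32_u32(svlsl_n_u32_x(pg, lo32, 16));
  *hi = svreinterpret_f32_u32(svlsl_n_u32_x(pg, hi32, 16));
}

This produces the same lane ordering as the zip1/zip2-with-zero approach in the PR, which interleaves zeros below each bf16 element to build the f32 bit patterns directly.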
@@ -0,0 +1,857 @@
#pragma once
Splitting fp16 out of vec256_bfloat16.h would be much easier to review as its own stacked PR.
I think this has to stay in this PR, since vec256_bfloat16.h needs to be guarded separately in vec256.h.
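
[Editor's note: as a rough illustration, the separate guard in vec256.h could look something like the sketch below; the macro choice is an assumption, not necessarily what the PR actually does.]

// Hypothetical guard sketch (macro names assumed): only pull in the NEON
// bf16 implementation when the SVE256 BF16 path is not being compiled, so
// the two Vectorized<BFloat16> definitions don't clash.
#if !(defined(CPU_CAPABILITY_SVE256) && defined(__ARM_FEATURE_BF16))
#include <ATen/cpu/vec/vec256/vec256_bfloat16.h>
#endif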
swolchok left a comment:
another comparison bug (I think)
v2 = v2.rsqrt();
return convert_float_bfloat16(v1, v2);
auto [v3, v4] = convert_bfloat16_float(other);
return convert_float_bfloat16(v1 > v3, v2 > v4);
I think this might cause problems -- the masks returned by comparison operators are NaNs if set, and could end up being canonicalized by the conversion (see https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec128/vec128_bfloat16_neon.h#L218). Please make sure tests cover this.
I'm not sure I understand what you mean. I see that the comparison operators are covered in vec_test_all_types.cpp and they pass, so I'm assuming it is okay?
Hmm, you're right that they're covered. If the conversion is implemented with integer shift instructions then it's going to be fine; if there's a hardware instruction that isn't actually just a shift then it might do NaN canonicalization and wouldn't be fine. The existence of a hardware instruction that isn't just a shift would be weird, so I guess this must be fine.
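
[Editor's note: to make the concern concrete, here is a minimal scalar sketch, not from the PR, assuming IEEE-754 binary32 floats. It shows why a truncating shift-based conversion preserves comparison masks while a canonicalizing hardware convert might not.]

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // A lane set by a vector comparison is all ones; as an f32 bit pattern
  // that is a NaN with a full payload.
  uint32_t mask = 0xFFFFFFFFu;
  float as_f32;
  std::memcpy(&as_f32, &mask, sizeof(as_f32));
  std::printf("mask as f32 is NaN: %d\n", as_f32 != as_f32);  // prints 1

  // A truncating "shift" conversion to bf16 keeps the top 16 bits verbatim,
  // so the all-ones mask survives as 0xFFFF.
  uint16_t bf16_trunc = static_cast<uint16_t>(mask >> 16);
  std::printf("shift-converted mask: 0x%04X\n", bf16_trunc);

  // A hardware float->bf16 convert, by contrast, is allowed to quiet or
  // canonicalize the NaN (e.g. to 0x7FC0), destroying the mask bits.
  return 0;
}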
I think you need maintainer approval to actually merge this, but I personally have no more complaints. Thanks!
@pytorchbot label "arm priority"
@swolchok would you be able to trigger the pipelines by adding the "ciflow/linux-aarch64" label? Thanks!
@malfet could you have a look at this PR please? Thanks!
To add the ciflow label, the workflows must first be approved; this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
This is a big PR and I have not looked through it all, but the VecSVE_BF16 class, and how detection of whether the ISA is supported is shifted from the class down the callstack, feels wrong to me. If you disagree, please elaborate on why this can't be incorporated into the existing VecSVE class.
Also, as suggested by @swolchok, do you mind splitting this one into a series of smaller PRs for ease of review? (It feels like a mix of refactoring and feature work.)
And is SVE_BF16 supported on Graviton3? (I.e., should those perf gains be observable in the benchmarks CI is running?)
aten/src/ATen/cpu/Utils.h (Outdated)
TORCH_API bool is_arm_sve_supported();

// Detect if CPU supports Arm(R) architecture SVE ISA and BF16
TORCH_API bool is_arm_sve_bf16_supported();
Unrelated to this PR, but re-exporting cpuinfo APIs this way looks a bit like an anti-pattern...
torch/csrc/cpu/Module.cpp (Outdated)
| cpu.def("_is_amx_fp16_supported", at::cpu::is_amx_fp16_supported); | ||
| cpu.def("_init_amx", at::cpu::init_amx); | ||
| cpu.def("_is_arm_sve_supported", at::cpu::is_arm_sve_supported); | ||
| cpu.def("_is_arm_sve_bf16_supported", at::cpu::is_arm_sve_bf16_supported); |
Again, this looks like an anti-pattern. Perhaps a better API would be to return a named dict with all CPU properties...
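
[Editor's note: as an illustration of that suggestion, a single binding returning all CPU properties might look roughly like this hypothetical sketch; _get_cpu_capabilities and initCpuCapabilities are invented names, not PyTorch API.]

#include <ATen/cpu/Utils.h>
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Hypothetical sketch: one binding that reports every CPU capability at once,
// instead of re-exporting each cpuinfo query as its own _is_*_supported def.
void initCpuCapabilities(py::module& cpu) {
  cpu.def("_get_cpu_capabilities", []() {
    py::dict caps;
    caps["amx_fp16"] = at::cpu::is_amx_fp16_supported();
    caps["arm_sve"] = at::cpu::is_arm_sve_supported();
    caps["arm_sve_bf16"] = at::cpu::is_arm_sve_bf16_supported();
    return caps;
  });
}

Callers would then query one dict instead of a growing list of per-feature bindings.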
torch/_inductor/cpu_vec_isa.py (Outdated)

@dataclasses.dataclass
class VecSVE_BF16(VecSVE):
Hmm, can you please elaborate why one needs VecSVE_BF16, but there isn't, say, a VecAVX512_Amx?
IMO architecture flavors and their respective build flags should be handled by one class, and its `__bool__` method should return true or false depending on whether the architecture is supported.
There is indeed a separate class for AMX: VecAMX(). We could put the BF16 functionality inside VecSVE, but to me that feels like going against the existing pattern and making the code inconsistent.
So, considering that another PR wants to add 3 variants of VecSVE, do you plan to add 3 more variants: VecSVE128_BF16, VecSVE256_BF16, etc.?
- Updates hashes for:
  - PyTorch 114d404b0720e8073748690faeb96449e5c0b229, 2.8.0.dev20250327 from viable/strict
  - ideep to 719d8e6cd7f7a0e01b155657526d693acf97c2b3 from ideep_pytorch
  - oneDNN to 5de25f354afee38bf2db61f485c729d30f62c611 from main
  - Compute Library to 9033bdacdc3840c80762bc56e8facb87b0e1048e, 25.03 release
  - OpenBLAS to edef2e4441e50e3a2da1920fdbde09101087c43d from main
- Removes WIP patches which have now landed in the upstream nightly PyTorch builds.
- Temporarily removes pytorch/pytorch#143666 as it requires a rebase.
- Removes '--tags --force' from the git clone command as it adds significant overhead to PyTorch checkout.
- Pins the cmake python package version to 3.31 to avoid a known issue with cmake 4.0 - pytorch/pytorch#150167
@Ryo-not-rio @malfet @swolchok
- Following the work in pytorch#119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake.
- Added a guard for native NEON code to prevent compilation errors.

@aditew01 @maajidkhann please have a look

Pull Request resolved: pytorch#143666
Approved by: https://github.com/swolchok, https://github.com/aditew01
Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
This reverts commit d072254. Reverted pytorch#143666 on behalf of https://github.com/malfet due to I'm unsure why this PR got merged, as it doesn't have a valid review ([comment](pytorch#143666 (comment)))
@pytorchbot merge -i
This PR has pending changes requested. Please address the comments and update the PR before merging.
@pytorchbot merge -i
This PR has pending changes requested. Please address the comments and update the PR before merging.
@pytorchbot merge -i
This PR has pending changes requested. Please address the comments and update the PR before merging.
Looks like it was addressed a while back.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 3 checks: pull / linux-jammy-xpu-2025.0-py3.9 / build, pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), inductor / unit-test / linux-jammy-cpu-py3.12-gcc11-inductor-halide / build. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hi, I tested this SVE bfloat16 functionality on my SVE machine and inference is slower than float32 (about 1.4x to 2.0x slower). Is this expected behavior, perhaps because the logic here first converts the bfloat16 data back to another format? Also, is it possible to add float16 support for SVE instructions as well? Thanks a lot!
Hello, can you share more details about your workload and maybe share a small reproducer?

siluandsoftmax.@aditew01 @maajidkhann please have a look
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01 @EikanWang @voznesenskym @penguinwu @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @xmfan @kwen2501 @c-p-i-o @yf225 @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @ekr0 @StrongerXi @ColinPeppler @desertfire