Extend vec backend with BF16 SVE intrinsics #143666
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143666
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 2 Unrelated Failures as of commit 0e90e42 with merge base 783f045.
NEW FAILURE - The following job has failed:
BROKEN TRUNK - The following jobs failed but were present on the merge base:
👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@pytorchbot label "topic: not user facing"
To add the ciflow label, the workflows must first be approved; this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
swolchok left a comment:
Not particularly familiar with SVE, but I'm fairly sure I found a bug; see the inline comments.
auto bf16_vec1 = svzip1_bf16(zero, a);
auto bf16_vec2 = svzip2_bf16(zero, a);
I would naively expect reinterpreting as u16 and doing a widening left shift to be better, but I have no particular hardware performance data I'm referencing.
That is a possible alternative implementation; I've left it as is for now since it's the simplest way I've come across.
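
[Editor's note: for reference, the reinterpret-and-shift alternative suggested above might look roughly like the following untested sketch. The helper name bf16_to_f32_via_shift is invented; it assumes the standard ACLE SVE BF16 intrinsics.]

#include <arm_sve.h>

// Untested sketch: reinterpret the bf16 lanes as u16, zero-extend each half
// to u32, and shift the bits into the top half of every 32-bit lane. Since
// a bf16 value is exactly the top 16 bits of the corresponding f32, the
// shifted pattern is already a valid f32 encoding; no floating-point convert
// instruction is involved.
static inline void bf16_to_f32_via_shift(svbfloat16_t a,
                                         svfloat32_t* lo,
                                         svfloat32_t* hi) {
  svuint16_t bits = svreinterpret_u16_bf16(a);
  svuint32_t lo32 = svunpklo_u32(bits);   // zero-extend the low-half lanes
  svuint32_t hi32 = svunpkhi_u32(bits);   // zero-extend the high-half lanes
  svbool_t pg = svptrue_b32();
  *lo = svreinterpret_f32_u32(svlsl_n_u32_x(pg, lo32, 16));
  *hi = svreinterpret_f32_u32(svlsl_n_u32_x(pg, hi32, 16));
}

This produces the same lane ordering as the zip1/zip2-with-zero approach in the PR, which interleaves zeros below each bf16 element to build the f32 bit patterns directly.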
@@ -0,0 +1,857 @@
#pragma once
Splitting fp16 out of vec256_bfloat16.h would be much easier to review as its own stacked PR.
I think this has to stay in this PR, since vec256_bfloat16.h needs to be guarded separately in vec256.h.
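
[Editor's note: as a rough illustration, the separate guard in vec256.h could look something like the sketch below; the macro choice is an assumption, not necessarily what the PR actually does.]

// Hypothetical guard sketch (macro names assumed): only pull in the NEON
// bf16 implementation when the SVE256 BF16 path is not being compiled, so
// the two Vectorized<BFloat16> definitions don't clash.
#if !(defined(CPU_CAPABILITY_SVE256) && defined(__ARM_FEATURE_BF16))
#include <ATen/cpu/vec/vec256/vec256_bfloat16.h>
#endif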
swolchok left a comment:
another comparison bug (I think)
v2 = v2.rsqrt();
return convert_float_bfloat16(v1, v2);
auto [v3, v4] = convert_bfloat16_float(other);
return convert_float_bfloat16(v1 > v3, v2 > v4);
I think this might cause problems -- the masks returned by comparison operators are NaNs if set, and could end up being canonicalized by the conversion (see https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec128/vec128_bfloat16_neon.h#L218). Please make sure tests cover this.
I'm not sure I understand what you mean. I see that the comparison operators are covered in vec_test_all_types.cpp and they pass, so I'm assuming it is okay?
Hmm, you're right that they're covered. If the conversion is implemented with integer shift instructions then it's going to be fine; if there's a hardware instruction that isn't actually just a shift then it might do NaN canonicalization and wouldn't be fine. The existence of a hardware instruction that isn't just a shift would be weird, so I guess this must be fine.
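
[Editor's note: to make the concern concrete, here is a minimal scalar sketch, not from the PR, assuming IEEE-754 binary32 floats. It shows why a truncating shift-based conversion preserves comparison masks while a canonicalizing hardware convert might not.]

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  // A lane set by a vector comparison is all ones; as an f32 bit pattern
  // that is a NaN with a full payload.
  uint32_t mask = 0xFFFFFFFFu;
  float as_f32;
  std::memcpy(&as_f32, &mask, sizeof(as_f32));
  std::printf("mask as f32 is NaN: %d\n", as_f32 != as_f32);  // prints 1

  // A truncating "shift" conversion to bf16 keeps the top 16 bits verbatim,
  // so the all-ones mask survives as 0xFFFF.
  uint16_t bf16_trunc = static_cast<uint16_t>(mask >> 16);
  std::printf("shift-converted mask: 0x%04X\n", bf16_trunc);

  // A hardware float->bf16 convert, by contrast, is allowed to quiet or
  // canonicalize the NaN (e.g. to 0x7FC0), destroying the mask bits.
  return 0;
}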
I think you need maintainer approval to actually merge this, but I personally have no more complaints. Thanks!
@pytorchbot label "arm priority"
@swolchok would you be able to trigger the pipelines by adding the "ciflow/linux-aarch64" label? Thanks!
@malfet could you have a look at this PR please? Thanks!
To add the ciflow label, the workflows must first be approved; this helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.
This is a big PR and I have not looked through it all, but the VecSVE_BF16 class, and how detection of whether the ISA is supported is shifted from the class down the callstack, feels wrong to me. If you disagree, please elaborate on why this can't be incorporated into the existing VecSVE class.
Also, as suggested by @swolchok, do you mind splitting this one into a series of smaller PRs for ease of review? (It feels like a mix of refactoring and feature work.)
And is SVE_BF16 supported on Graviton3? (I.e., should those perf gains be observable in the benchmarks CI is running?)
aten/src/ATen/cpu/Utils.h (Outdated)
TORCH_API bool is_arm_sve_supported();

// Detect if CPU supports Arm(R) architecture SVE ISA and BF16
TORCH_API bool is_arm_sve_bf16_supported();
Unrelated to this PR, but re-exporting cpuinfo APIs this way looks a bit like an anti-pattern...
torch/csrc/cpu/Module.cpp (Outdated)
| cpu.def("_is_amx_fp16_supported", at::cpu::is_amx_fp16_supported); | ||
| cpu.def("_init_amx", at::cpu::init_amx); | ||
| cpu.def("_is_arm_sve_supported", at::cpu::is_arm_sve_supported); | ||
| cpu.def("_is_arm_sve_bf16_supported", at::cpu::is_arm_sve_bf16_supported); |
Again, this looks like an anti-pattern. Perhaps a better API would be to return a named dict with all CPU properties...
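
[Editor's note: as an illustration of that suggestion, a single binding returning all CPU properties might look roughly like this hypothetical sketch; _get_cpu_capabilities and initCpuCapabilities are invented names, not PyTorch API.]

#include <ATen/cpu/Utils.h>
#include <pybind11/pybind11.h>

namespace py = pybind11;

// Hypothetical sketch: one binding that reports every CPU capability at once,
// instead of re-exporting each cpuinfo query as its own _is_*_supported def.
void initCpuCapabilities(py::module& cpu) {
  cpu.def("_get_cpu_capabilities", []() {
    py::dict caps;
    caps["amx_fp16"] = at::cpu::is_amx_fp16_supported();
    caps["arm_sve"] = at::cpu::is_arm_sve_supported();
    caps["arm_sve_bf16"] = at::cpu::is_arm_sve_bf16_supported();
    return caps;
  });
}

Callers would then query one dict instead of a growing list of per-feature bindings.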
torch/_inductor/cpu_vec_isa.py (Outdated)

@dataclasses.dataclass
class VecSVE_BF16(VecSVE):
Hmm, can you please elaborate why one needs VecSVE_BF16, but there isn't, say, a VecAVX512_Amx?
IMO architecture flavors and their respective build flags should be handled by one class, and its `__bool__` method should return true or false depending on whether the architecture is supported.
There is indeed a separate class for AMX: VecAMX(). We could put the BF16 functionality inside VecSVE, but to me that feels like going against the existing pattern and making the code inconsistent.
So, considering that another PR wants to add 3 variants of VecSVE, do you plan to add 3 more variants: VecSVE128_BF16, VecSVE256_BF16, etc.?
- Updates hashes for:
  - PyTorch 114d404b0720e8073748690faeb96449e5c0b229, 2.8.0.dev20250327 from viable/strict
  - ideep to 719d8e6cd7f7a0e01b155657526d693acf97c2b3 from ideep_pytorch
  - oneDNN to 5de25f354afee38bf2db61f485c729d30f62c611 from main
  - Compute Library to 9033bdacdc3840c80762bc56e8facb87b0e1048e, 25.03 release
  - OpenBLAS to edef2e4441e50e3a2da1920fdbde09101087c43d from main
- Removes WIP patches which have now landed in the upstream nightly PyTorch builds.
- Temporarily removes pytorch/pytorch#143666 as it requires a rebase.
- Removes '--tags --force' from the git clone command as it adds significant overhead to PyTorch checkout.
- Pins the cmake python package version to 3.31 to avoid a known issue with cmake 4.0 - pytorch/pytorch#150167
@Ryo-not-rio @malfet @swolchok
- Following the work in pytorch#119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake.
- Added a guard for native NEON code to prevent compilation errors.

@aditew01 @maajidkhann please have a look

Pull Request resolved: pytorch#143666
Approved by: https://github.com/swolchok, https://github.com/aditew01
Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
This reverts commit d072254. Reverted pytorch#143666 on behalf of https://github.com/malfet due to I'm unsure why this PR got merged, as it doesn't have a valid review ([comment](pytorch#143666 (comment)))
@pytorchbot merge -i
This PR has pending changes requested. Please address the comments and update the PR before merging.
@pytorchbot merge -i
This PR has pending changes requested. Please address the comments and update the PR before merging.
@pytorchbot merge -i
This PR has pending changes requested. Please address the comments and update the PR before merging.
Looks like it was addressed a while back.
@pytorchbot merge -i
Merge started. Your change will be merged while ignoring the following 3 checks: pull / linux-jammy-xpu-2025.0-py3.9 / build, pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), inductor / unit-test / linux-jammy-cpu-py3.12-gcc11-inductor-halide / build. Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
Hi, I tested this SVE bfloat16 functionality on my SVE machine and inference is slower than float32 (about 1.4x to 2.0x slower). Is this expected behavior, perhaps because the logic here first converts the bfloat16 data back to another format? Also, is it possible to add float16 support for SVE instructions as well? Thanks a lot!
Hello, can you share more details about your workload and maybe share a small reproducer?

siluandsoftmax.@aditew01 @maajidkhann please have a look
cc @H-Huang @awgu @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @malfet @snadampal @milpuz01 @aditew01 @nikhil-arm @fadara01 @EikanWang @voznesenskym @penguinwu @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @chenyang78 @kadeng @muchulee8 @amjames @chauhang @aakhundov @xmfan @kwen2501 @c-p-i-o @yf225 @LucasLLC @MeetVadakkanchery @mhorowitz @pradeepfn @ekr0 @StrongerXi @ColinPeppler @desertfire