
Conversation

@Ryo-not-rio (Collaborator) commented Dec 20, 2024

pytorch-bot bot commented Dec 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/143666

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 2 Unrelated Failures

As of commit 0e90e42 with merge base 783f045:

NEW FAILURE - The following job has failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added module: cpu CPU specific problem (e.g., perf, algorithm) module: inductor labels Dec 20, 2024
@aditew01 aditew01 added the module: arm Related to ARM architectures builds of PyTorch. Includes Apple M1 label Dec 20, 2024
@aditew01 aditew01 requested a review from swolchok December 20, 2024 20:29
@aditew01 (Collaborator) commented:

@pytorchbot label "topic: not user facing"

@pytorch-bot pytorch-bot bot added the topic: not user facing topic category label Dec 20, 2024
@aditew01 aditew01 added the ciflow/linux-aarch64 linux aarch64 CI workflow label Dec 20, 2024
pytorch-bot bot commented Dec 20, 2024

To add the ciflow label ciflow/linux-aarch64 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Dec 20, 2024
@zou3519 zou3519 added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Dec 23, 2024
@swolchok (Contributor) left a comment:

not particularly familiar with SVE, but fairly sure I found a bug, see inline comments

Comment on lines +256 to +239
auto bf16_vec1 = svzip1_bf16(zero, a);
auto bf16_vec2 = svzip2_bf16(zero, a);
Contributor:

I would naively expect reinterpreting as u16 and doing a widening left shift to be better, but I have no particular hardware performance chart I'm referencing

Collaborator (author):

This is a possible alternative implementation; I've left it as is for now since it's the simplest approach I've come across.
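
For reference, a minimal sketch of the widening-shift alternative suggested above (my own illustration using the <arm_sve.h> ACLE intrinsics; the helper name is hypothetical and this is not the PR's code):

```cpp
#include <arm_sve.h>

// Widen a bf16 vector into two f32 vectors by zero-extending each 16-bit lane
// to 32 bits and shifting the payload into the top half of the word, which is
// the f32 encoding of the same bf16 value.
static inline void bf16_to_f32_widen_shift(svbfloat16_t a,
                                           svfloat32_t* lo,
                                           svfloat32_t* hi) {
  const svbool_t pg = svptrue_b32();
  const svuint16_t bits = svreinterpret_u16_bf16(a);
  // svunpklo/svunpkhi zero-extend the low/high halves of the u16 vector,
  // giving the same lane order as the zip1/zip2-with-zero approach.
  *lo = svreinterpret_f32_u32(svlsl_n_u32_x(pg, svunpklo_u32(bits), 16));
  *hi = svreinterpret_f32_u32(svlsl_n_u32_x(pg, svunpkhi_u32(bits), 16));
}
```

Either form is a handful of cheap integer operations; which one wins likely depends on the core's zip versus unpack/shift throughput.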

@@ -0,0 +1,857 @@
#pragma once
Contributor:

splitting fp16 out of vec256_bfloat16.h would be much easier to review as its own stacked PR

Collaborator (author):

I think this has to be in this PR as vec256_bfloat16.h has to be guarded separately in vec256.h

@swolchok (Contributor) left a comment:

another comparison bug (I think)

v2 = v2.rsqrt();
return convert_float_bfloat16(v1, v2);
auto [v3, v4] = convert_bfloat16_float(other);
return convert_float_bfloat16(v1 > v3, v2 > v4);
Contributor:

I think this might cause problems -- the masks returned by comparison operators are NaNs if set, and could end up being canonicalized by the conversion. https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec128/vec128_bfloat16_neon.h#L218 please make sure tests cover this

Collaborator (author):

I'm not sure I understand what you mean. I see that the comparison operators are covered in vec_test_all_types.cpp and they pass, so I'm assuming it is okay?

Contributor:

hmm, you are right that they're covered. If the conversion is implemented with integer shift instructions then it's going to be fine; if there's a hardware instruction that isn't actually just a left shift then it might do NaN canonicalization and wouldn't be fine. The existence of a hardware instruction that isn't just a left shift would be weird, so I guess this must be fine.
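
To make the concern concrete, here is a minimal scalar sketch (my own illustration, not PyTorch code) of why a shift-based f32-to-bf16 conversion preserves comparison masks while a NaN-canonicalizing convert might not:

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Truncating f32 -> bf16: keep the top 16 bits of the f32 encoding.
static uint16_t f32_to_bf16_shift(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return static_cast<uint16_t>(bits >> 16);
}

int main() {
  // A "true" lane from a float comparison mask is all-ones, which decodes as a NaN.
  uint32_t mask_bits = 0xFFFFFFFFu;
  float mask;
  std::memcpy(&mask, &mask_bits, sizeof(mask));

  // The shift-based path keeps the all-ones pattern, so the bf16 mask survives.
  std::printf("shift-based: 0x%04x\n", f32_to_bf16_shift(mask));  // prints 0xffff

  // A hardware convert that canonicalizes NaNs could instead emit a fixed
  // quiet-NaN pattern here, which would break the mask semantics; that is the
  // case the vec_test_all_types.cpp comparison tests need to exercise.
  return 0;
}
```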

@swolchok (Contributor) left a comment:

I think you need maintainer approval to actually merge this, but I personally have no more complaints. Thanks!

@Ryo-not-rio (Collaborator, author) commented:

@pytorchbot label "arm priority"

@Ryo-not-rio (Collaborator, author) commented:

@swolchok would you be able to trigger the pipelines by adding the "ciflow/linux-aarch64" label? Thanks!

@Ryo-not-rio (Collaborator, author) commented:

@malfet could you have a look at this PR please? Thanks!

@swolchok swolchok added the ciflow/linux-aarch64 linux aarch64 CI workflow label Jan 27, 2025
pytorch-bot bot commented Jan 27, 2025

To add the ciflow label ciflow/linux-aarch64 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Jan 27, 2025
@malfet malfet added the ciflow/trunk Trigger trunk jobs on your pull request label Jan 27, 2025
@malfet (Contributor) left a comment:

This is a big PR and I have not looked through it all, but the VecSVE_BF16 class, and the way it shifts detection of whether the ISA is supported from the class down the call stack, feels wrong to me. If you disagree, please elaborate on why this can't be incorporated into the existing VecSVE class.

Also, as suggested by @swolchok, do you mind splitting this into a series of smaller PRs for ease of review? It feels like a mix of refactoring and feature work.

And is SVE_BF16 supported on Graviton3? (I.e., should those perf gains be observable in the benchmarks the CI is running?)

TORCH_API bool is_arm_sve_supported();

// Detect if CPU supports Arm(R) architecture SVE ISA and BF16
TORCH_API bool is_arm_sve_bf16_supported();
Contributor:

Unrelated to this PR, but re-exporting cpuinfo APIs this way looks a bit like an anti-pattern...

cpu.def("_is_amx_fp16_supported", at::cpu::is_amx_fp16_supported);
cpu.def("_init_amx", at::cpu::init_amx);
cpu.def("_is_arm_sve_supported", at::cpu::is_arm_sve_supported);
cpu.def("_is_arm_sve_bf16_supported", at::cpu::is_arm_sve_bf16_supported);
Contributor:

Again, this looks like an anti-pattern. Perhaps a better API would be to return a named dict with all CPU properties...
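
A rough sketch of that suggestion (assuming the pybind11 `cpu` submodule these bindings live in and the usual `py = pybind11` namespace alias; the binding name and dict keys are hypothetical, and only the at::cpu predicates already shown here are used):

```cpp
// Sketch only: expose one dict of CPU capability flags instead of one binding
// per predicate.
cpu.def("_get_cpu_capabilities", []() {
  py::dict caps;
  caps["amx_fp16"] = at::cpu::is_amx_fp16_supported();
  caps["arm_sve"] = at::cpu::is_arm_sve_supported();
  caps["arm_sve_bf16"] = at::cpu::is_arm_sve_bf16_supported();
  return caps;
});
```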



@dataclasses.dataclass
class VecSVE_BF16(VecSVE):
Contributor:

Hmm, can you please elaborate on why one needs VecSVE_BF16 when there isn't, say, a VecAVX512_Amx?
IMO, architecture flavors and their respective build flags should be handled by one class, and its bool method should return true or false depending on whether the architecture is supported.

Collaborator (author):

There is indeed a separate class for AMX: VecAMX(). We could put the BF16 functionality inside VecSVE, but to me that feels like going against the existing pattern and making the code inconsistent.

Contributor:

So, considering that another PR wants to add 3 variants of VecSVE, do you plan to add 3 more variants, VecSVE128_BF16, VecSVE256_BF16, etc.?

@malfet (Contributor) commented Feb 4, 2025

Removing arm priority as the PR has 32 CI failures and conflicts with trunk. Please rebase, fix the failures, and then let's have another round of reviews.

nSircombe added a commit to nSircombe/Tool-Solutions that referenced this pull request Apr 2, 2025
 - Updates hashes for:
   - PyTorch 114d404b0720e8073748690faeb96449e5c0b229,
     2.8.0.dev20250327 from viable/strict
   - ideep to 719d8e6cd7f7a0e01b155657526d693acf97c2b3
     from ideep_pytorch
   - oneDNN to 5de25f354afee38bf2db61f485c729d30f62c611 from main
   - Compute Library to 9033bdacdc3840c80762bc56e8facb87b0e1048e,
     25.03 release
   - OpenBLAS to edef2e4441e50e3a2da1920fdbde09101087c43d from main
 - Removes WIP patches which have now landed in the upstream nightly
PyTorch builds.
 - Temporarily removes pytorch/pytorch#143666
   as it requires a rebase.
 - Removes '--tags --force' from the git clone command as it adds significant
   overhead to PyTorch checkout.
 - Pins cmake python package version to 3.31 to avoid known issue with
   cmake 4.0 - pytorch/pytorch#150167
@aditew01 (Collaborator) commented Apr 8, 2025

@Ryo-not-rio @malfet @swolchok
Link to overall benchmark from the TorchInductor Performance Dashboard.
It's worth highlighting that the average execution times show a general improvement for all the models; a screenshot of the Hugging Face results is attached.

Perf speedup may not be the best metric here for comparing the results, given this path will show improvements in eager as well as compile mode.

@aditew01 aditew01 requested a review from malfet April 8, 2025 09:29
@nikhil-arm nikhil-arm requested a review from digantdesai April 11, 2025 15:13
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
- Following the work in pytorch#119571, BF16 SVE intrinsics are added to the Vectorized class, providing ~1.7x speedup on `silu` and `softmax`.
- Added bf16 detection in CMake
- Added a guard for native NEON code to prevent compilation errors

@aditew01 @maajidkhann please have a look

Pull Request resolved: pytorch#143666
Approved by: https://github.com/swolchok, https://github.com/aditew01

Co-authored-by: Aditya Tewari <aditya.tewari@arm.com>
amathewc pushed a commit to amathewc/pytorch that referenced this pull request Apr 17, 2025
This reverts commit d072254.

Reverted pytorch#143666 on behalf of https://github.com/malfet due to I'm unsure why this PR got merged, as it doesn't have a valid review ([comment](pytorch#143666 (comment)))
@malfet (Contributor) commented Apr 28, 2025

@pytorchbot merge -i

pytorch-bot bot commented Apr 28, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

@malfet (Contributor) commented Apr 28, 2025

@pytorchbot merge -i

pytorch-bot bot commented Apr 28, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

@malfet (Contributor) commented Apr 28, 2025

@pytorchbot merge -i

pytorch-bot bot commented Apr 28, 2025

This PR has pending changes requested. Please address the comments and update the PR before merging.

@malfet malfet dismissed swolchok’s stale review April 28, 2025 18:16

Looks like it was addressed a while back

@malfet (Contributor) commented Apr 28, 2025

@pytorchbot merge -i

@pytorchmergebot (Collaborator) commented:

Merge started

Your change will be merged while ignoring the following 3 checks: pull / linux-jammy-xpu-2025.0-py3.9 / build, pull / cuda12.4-py3.10-gcc9-sm75 / test (pr_time_benchmarks, 1, 1, linux.g4dn.metal.nvidia.gpu), inductor / unit-test / linux-jammy-cpu-py3.12-gcc11-inductor-halide / build

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

@luentong commented:

Hi, I tested this SVE bfloat16 functionality on my SVE machine and inference is slower than with float32 (about 1.4x to 2.0x). Is this expected behavior, perhaps because the logic here first converts bfloat16 data back to another data format? Also, is it possible to add float16 support for SVE instructions as well? Thanks a lot.

@nikhil-arm (Collaborator) commented:

> Hi, I tested this SVE bfloat16 functionality on my SVE machine and inference is slower than with float32 (about 1.4x to 2.0x). Is this expected behavior, perhaps because the logic here first converts bfloat16 data back to another data format? Also, is it possible to add float16 support for SVE instructions as well? Thanks a lot.

Hello, can you share more details about your workload and maybe share a small reproducer?
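
For example, a small standalone microbenchmark along these lines would help narrow it down (a libtorch sketch; silu is one of the ops the PR reports speedups on, and the shapes and iteration counts are arbitrary):

```cpp
#include <torch/torch.h>
#include <chrono>
#include <cstdio>

// Time repeated eager-mode silu calls on a tensor of the given dtype.
static double seconds_per_silu(const torch::Tensor& x, int iters) {
  for (int i = 0; i < 10; ++i) torch::silu(x);  // warm-up
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) torch::silu(x);
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count() / iters;
}

int main() {
  torch::NoGradGuard no_grad;
  auto f32 = torch::randn({4096, 4096});
  auto bf16 = f32.to(torch::kBFloat16);
  std::printf("f32:  %.6f s/iter\n", seconds_per_silu(f32, 100));
  std::printf("bf16: %.6f s/iter\n", seconds_per_silu(bf16, 100));
  return 0;
}
```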


Labels

arm priority
ci-no-td - Do not run TD on this PR
ciflow/inductor
ciflow/mps - Run MPS tests (subset of trunk)
ciflow/trunk - Trigger trunk jobs on your pull request
Merged
module: arm - Related to ARM architectures builds of PyTorch. Includes Apple M1
module: compiled autograd - compiled_autograd
module: cpu - CPU specific problem (e.g., perf, algorithm)
module: dynamo
module: inductor
NNC
oncall: distributed - Add this issue/PR to distributed oncall triage queue
open source
release notes: quantization - release notes category
Reverted
topic: not user facing - topic category
triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
