clang-18: Aarch64: macos: memset pattern & always_inline attribute prevents copy elision of float constants in Neon code #91863
Author: angushewlett
clang 18.1 (Homebrew) emits calls to memset_pattern16 when a float is assigned to multiple Neon f32x4 elements in an array. This causes a serious performance regression in the scenario outlined below.
clang 17 (Homebrew) and clang 18 (trunk, 18.1.0rc, aarch64-unknown-linux-gnu) do not do this; they perform copy elision instead, which generates much faster code.
The behaviour only seems to occur when __attribute__((always_inline)) is set.
The two output examples below demonstrate the bug. The second is much slower because of its larger code size and its calls out to memset and similar routines.
Compile with:
Example program:
Output with #define force_inline_unroll 0:
Output with #define force_inline_unroll 1:
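For readers without the original listing, the pattern described above can be sketched roughly as follows. This is not the reporter's program: the names `f32x4`, `FILL_INLINE`, `fill_splat` and `sum_after_fill` are hypothetical, the way `force_inline_unroll` toggles the attribute is an assumption, and a GCC/Clang vector-extension type stands in for the Neon `float32x4_t` so the sketch also builds on non-AArch64 hosts. It only illustrates the shape of the code (a float broadcast into every vector of an array from an always_inline helper) and is not claimed to reproduce the memset_pattern16 call.

```cpp
#include <cstddef>

// Assumption: a portable stand-in for Neon's float32x4_t, built with the
// GCC/Clang vector extension instead of <arm_neon.h>.
typedef float f32x4 __attribute__((vector_size(16)));

// Assumption: force_inline_unroll switches always_inline on or off,
// loosely mirroring the define named in the issue.
#ifndef force_inline_unroll
#define force_inline_unroll 1
#endif

#if force_inline_unroll
#define FILL_INLINE inline __attribute__((always_inline))
#else
#define FILL_INLINE inline
#endif

// Hypothetical helper: broadcast one float into every lane of every
// vector in the array. This is the store pattern the issue describes.
template <std::size_t N>
FILL_INLINE void fill_splat(f32x4 (&dst)[N], float value) {
    const f32x4 splat = {value, value, value, value};
    for (std::size_t i = 0; i < N; ++i) {
        dst[i] = splat;
    }
}

// Hypothetical caller: fills eight vectors with 0.5f and sums all lanes,
// so the stores have an observable result.
float sum_after_fill() {
    f32x4 state[8];
    fill_splat(state, 0.5f);
    float total = 0.0f;
    for (std::size_t i = 0; i < 8; ++i) {
        for (int lane = 0; lane < 4; ++lane) {
            total += state[i][lane];
        }
    }
    return total;
}
```

Building this with `-Dforce_inline_unroll=0` and `-Dforce_inline_unroll=1` and comparing the generated assembly is one way to check whether a given compiler treats the two variants differently.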