slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization #147436
base: master
Conversation
- Refactor the current functionality into a helper function
- Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function
- Add a codegen test
- Add benches for `eq_ignore_ascii_case`

The optimized function is initially only enabled for x86_64, which has `sse2` as part of its baseline, but none of the code is platform specific. Other platforms with SIMD instructions may also benefit from this implementation. Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a length check for greater than or equal to 16.
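For readers outside the PR, a minimal sketch of the chunked approach described above (illustrative only, not the actual std implementation; `chunks_exact` stands in here for the `as_chunks` call used in the PR, and the function name is made up):

```rust
// Compare equal-length byte slices ASCII-case-insensitively in 16-byte
// chunks so the inner loop can auto-vectorize. Illustrative sketch only.
fn eq_ignore_ascii_case_chunked(a: &[u8], b: &[u8]) -> bool {
    const N: usize = 16;
    if a.len() != b.len() {
        return false;
    }
    for (ca, cb) in a.chunks_exact(N).zip(b.chunks_exact(N)) {
        // Accumulate a single flag over the whole chunk instead of branching
        // per byte; this branchless shape is what encourages vectorization.
        let mut equal = true;
        for (&x, &y) in ca.iter().zip(cb) {
            equal &= x.to_ascii_lowercase() == y.to_ascii_lowercase();
        }
        if !equal {
            return false;
        }
    }
    // Scalar fallback for the remainder (the PR later reloads the last 16
    // bytes instead; see the tail-handling discussion).
    a.chunks_exact(N)
        .remainder()
        .iter()
        .zip(b.chunks_exact(N).remainder())
        .all(|(&x, &y)| x.to_ascii_lowercase() == y.to_ascii_lowercase())
}
```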
Refactor the eq check into an inner function for reuse in tail checking. Rather than fall back to the simple implementation for tail handling, load the last 16 bytes to take advantage of vectorization. This doesn't seem to negatively impact check time even when the remainder count is low.
I've pushed a commit to avoid falling back to the scalar checking for the remainder handling. We reload the last 16 bytes of the slices if there's a remainder, which improves the 31 byte case and doesn't seem to regress the 17 byte case.
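The overlapping-tail idea can be sketched as follows (illustrative only; the function name is made up and slice-level `eq_ignore_ascii_case` calls stand in for the PR's inner chunk function):

```rust
// Sketch of the tail-handling strategy: after processing the full 16-byte
// chunks, any remainder is handled by reloading the *last* 16 bytes of each
// slice. That window may overlap bytes already compared, but re-checking
// bytes is harmless for an equality test and keeps the tail vectorized.
fn eq_ignore_ascii_case_overlap(a: &[u8], b: &[u8]) -> bool {
    const N: usize = 16;
    if a.len() != b.len() {
        return false;
    }
    if a.len() < N {
        // Short inputs take the simple per-byte path.
        return a.eq_ignore_ascii_case(b);
    }
    let mut i = 0;
    while i + N <= a.len() {
        if !a[i..i + N].eq_ignore_ascii_case(&b[i..i + N]) {
            return false;
        }
        i += N;
    }
    // Overlapping reload of the final N bytes covers any remainder.
    a[a.len() - N..].eq_ignore_ascii_case(&b[b.len() - N..])
}
```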
Add #[inline(always)] to inner function and check not for filecheck test
let (other_chunks, _) = other.as_chunks::<N>();

// Branchless check to encourage auto-vectorization
#[inline(always)]
fix: you don't need `always` here. It's a non-generic function, which is why it went badly with no attribute, but `#[inline]` is sufficient.
- #[inline(always)]
+ #[inline]
Without `always`, it fails the filecheck test I've written and doesn't seem to inline, as shown in the godbolt. Left is `inline(always)`, right is `inline`:
https://rust.godbolt.org/z/Px6nMPWvK

edit: I realize above I included const generics. With all generics removed, it still doesn't seem to inline the inner call without `always`:
https://rust.godbolt.org/z/4Tq5hWY6T
// If there are remaining tails, load the last N bytes in the slices to
// avoid falling back to per-byte checking.
This is only correct because of a check in the caller that's not documented in the comments on this method nor in the code here.

I would suggest at least having a `debug_assert!(self.len() >= N)` or similar here to help communicate that invariant.
Agreed, I've been thinking about this since my last push. I think calling this function with a const generic parameter might be a better choice to encode the information, so I'm going to change it to that along with a `debug_assert` here.
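A hypothetical sketch of what encoding the invariant with a const generic plus a `debug_assert` could look like (the helper name and body are illustrative, not the actual std code):

```rust
// The caller guarantees both slices hold at least N bytes, so the
// last_chunk::<N>() calls below can never return None. The debug_assert
// documents and checks that invariant in debug builds.
fn tail_eq_ignore_ascii_case<const N: usize>(s: &[u8], other: &[u8]) -> bool {
    debug_assert!(s.len() >= N && other.len() >= N);
    match (s.last_chunk::<N>(), other.last_chunk::<N>()) {
        (Some(a), Some(b)) => a.eq_ignore_ascii_case(b),
        // Unreachable under the invariant; returning false keeps this safe.
        _ => false,
    }
}
```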
library/core/src/slice/ascii.rs
Outdated
const fn eq_ignore_ascii_inner(lhs: &[u8; N], rhs: &[u8; N]) -> bool {
    let mut equal_ascii = true;
    let mut j = 0;
    while j < N {
fix: there should be `const-hack` comments on these `while` loops (or the functions).
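For context on the `const-hack` convention: `for` loops desugar to iterator trait calls, which are not yet usable in `const fn`, so const code loops manually and marks the workaround. The body below is one plausible completion of the snippet shown above, not the exact PR code:

```rust
const fn eq_ignore_ascii_inner<const N: usize>(lhs: &[u8; N], rhs: &[u8; N]) -> bool {
    let mut equal_ascii = true;
    let mut j = 0;
    // FIXME(const-hack): replace with a `for` loop once iterators
    // are usable in `const fn`.
    while j < N {
        equal_ascii &= lhs[j].to_ascii_lowercase() == rhs[j].to_ascii_lowercase();
        j += 1;
    }
    equal_ascii
}
```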
// Process the chunks, returning early if an inequality is found
let mut i = 0;
while i < self_chunks.len() && i < other_chunks.len() {
Hmm, it feels wrong that we need to check both lengths here. The answer can never actually differ, right? (Because the caller checks equal-length first anyway.)
The function boundary split feels awkward, since it's losing information -- both in simple and non-simple -- that would be helpful to the optimizer when it comes to removing bounds checks.
What does the actual loop condition look like in llvm or asm here?
> Hmm, it feels wrong that we need to check both lengths here. The answer can never actually differ, right? (Because the caller checks equal-length first anyway.)
It does feel wrong, and you are correct about how they can never differ. I included this to eliminate an extra branch and panic in the loop. The comparison gets optimized out so there is no extra length check. I never got around to filing this one as a bug report.
> What does the actual loop condition look like in llvm or asm here?
Left is with both checks, right is with only 1
https://rust.godbolt.org/z/5Ks3z9MTM
Difference in the loop condition is this on the right, which jumps to a panic:

.LBB0_3:
    sub r8, 1
    jb .LBB0_15
edit: here's a more minimal version with just the chunk loop https://rust.godbolt.org/z/MGEv3qszY
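The point about the double length check reduces to a sketch like this (illustrative function, not the PR's code): the seemingly redundant second condition is what lets LLVM prove both index operations in bounds and drop the panic branch.

```rust
// Checking *both* lengths in the loop condition looks redundant (the caller
// guarantees the slices are equal length), but it lets the optimizer prove
// both `[i]` accesses are in bounds and delete the bounds-check/panic branch
// from the loop body. With only one length checked, the second indexing
// operation keeps a panic path.
fn all_chunks_eq(self_chunks: &[[u8; 16]], other_chunks: &[[u8; 16]]) -> bool {
    let mut i = 0;
    while i < self_chunks.len() && i < other_chunks.len() {
        if self_chunks[i] != other_chunks[i] {
            return false;
        }
        i += 1;
    }
    true
}
```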
- Add comments for the optimized function invariants to the caller
- Add const-hack fixme for using while-loops
- Document the invariant for the `_chunks` function
- Add a debug assert for the tail handling invariant
Addressed most of the feedback:
I changed the

There's one other possibly missed optimization (if it's correct) in the tail handling portion, but it requires unsafe. It's a small difference in instructions.

if let (Some(a_rem), Some(b_rem)) = (s.last_chunk::<N>(), other.last_chunk::<N>()) {
    if !eq_ignore_ascii_inner(a_rem, b_rem) {
        return false;
    }
}

// Can be rewritten as...
let a_rem: &[u8; N] = unsafe { &*s.as_ptr().add(s.len() - N).cast() };
let b_rem: &[u8; N] = unsafe { &*other.as_ptr().add(other.len() - N).cast() };
if !eq_ignore_ascii_inner(a_rem, b_rem) {
    return false;
}
Benchmarks - cases below 16 bytes are unaffected; cases above all show sizeable improvements.
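The reported shape can be spot-checked with a rough harness like the one below (not the libtest bench framework the PR's benches use; `black_box` keeps the comparison from being optimized away):

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Rough, illustrative timing harness: measures eq_ignore_ascii_case over a
// given slice length. Real measurements should use the libtest bench suite.
fn time_eq(len: usize, iters: u32) -> Duration {
    let a = vec![b'A'; len];
    let b = vec![b'a'; len];
    let start = Instant::now();
    for _ in 0..iters {
        // black_box prevents the call and its result from being removed.
        black_box(black_box(a.as_slice()).eq_ignore_ascii_case(black_box(&b)));
    }
    start.elapsed()
}
```

Comparing, say, `time_eq(8, n)` against `time_eq(64, n)` shows whether the 16-byte-gated path is being hit on a given machine.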