slice/ascii: Optimize eq_ignore_ascii_case with auto-vectorization#147436
slice/ascii: Optimize eq_ignore_ascii_case with auto-vectorization#147436rust-bors[bot] merged 4 commits intorust-lang:mainfrom
eq_ignore_ascii_case with auto-vectorization#147436Conversation
Refactor the current functionality into a helper function Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function Add a codegen test Add benches for `eq_ignore_ascii_case` The optimized function is initially only enabled for x86_64 which has `sse2` as part of its baseline, but none of the code is platform specific. Other platforms with SIMD instructions may also benefit from this implementation. Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a length check for greater than or equal to 16.
Refactor the eq check into an inner function for reuse in tail checking Rather than fall back to the simple implementation for tail handling, load the last 16 bytes to take advantage of vectorization. This doesn't seem to negatively impact check time even when the remainder count is low.
|
I've pushed a commit to avoid falling back to the scalar checking for the remainder handling. We reload the last 16 bytes of the slices if there's a remainder, which improves the 31 byte case and doesn't seem to regress the 17 byte case. |
Add #[inline(always)] to inner function and check not for filecheck test
| let (other_chunks, _) = other.as_chunks::<N>(); | ||
|
|
||
| // Branchless check to encourage auto-vectorization | ||
| #[inline(always)] |
There was a problem hiding this comment.
fix: you don't need always here. It's a non-generic function which is why it went badly with no attribute, but #[inline] is sufficient.
| #[inline(always)] | |
| #[inline] |
There was a problem hiding this comment.
Without always, it fails the filecheck test I've written and doesn't seem to inline as shown in the godbolt.
Left is inline(always), right is inline
https://rust.godbolt.org/z/Px6nMPWvK
edit: I realize above I included const generics.
With all generics removed, it still doesn't seem to inline the inner call without always
https://rust.godbolt.org/z/4Tq5hWY6T
| // If there are remaining tails, load the last N bytes in the slices to | ||
| // avoid falling back to per-byte checking. |
There was a problem hiding this comment.
This is only correct because of a check in the caller that's not documented in the comments on this method nor in the code here.
I would suggest at least having a debug_assert!(self.len() >= N) or similar here to help communicate that invariant.
There was a problem hiding this comment.
Agreed, I've been thinking about this since my last push.
I think calling this function with a const generic parameter might be a better choice to encode the information, so I'm going to change it to that along with a debug_assert here.
library/core/src/slice/ascii.rs
Outdated
| const fn eq_ignore_ascii_inner(lhs: &[u8; N], rhs: &[u8; N]) -> bool { | ||
| let mut equal_ascii = true; | ||
| let mut j = 0; | ||
| while j < N { |
There was a problem hiding this comment.
fix: there should be const-hack comments on these while loops (or the functions).
|
|
||
| // Process the chunks, returning early if an inequality is found | ||
| let mut i = 0; | ||
| while i < self_chunks.len() && i < other_chunks.len() { |
There was a problem hiding this comment.
Hmm, it feels wrong that we need to check both lengths here. The answer can never actually differ, right? (Because the caller checks equal-length first anyway.)
The function boundary split feels awkward, since it's losing information -- both in simple and non-simple -- that would be helpful to the optimizer when it comes to removing bounds checks.
What does the actual loop condition look like in llvm or asm here?
There was a problem hiding this comment.
Hmm, it feels wrong that we need to check both lengths here. The answer can never actually differ, right? (Because the caller checks equal-length first anyway.)
It does feel wrong, and you are correct about how they can never differ. I included this to eliminate an extra branch and panic in the loop. The comparison gets optimized out so there is no extra length check. I never got around to filing this one as a bug report.
What does the actual loop condition look like in llvm or asm here?
Left is with both checks, right is with only 1
https://rust.godbolt.org/z/5Ks3z9MTM
Difference in the loop condition is this on the right which jumps to a panic
.LBB0_3:
sub r8, 1
jb .LBB0_15edit: here's a more minimal version with just the chunk loop https://rust.godbolt.org/z/MGEv3qszY
There was a problem hiding this comment.
This might be the classic problem of function boundaries really mattering to optimizations. If you look at
rust/library/core/src/slice/mod.rs
Lines 1005 to 1011 in b3cda16
for example that just adds an assert to let this stuff optimize out, then that assert optimizes out later.
Add comments for the optimized function invariants to the caller Add const-hack fixme for using while-loops Document the invariant for the `_chunks` function Add a debug assert for the tail handling invariant
|
Addressed most of the feedback:
I changed the There's one other possibly missed optimization (if it's correct) in the tail handling portion, but it requires unsafe. It's a small difference in instructions. if let (Some(a_rem), Some(b_rem)) = (s.last_chunk::<N>(), other.last_chunk::<N>()) {
if !eq_ignore_ascii_inner(a_rem, b_rem) {
return false;
}
}
// Can be rewritten as...
let a_rem: &[u8; N] = unsafe { &*s.as_ptr().add(s.len() - N).cast() };
let b_rem: &[u8; N] = unsafe { &*other.as_ptr().add(other.len() - N).cast() };
if !eq_ignore_ascii_inner(a_rem, b_rem) {
return false;
} |
|
Thanks for the updates. I think this can land as-is -- we can always tweak it more later if needed. @bors r+ rollup=iffy (new codegen test is always scary) |
…=scottmcm
slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization
- Refactor the current functionality into a helper function
- Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function
- Add a codegen test checking for vectorization and no panicking
- Add benches for `eq_ignore_ascii_case`
---
The optimized function is initially only enabled for x86_64 which has `sse2` as part of its baseline, but none of the code is platform specific. Other platforms with SIMD instructions may also benefit from this implementation.
Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a length check for greater than or equal to 16.
Benchmarks - Cases below 16 bytes are unaffected, cases above all show sizeable improvements.
```
before:
str::eq_ignore_ascii_case::bench_large_str_eq 4942.30ns/iter +/- 48.20
str::eq_ignore_ascii_case::bench_medium_str_eq 632.01ns/iter +/- 16.87
str::eq_ignore_ascii_case::bench_str_17_bytes_eq 16.28ns/iter +/- 0.45
str::eq_ignore_ascii_case::bench_str_31_bytes_eq 35.23ns/iter +/- 2.28
str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.56ns/iter +/- 0.22
str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.64ns/iter +/- 0.06
after:
str::eq_ignore_ascii_case::bench_large_str_eq 611.63ns/iter +/- 28.29
str::eq_ignore_ascii_case::bench_medium_str_eq 77.10ns/iter +/- 19.76
str::eq_ignore_ascii_case::bench_str_17_bytes_eq 3.49ns/iter +/- 0.39
str::eq_ignore_ascii_case::bench_str_31_bytes_eq 3.50ns/iter +/- 0.27
str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.27ns/iter +/- 0.09
str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.60ns/iter +/- 0.05
```
…=scottmcm
slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization
- Refactor the current functionality into a helper function
- Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function
- Add a codegen test checking for vectorization and no panicking
- Add benches for `eq_ignore_ascii_case`
---
The optimized function is initially only enabled for x86_64 which has `sse2` as part of its baseline, but none of the code is platform specific. Other platforms with SIMD instructions may also benefit from this implementation.
Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a length check for greater than or equal to 16.
Benchmarks - Cases below 16 bytes are unaffected, cases above all show sizeable improvements.
```
before:
str::eq_ignore_ascii_case::bench_large_str_eq 4942.30ns/iter +/- 48.20
str::eq_ignore_ascii_case::bench_medium_str_eq 632.01ns/iter +/- 16.87
str::eq_ignore_ascii_case::bench_str_17_bytes_eq 16.28ns/iter +/- 0.45
str::eq_ignore_ascii_case::bench_str_31_bytes_eq 35.23ns/iter +/- 2.28
str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.56ns/iter +/- 0.22
str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.64ns/iter +/- 0.06
after:
str::eq_ignore_ascii_case::bench_large_str_eq 611.63ns/iter +/- 28.29
str::eq_ignore_ascii_case::bench_medium_str_eq 77.10ns/iter +/- 19.76
str::eq_ignore_ascii_case::bench_str_17_bytes_eq 3.49ns/iter +/- 0.39
str::eq_ignore_ascii_case::bench_str_31_bytes_eq 3.50ns/iter +/- 0.27
str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.27ns/iter +/- 0.09
str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.60ns/iter +/- 0.05
```
Rollup of 5 pull requests Successful merges: - #151692 (Try to reduce rustdoc GUI tests flakyness) - #147436 (slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization) - #151390 (Reintroduce `QueryStackFrame` split.) - #151097 (Use an associated type default for `Key::Cache`.) - #151702 (Omit standard copyright notice)
Rollup merge of #147436 - okaneco:eq_ignore_ascii_autovec, r=scottmcm slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization - Refactor the current functionality into a helper function - Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function - Add a codegen test checking for vectorization and no panicking - Add benches for `eq_ignore_ascii_case` --- The optimized function is initially only enabled for x86_64 which has `sse2` as part of its baseline, but none of the code is platform specific. Other platforms with SIMD instructions may also benefit from this implementation. Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a length check for greater than or equal to 16. Benchmarks - Cases below 16 bytes are unaffected, cases above all show sizeable improvements. ``` before: str::eq_ignore_ascii_case::bench_large_str_eq 4942.30ns/iter +/- 48.20 str::eq_ignore_ascii_case::bench_medium_str_eq 632.01ns/iter +/- 16.87 str::eq_ignore_ascii_case::bench_str_17_bytes_eq 16.28ns/iter +/- 0.45 str::eq_ignore_ascii_case::bench_str_31_bytes_eq 35.23ns/iter +/- 2.28 str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.56ns/iter +/- 0.22 str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.64ns/iter +/- 0.06 after: str::eq_ignore_ascii_case::bench_large_str_eq 611.63ns/iter +/- 28.29 str::eq_ignore_ascii_case::bench_medium_str_eq 77.10ns/iter +/- 19.76 str::eq_ignore_ascii_case::bench_str_17_bytes_eq 3.49ns/iter +/- 0.39 str::eq_ignore_ascii_case::bench_str_31_bytes_eq 3.50ns/iter +/- 0.27 str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.27ns/iter +/- 0.09 str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.60ns/iter +/- 0.05 ```
Rollup of 5 pull requests Successful merges: - rust-lang/rust#151692 (Try to reduce rustdoc GUI tests flakyness) - rust-lang/rust#147436 (slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization) - rust-lang/rust#151390 (Reintroduce `QueryStackFrame` split.) - rust-lang/rust#151097 (Use an associated type default for `Key::Cache`.) - rust-lang/rust#151702 (Omit standard copyright notice)
Rollup of 5 pull requests Successful merges: - rust-lang/rust#151692 (Try to reduce rustdoc GUI tests flakyness) - rust-lang/rust#147436 (slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization) - rust-lang/rust#151390 (Reintroduce `QueryStackFrame` split.) - rust-lang/rust#151097 (Use an associated type default for `Key::Cache`.) - rust-lang/rust#151702 (Omit standard copyright notice)
as_chunksto encourage auto-vectorization in the optimized chunk processing functioneq_ignore_ascii_caseThe optimized function is initially only enabled for x86_64 which has
sse2as part of its baseline, but none of the code is platform specific. Other platforms with SIMD instructions may also benefit from this implementation.Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a length check for greater than or equal to 16.
Benchmarks - Cases below 16 bytes are unaffected, cases above all show sizeable improvements.