Skip to content

Conversation

okaneco
Copy link
Contributor

@okaneco okaneco commented Oct 7, 2025

  • Refactor the current functionality into a helper function
  • Use as_chunks to encourage auto-vectorization in the optimized chunk processing function
  • Add a codegen test checking for vectorization and no panicking
  • Add benches for eq_ignore_ascii_case

The optimized function is initially only enabled for x86_64 which has sse2 as part of its baseline, but none of the code is platform specific. Other platforms with SIMD instructions may also benefit from this implementation.

Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a length check for greater than or equal to 16.

Benchmarks - Cases below 16 bytes are unaffected, cases above all show sizeable improvements.

before:
    str::eq_ignore_ascii_case::bench_large_str_eq         4942.30ns/iter +/- 48.20
    str::eq_ignore_ascii_case::bench_medium_str_eq         632.01ns/iter +/- 16.87
    str::eq_ignore_ascii_case::bench_str_17_bytes_eq        16.28ns/iter  +/- 0.45
    str::eq_ignore_ascii_case::bench_str_31_bytes_eq        35.23ns/iter  +/- 2.28
    str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq       7.56ns/iter  +/- 0.22
    str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq    2.64ns/iter  +/- 0.06
after:
    str::eq_ignore_ascii_case::bench_large_str_eq         611.63ns/iter +/- 28.29
    str::eq_ignore_ascii_case::bench_medium_str_eq         77.10ns/iter +/- 19.76
    str::eq_ignore_ascii_case::bench_str_17_bytes_eq        3.49ns/iter  +/- 0.39
    str::eq_ignore_ascii_case::bench_str_31_bytes_eq        3.50ns/iter  +/- 0.27
    str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq      7.27ns/iter  +/- 0.09
    str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq   2.60ns/iter  +/- 0.05

Refactor the current functionality into a helper function
Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function
Add a codegen test
Add benches for `eq_ignore_ascii_case`

The optimized function is initially only enabled for x86_64 which has `sse2` as
part of its baseline, but none of the code is platform specific. Other
platforms with SIMD instructions may also benefit from this implementation.

Performance improvements only manifest for slices of 16 bytes or longer, so the
optimized path is gated behind a length check for greater than or equal to 16.
@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Oct 7, 2025
@rustbot
Copy link
Collaborator

rustbot commented Oct 7, 2025

r? @scottmcm

rustbot has assigned @scottmcm.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Refactor the eq check into an inner function for reuse in tail checking

Rather than fall back to the simple implementation for tail handling,
load the last 16 bytes to take advantage of vectorization. This doesn't
seem to negatively impact check time even when the remainder count is low.
@okaneco
Copy link
Contributor Author

okaneco commented Oct 7, 2025

I've pushed a commit to avoid falling back to the scalar checking for the remainder handling.

We reload the last 16 bytes of the slices if there's a remainder, which improves the 31 byte case and doesn't seem to regress the 17 byte case.

scalar tail handling
    ascii::eq_ignore_ascii_case::bench_long_str_eq          54.75ns/iter +/- 1.51
    ascii::eq_ignore_ascii_case::bench_str_17_bytes_eq       4.77ns/iter +/- 0.12
    ascii::eq_ignore_ascii_case::bench_str_31_bytes_eq      23.00ns/iter +/- 4.56
    ascii::eq_ignore_ascii_case::bench_str_of_8_bytes_eq     7.61ns/iter +/- 0.16
    ascii::eq_ignore_ascii_case::bench_str_under_8_bytes_eq  2.61ns/iter +/- 0.07
load last 16 bytes of the slice, newest commit
    ascii::eq_ignore_ascii_case::bench_long_str_eq          51.60ns/iter +/- 5.28
    ascii::eq_ignore_ascii_case::bench_str_17_bytes_eq       3.62ns/iter +/- 0.54
    ascii::eq_ignore_ascii_case::bench_str_31_bytes_eq       3.56ns/iter +/- 0.27
    ascii::eq_ignore_ascii_case::bench_str_of_8_bytes_eq     7.79ns/iter +/- 1.01
    ascii::eq_ignore_ascii_case::bench_str_under_8_bytes_eq  2.73ns/iter +/- 0.05

Add #[inline(always)] to inner function and check not for filecheck test
let (other_chunks, _) = other.as_chunks::<N>();

// Branchless check to encourage auto-vectorization
#[inline(always)]
Copy link
Member

@scottmcm scottmcm Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: you don't need always here. It's a non-generic function which is why it went badly with no attribute, but #[inline] is sufficient.

Suggested change
#[inline(always)]
#[inline]

Copy link
Contributor Author

@okaneco okaneco Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without always, it fails the filecheck test I've written and doesn't seem to inline as shown in the godbolt.

Left is inline(always), right is inline
https://rust.godbolt.org/z/Px6nMPWvK

edit: I realize above I included const generics.
With all generics removed, it still doesn't seem to inline the inner call without always
https://rust.godbolt.org/z/4Tq5hWY6T

Comment on lines +127 to +128
// If there are remaining tails, load the last N bytes in the slices to
// avoid falling back to per-byte checking.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only correct because of a check in the caller that's not documented in the comments on this method nor in the code here.

I would suggest at least having a debug_assert!(self.len() >= N) or similar here to help communicate that invariant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I've been thinking about this since my last push.

I think calling this function with a const generic parameter might be a better choice to encode the information, so I'm going to change it to that along with a debug_assert here.

const fn eq_ignore_ascii_inner(lhs: &[u8; N], rhs: &[u8; N]) -> bool {
let mut equal_ascii = true;
let mut j = 0;
while j < N {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fix: there should be const-hack comments on these while loops (or the functions).


// Process the chunks, returning early if an inequality is found
let mut i = 0;
while i < self_chunks.len() && i < other_chunks.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it feels wrong that we need to check both lengths here. The answer can never actually differ, right? (Because the caller checks equal-length first anyway.)

The function boundary split feels awkward, since it's losing information -- both in simple and non-simple -- that would be helpful to the optimizer when it comes to removing bounds checks.

What does the actual loop condition look like in llvm or asm here?

Copy link
Contributor Author

@okaneco okaneco Oct 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it feels wrong that we need to check both lengths here. The answer can never actually differ, right? (Because the caller checks equal-length first anyway.)

It does feel wrong, and you are correct about how they can never differ. I included this to eliminate an extra branch and panic in the loop. The comparison gets optimized out so there is no extra length check. I never got around to filing this one as a bug report.

What does the actual loop condition look like in llvm or asm here?

Left is with both checks, right is with only 1
https://rust.godbolt.org/z/5Ks3z9MTM

Difference in the loop condition is this on the right which jumps to a panic

.LBB0_3:
        sub     r8, 1
        jb      .LBB0_15

edit: here's a more minimal version with just the chunk loop https://rust.godbolt.org/z/MGEv3qszY

Add comments for the optimized function invariants to the caller
Add const-hack fixme for using while-loops
Document the invariant for the `_chunks` function
Add a debug assert for the tail handling invariant
@okaneco
Copy link
Contributor Author

okaneco commented Oct 14, 2025

Addressed most of the feedback:

  • Added comments about the invariants to the caller function
  • Added const-hack comments for using while-loops
  • Documented the invariant for the _chunks function in the function documentation
  • Added a debug assert for the tail handling invariant

I changed the _chunks function to take a const generic usize parameter to bubble up the invariant to the caller.


There's one other possibly missed optimization (if it's correct) in the tail handling portion, but it requires unsafe. It's a small difference in instructions.
Not sure if there's a safe way to golf this.
https://rust.godbolt.org/z/G1fPnE881

if let (Some(a_rem), Some(b_rem)) = (s.last_chunk::<N>(), other.last_chunk::<N>()) {
    if !eq_ignore_ascii_inner(a_rem, b_rem) {
        return false;
    }
}

// Can be rewritten as...
let a_rem: &[u8; N] = unsafe { &*s.as_ptr().add(s.len() - N).cast() };
let b_rem: &[u8; N] = unsafe { &*other.as_ptr().add(other.len() - N).cast() };
if !eq_ignore_ascii_inner(a_rem, b_rem) {
    return false;
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. T-libs Relevant to the library team, which will review and decide on the PR/issue.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants