Faster UTF-8 string validation #111367

Closed

Conversation

oliver-giersch
Contributor

This PR rewrites much of the UTF-8 validation code, optimizing it in particular for validating ASCII strings. This does not mean there is an artificial bias towards pure ASCII strings, just that this path is particularly well optimized. The optimizations come from two primary directions:

  • Checking larger blocks (up to 8 words at a time); the code for checking blocks is deliberately written to facilitate auto-vectorization using up to 512-bit SIMD instructions (on x86-64).
  • Reducing the number of superfluous checks. E.g., in the current version, when a non-ASCII byte is detected in an (at most 2-word) block, all bytes in that block are checked individually again, meaning up to 2 * mem::size_of::<usize>() - 1 bytes have to be checked twice. In the new version, the same block is checked again using word-wise operations, so that fewer bytes have to be checked individually. This is particularly important due to the larger block sizes (see the sketch below).
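
For illustration, a minimal sketch of what the word-wise check described above boils down to (not the exact code in this PR, but the same idea; NONASCII_MASK is the usual high-bit mask):

use std::mem::size_of;

// High bit of every byte set: a non-zero result of `word & NONASCII_MASK`
// means the word contains at least one non-ASCII byte (>= 0x80).
const NONASCII_MASK: usize = usize::from_ne_bytes([0x80u8; size_of::<usize>()]);

// Word-wise check over an 8-word block; the OR-reduction is written so that
// LLVM can turn it into wide SIMD instructions where available.
fn block_has_non_ascii(block: &[usize; 8]) -> bool {
    let mut acc = 0usize;
    for &word in block {
        acc |= word & NONASCII_MASK;
    }
    acc != 0
}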

The code for checking detected non-ASCII bytes/byte-sequences remains relatively unaltered, so no great differences are to be expected for non-ASCII strings. I evaluated the changes using 6 benchmarks, validating different string sizes ranging from 27 to ~200k bytes, the results for which can be inspected here. The best performance improvement is ~2x faster, the worst is ~17.5% slower (for short strings on a Raspberry Pi 4, with a relatively small regression in absolute terms).

Since UTF-8 validation is a pretty fundamental operation in Rust, these changes should ideally be evaluated on more platforms; I could only run these benchmarks on systems I had access to, which were 3 different x86-64 systems and one aarch64 system. Any help here would definitely be appreciated; the benchmarking code can be found here, in a separate repo.

Note: I found that the performance on aarch64 could be further improved by using a specialized version of the has_non_ascii_byte function like this:

#[cfg(not(target_feature = "avx"))]
#[inline(always)]
const fn has_non_ascii_byte<const N: usize>(block: &[usize; N]) -> bool {
    let mut i = 0;
    while i < N {
        if block[i] & NONASCII_MASK > 0 {
            return true;
        }
        i += 1;
    }

    false
}

I eventually decided against including this in the PR, since it is fairly arbitrary and probably depends on which optimization passes LLVM applies differently for different architectures, and I have no idea how this would affect other platforms I have not evaluated myself.

@rustbot
Collaborator

rustbot commented May 8, 2023

r? @joshtriplett

(rustbot has picked a reviewer for you, use r? to override)

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels May 8, 2023
@rustbot
Collaborator

rustbot commented May 8, 2023

Hey! It looks like you've submitted a new PR for the library teams!

If this PR contains changes to any rust-lang/rust public library APIs then please comment with @rustbot label +T-libs-api -T-libs to tag it appropriately. If this PR contains changes to any unstable APIs please edit the PR description to add a link to the relevant API Change Proposal or create one if you haven't already. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary.

Examples of T-libs-api changes:

  • Stabilizing library features
  • Introducing insta-stable changes such as new implementations of existing stable traits on existing stable types
  • Introducing new or changing existing unstable library APIs (excluding permanently unstable features / features without a tracking issue)
  • Changing public documentation in ways that create new stability guarantees
  • Changing observable runtime behavior of library APIs

@workingjubilee
Member

Checking larger blocks (up to 8 words at a time) and the code for checking blocks is deliberately written to facilitate auto-vectorization using up to 512-bit SIMD instructions (on x86-64).

Be aware that the standard library is precompiled, so many operations which use UTF-8 validation in the standard library will simply not be vectorized to use AVX512 instructions. Instead, they will use SSE2 instructions, or whatever else is in the target definition.

@bors try @rust-timer queue

@rust-timer
Collaborator

Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf

@rustbot rustbot added the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 8, 2023
@bors
Collaborator

bors commented May 8, 2023

⌛ Trying commit eab17e7 with merge 91111e127332600ac24a69ec23df526b5579a3ef...

@rust-log-analyzer

This comment has been minimized.

@workingjubilee
Member

Oh dear. Well, I suppose this may be a very short bors try.

@workingjubilee
Member

I hope you don't mind me lightening your load slightly, Josh. 🙏

r? @workingjubilee

@rust-log-analyzer

This comment has been minimized.

@bors
Collaborator

bors commented May 8, 2023

💔 Test failed - checks-actions

@bors bors added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels May 8, 2023
@workingjubilee workingjubilee removed the S-waiting-on-perf Status: Waiting on a perf run to be completed. label May 8, 2023
@rust-log-analyzer

This comment has been minimized.

} else {
// byte is < 128 (ASCII), but pointer is not word-aligned, skip
// until the loop reaches the next word-aligned block)
for _ in 0..offset {
Member

This can just use a while like the rest of these cases, as we've removed ~const traits in the rest of std.

Suggested change
-    for _ in 0..offset {
+    let mut i = 0;
+    while i < offset {
+        i += 1;

Contributor Author

Yeah, sorry for missing that one. Fixed in the latest commit.

@the8472
Member

the8472 commented May 8, 2023

Note that there's another PR open #107760

Member

@workingjubilee workingjubilee left a comment

Have you tried benching a simpler version that doesn't use the 4-word and 8-word blocks, or indeed just skips "words" altogether, and instead uses 16-byte alignment unconditionally? We have several 32-bit targets that still have the capability to use 128-bit hardware vectors and still prefer vector alignment when doing so. This kind of ASCII-specific fast-path should be able to do more than a 2x speedup, though it might be hard to accomplish when working mostly with autovectorization.
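
Roughly something of this shape, just to sketch what I mean (not a drop-in replacement, and the 16-byte chunking is only illustrative):

// Process fixed 16-byte chunks so LLVM sees a 128-bit-friendly shape
// regardless of the target's word size; handle the tail byte-wise.
fn is_ascii_16_byte_chunks(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(16);
    let bulk_ok = (&mut chunks).all(|chunk| {
        // OR all bytes together; the high bit survives iff some byte is >= 0x80.
        chunk.iter().fold(0u8, |acc, &b| acc | b) < 0x80
    });
    bulk_ok && chunks.remainder().iter().all(|&b| b < 0x80)
}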

Comment on lines +277 to +280
3 => {
let Some([_, b1, b2]) = subarray(buf, curr) else {
err!(None);
};
Member

If you want to take the previous suggestion.

Suggested change
-    3 => {
-        let Some([_, b1, b2]) = subarray(buf, curr) else {
-            err!(None);
-        };
+    Some([_, b1, b2]) => {

Contributor Author

(Sorry I screwed up and accidentally applied the initial suggestion instead of commenting)

This would not compile, since a single call to the (const-generic) subarray cannot return arrays of different sizes.

Member

Oh, you're right.

...I'm not sure just slicing wouldn't be better, but fair.

Comment on lines +325 to +341
let mut vector = [0; N];

let mut i = 0;
while i < N {
    vector[i] = block[i] & NONASCII_MASK;
    i += 1;
}

i = 0;
while i < N {
    if vector[i] > 0 {
        return true;
    }
    i += 1;
}

false


Suggested change
-let mut vector = [0; N];
-let mut i = 0;
-while i < N {
-    vector[i] = block[i] & NONASCII_MASK;
-    i += 1;
-}
-i = 0;
-while i < N {
-    if vector[i] > 0 {
-        return true;
-    }
-    i += 1;
-}
-false
+let mut i = 0;
+while i < N {
+    if block[i] & NONASCII_MASK > 0 {
+        return true;
+    }
+    i += 1;
+}
+false

Just curious if doing this in one block is sufficient or if I missed something.

Contributor Author

Yes, it is functionally sufficient, but I have found that the "simple" version does not get auto-vectorized, at least on x86-64, and runs much slower. On aarch64, the generated code actually ran faster, cf. my initial comment. Of course, it is not a very nice solution to conform to the particular optimization abilities of the current version of a code-generation backend, but the alternative would be writing explicit SIMD code.

@thomcc
Member

thomcc commented May 9, 2023

Note that the standard library is currently always compiled in advance¹ for the target's baseline set of target features (which is SSE2 for almost all the x86_64 targets) -- so the -Ctarget-cpu=native benchmarks in your link are somewhat irrelevant, as is the attempt to tune for AVX512.

Anyway, it's honestly really hard to parse the benchmark numbers as they are now. Can you please provide the full result-set (for a single x86_64 machine, perhaps linking to a gist containing the aarch64 results) in one place? The criterion terminal output quoted in plaintext is probably fine.

The best performance improvement is ~2x faster, the worst is ~17.5% slower

So, with the caveat that there's no hard-and-fast rule here (on its own any regression percentage could be acceptable given what it's measuring and how much of a speedup it comes with)... A 17.5% performance hit for any case is almost certainly going to be a dealbreaker for from_utf8.

Basically, for very tiny inputs which are noisy, it'd be okay to regress a couple percent, and totally artificial benches can be fine to regress if they're longer (deliberately pathological worst-cases, for example)

But for realistic inputs which are long enough not to be dominated by noise... I don't think regressing them at all is really on the table, even if it comes with other optimizations. Maybe if it's a huge win...

That might sound harsh, but keep in mind that due to cache and branch-prediction effects, your optimization will almost always perform better in a benchmark than in real life. The benchmarks measure a best-case scenario (one that is unrealistically good), so any regression will probably be worse than it looks, and any optimization won't be as good.

That said, that isn't a hard and fast rule, it's definitely case by case (for example, if you could somehow magically make ASCII 100x faster in a wide range of common scenarios, then we're very likely to take some regressions elsewhere). Feel free to submit stuff you aren't sure about.

I evaluated the changes using 6 benchmarks, validating different string sizes ranging from 27 to ~200k bytes, the results for which can be inspected here

Unfortunately, as they are now, these benchmarks are not nearly sufficient to justify this change. (I'll try to find where my repo of benchmarks for this very function went; I can't believe I didn't at least put it on github... ⚰️)

Currently, the benchmarks contain (may be off by one or two on the byte counts, not the point):

  • short: a 26 byte ASCII string
  • medium: a 568 byte ASCII string
  • long: a 2861 byte ASCII string
  • hamlet: a 191725 byte ASCII string
  • short-utf8: a 31 byte mostly-ASCII string (10 non-ASCII characters, 2-byte).
  • mostly-ascii: which is documented as "German text, 16'240 characters, 232 thereof non-ASCII."

Notably missing here are more sizes (2kb is not large, you should go up to maybe 10MB or so) and way more kinds of mixed-ASCII inputs (these are important and really easy to regress when optimizing for ASCII), as well as entirely non-ASCII inputs.

I also ideally would like stuff around different alignments (the current code optimizes for this), but let's start with this. Also note that this is before considering the implementation details -- these are pretty much the baseline I think we need to justify an ASCII-specific optimization to such a performance-critical function.

(Note that I'm pretty sorry that we don't just have good benchmarks you can easily run. This is A Lot to lay on a contributor... I'll try to dig up the ones I wrote for #107760)

The sizes are pretty simple² to add, but mixed-ASCII can be really hard.

More mixed-ASCII benches

It's very likely that this changes the performance of real-world mixed-ASCII inputs, which are extremely common.

It's hard to generate accurate examples of this randomly, so using real-world text is kinda a must.

  1. Some non-English wikipedia page, probably Chinese. These are mostly ASCII due to markup, source code, and whitespace, but I'd benchmark two versions:

    • The download from wikipedia verbatim.
    • The download from wikipedia, but with consecutive white-space coalesced (something like s.split_ascii_whitespace().collect::<Vec<_>>().join(" ")). This is not entirely "real-world", but wikipedia pages can easily be over 50% whitespace, and this avoids needing to hunt down a separate example for more tightly packed markup.

    You can't check wp sources into this repository, so it's enough to just measure and ensure it's not regressed.

  2. Plain-text books written in a foreign language, which you can download from Project Gutenberg (use the .txt versions of course, and generally I'd recommend deleting the English metadata from the book (usually at the start and end), since it otherwise makes it harder to substring the file to get something like a "50kb string of $lang").

    Beyond this, it's worth getting a book in a couple languages, since they can measure different things. For your case, I'd get one of each of:

    • Chinese or something else with very little ASCII aside from newlines and certain punctuation. This should have very little or no performance change, ideally.

    • Greek or something else with non-ASCII words but ASCII whitespace (short segments of ASCII in mostly non-ASCII is easy to regress, for example if you too eagerly enter a loop for bulk processing).

    • Vietnamese, or something else with a lot of diacritics but otherwise ASCII words and whitespace.

    • Anything else you think might pick up a case not covered by the above.

  3. Finally, measure completely non-ASCII input, for use as a baseline. Even if this is just something_nonascii.chars().filter(|s| !s.is_ascii()).collect::<String>() or whatever. This is mostly to ensure there's no surprise regressions.

This is not a complete list, and it's okay to decide to do your own thing instead, it's just an example of what I'd recommend, not a hard requirement for what you must measure (If you can show it doesn't regress cases like these in another way, that's cool too).

Finally, since this is a lot of things to measure, please label the benchmarks clearly (something like new/greek-plaintext/1MB or std/chinese-wiki/500kb, rather than names like "short", "long", "hamlet", etc.).

Anyway, I have a lot of experience in this area (I even have my own optimization PR³, but have not been working on it recently, so don't take it as a blocker), and a lot of opinions on it, so I'll take this review.

r? @thomcc

Footnotes

  1. When -Zbuild-std eventually stabilizes then users could tune for their CPU by using --build-std while also setting -Ctarget-cpu=... or -Ctarget-feature=+..., but it's likely that the vast majority of users will not do this, so the baseline target feature set is what the stdlib needs to optimize for.

    (In the future this may be improved, but probably by letting std provide a runtime-feature detection mechanism to libcore, probably using some kind of weak linking)

  2. Note that for generating e.g. arbitrary amounts of a particular input example (for example, turning a 60kb plain-text Greek book you've include_str!'d into the 512kb that you want to benchmark), just do something like this: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=35e3ebef32db12f04d19044967aab96b (a rough sketch of the idea follows after these footnotes).

  3. https://github.com/rust-lang/rust/pull/107760, which I need to get back to but have had a bad string of not being able to work on it (due to professional, personal, and family obligations). That said, I can't believe I didn't upload the benchmarks anywhere, so I'll try to find them for you.
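
For footnote 2, the input-growing helper I have in mind is roughly the following (a sketch of the idea, not the exact playground snippet):

// Repeats `sample` past `target_len` bytes, then truncates back to (at most)
// `target_len` on a char boundary so the result stays valid UTF-8.
fn grow_input(sample: &str, target_len: usize) -> String {
    assert!(!sample.is_empty());
    let mut out = sample.repeat(target_len / sample.len() + 1);
    let mut end = target_len.min(out.len());
    while !out.is_char_boundary(end) {
        end -= 1;
    }
    out.truncate(end);
    out
}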

@rust-log-analyzer

This comment has been minimized.

@oliver-giersch oliver-giersch force-pushed the faster_utf8_validation branch from 4c3559d to 07060be Compare May 9, 2023 08:48
@rust-log-analyzer
Collaborator

The job mingw-check failed! Check out the build log: (web) (plain)


@oliver-giersch
Contributor Author

First, thanks for the very detailed reply!

Note that the standard library is currently always compiled in advance1 for the target's baseline set of target features (which is SSE2 for almost all the x86_64 targets) -- so the -Ctarget-cpu=native benchmarks in your link are somewhat irrelevant, as is the attempt to tune for AVX512.

Got it, I will exclude any target-cpu=native from now on.

Anyway, it's honestly really hard to parse the benchmark numbers as they are now. Can you please provide the full result-set (for a single x86_64 machine, perhaps linking to a gist containing the aarch64 results) in one place? The criterion terminal output quoted in plaintext is probably fine.

Ok, I will redo the benchmarks and provide plain text results.

The best performance improvement is ~2x faster, the worst is ~17.5% slower

I should have qualified this better, all regressions are indeed in the very small string benchmarks (<32 bytes) and all on the Raspberry Pi, which seemed to be inherently more noisy for some reason. Specifically, I refer to this run, where the absolute difference in average runtime is ~7.5ns.

So, with the caveat that there's no hard-and-fast rule here (on its own any regression percentage could be acceptable given what it's measuring and how much of a speedup it comes with)... A 17.5% performance hit for any case is almost certainly going to be a dealbreaker for from_utf8.

Basically, for very tiny inputs which are noisy, it'd be okay to regress a couple percent, and totally artificial benches can be fine to regress if they're longer (deliberately pathological worst-cases, for example)

But for realistic inputs which are long enough not to be dominated by noise... I don't think regressing them at all is really on the table, even if it comes with other optimizations. Maybe if it's a huge win...

That might sound harsh, but keep in mind that due to cache and branch-prediction effects, your optimization will almost always perform better in a benchmark than in real life. The benchmarks measure a best-case scenario (one that is unrealistically good), so any regression will probably be worse than it looks, and any optimization won't be as good.

This is indeed valid criticism, although I have to check what criterion's methodology for running benchmarks is and if there is perhaps a way to force "cold" runs exclusively.

Unfortunately, as they are now, these benchmarks are not nearly sufficient to justify this change. [...]

These are all very reasonable suggestions and I will try to address them with an updated benchmark suite.

@thomcc
Member

thomcc commented May 9, 2023

I ran your benchmarks myself (only once each, stripped warnings about outliers, may not be representative).

# aarch64 (Apple M1 Max)
hamlet/fast             time:   [4.2781 µs 4.2880 µs 4.2984 µs]
hamlet/std              time:   [5.6628 µs 5.6747 µs 5.6877 µs]
mostly-ascii/fast       time:   [2.2430 µs 2.2476 µs 2.2535 µs]
mostly-ascii/std        time:   [2.2605 µs 2.2657 µs 2.2718 µs]
long/fast               time:   [71.570 ns 71.826 ns 72.108 ns]
long/std                time:   [97.607 ns 97.942 ns 98.320 ns]
medium/fast             time:   [18.379 ns 18.426 ns 18.478 ns]
medium/std              time:   [21.883 ns 21.922 ns 21.965 ns]
short/fast              time:   [7.8515 ns 7.8656 ns 7.8815 ns]
short/std               time:   [6.8869 ns 6.9047 ns 6.9256 ns]
short-utf8/fast         time:   [30.292 ns 30.400 ns 30.509 ns]
short-utf8/std          time:   [19.451 ns 19.486 ns 19.527 ns]

# x86_64 (Intel i9)
hamlet/fast             time:   [5.1238 µs 5.1493 µs 5.1795 µs]
hamlet/std              time:   [5.1147 µs 5.1376 µs 5.1668 µs]
mostly-ascii/fast       time:   [1.6883 µs 1.6966 µs 1.7074 µs]
mostly-ascii/std        time:   [2.0350 µs 2.0476 µs 2.0622 µs]
long/fast               time:   [57.030 ns 57.435 ns 57.946 ns]
long/std                time:   [90.237 ns 90.664 ns 91.157 ns]
medium/fast             time:   [16.943 ns 17.020 ns 17.118 ns]
medium/std              time:   [23.788 ns 24.051 ns 24.410 ns]
short/fast              time:   [7.9747 ns 8.0227 ns 8.0785 ns]
short/std               time:   [10.634 ns 10.698 ns 10.771 ns]
short-utf8/fast         time:   [23.560 ns 23.667 ns 23.786 ns]
short-utf8/std          time:   [25.010 ns 25.167 ns 25.355 ns]

It's also slightly different code in there (you should ensure that work you've done in std is synced back to your benchmark repo before running benchmarks you share here).


Some other thoughts:

  1. Current benchmarks aren't apples-to-apples:

    • Your version returns a bool, you need to include error information like std's does (fix your bugs around it first, though 😉).
    • You shouldn't have #[inline] on your implementation, since std's version won't be inlined (and we probably don't want to change that -- it's big and called too many places). Given that missing inline in std will definitely mean it can't be inlined, a #[inline(never)] might be fairest...
  2. The CI failures indicate the UTF-8 error information isn't correctly computed (which people rely on for replacing errors and such, so it's of course non-optional).

    After fixing the issue, you should steal the tests I wrote for the other PR, and make sure they pass. Possibly consider doing a bit of fuzzing to ensure there are no lurking bugs. This is pretty complex and while most of it looks correct... yeah.

  3. The special-cases for 8-word, 4-word, and 2-word blocks does make me worry in a couple ways:

    1. First off I'm not 100% convinced it won't do worse in practice than in the benchmark. Adding extra branches like this can often hurt the branch predictor in real-world code, but look fine in microbenchmarks, due to the fact that they run the same input over and over (not to mention issues with icache for loop bodies, which is quite hard to measure...).

    2. The extra code and complexity generated from this might be a bit much -- the numbers need to be really great to justify adding a 12x-unrolled (8+4) ASCII word check (on top of the 2x-unrolled version we already have).

    Note that code size in libcore is something we've been more concerned about recently (the Linux kernel in particular has been nagging us about it, but it's always been a lurking issue). I'm unsure we want to add a version that's complex and heavily unrolled (generally we can frequently improve performance on large inputs by unrolling more aggressively, but there are also many reasons to be cautious with going too far with it).

    That said, re: code size, I haven't looked at the asm. It's possible that it cleans up to something pretty small, in which case that aspect is not a concern really.

The benchmarks measure a best-case scenario (one that is unrealistically good), so any regression will probably be worse than it looks, and any optimization won't be as good.
This is indeed valid criticism, although I have to check what criterion's methodology for running benchmarks is and if there is perhaps a way to force "cold" runs exclusively

It's not intended as criticism, it's just a reason we have to be careful around regressions that are significant. There's no criterion option for this -- the only benchmark suite I've seen that tries to account for branch mispredictions is the one in highwayhash, and that's part of the core of its design, not really something that can be handled by a flag (since it's more complex than forcing cold runs). In general this approach is hard to apply to something like utf-8 validation.

Concretely, I said it just to note that you should be concerned about most regressions (since we have reason to be picky about them), and about constructs that are likely to look better in a benchmark than they perform in real life.


/// Returns `true` if any one block is not a valid ASCII byte.
#[inline(always)]
const fn has_non_ascii_byte<const N: usize>(block: &[usize; N]) -> bool {
Member

This is quite gross, TBH...

@oliver-giersch
Contributor Author

I ran your benchmarks myself (only once each, stripped warnings about outliers, may not be representative). [...]

It's also slightly different code in there (you should ensure that work you've done in std is synced back to your benchmark repo before running benchmarks you share here).

Agreed, the side repo is a bit messy right now, I will try to clean this up and make sure master is in sync with the PR.

Some other thoughts:

1. Current benchmarks aren't apples-to-apples:
   
   * Your version returns a bool, you need to include error information like std's does (fix your bugs around it first, though 😉).
   * You shouldn't have `#[inline]` on your implementation, since `std`'s version won't be inlined (and we probably don't want to change that -- it's big and called too many places). Given that missing inline in std will definitely mean it can't be inlined, a `#[inline(never)]` might be fairest...

Also fair.

2. The CI failures indicate the UTF-8 error information isn't correctly computed (which people rely on for replacing errors and such, so it's of course non-optional).
   After fixing the issue, you should should steal [the tests I wrote for the other PR](https://gist.github.com/thomcc/acd2cc498e1c4c623176ea934b171a88), and make sure they pass. Possibly consider doing a bit of fuzzing to ensure there are no lurking bugs. This is pretty complex and while most of it looks correct... yeah.

From what I can tell so far it's not wrong, it just reports errors differently. The upstream implementation will try to validate further bytes even if there are not enough bytes left in the string for the expected char width, whereas my implementation fails earlier, since it validates the length right at the beginning.

This was not intended, though, and I will have to see how to deal with this. Intuitively, failing fast seems to make more sense, because it is less work, but it is also a breaking change and thereby might not be desirable.

3. The special-cases for 8-word, 4-word, and 2-word blocks does make me worry in a couple ways:
   
   1. First off I'm not 100% convinced it won't do worse in practice than in the benchmark. Adding extra branches like this can often hurt the branch predictor in real-world code, but look fine in microbenchmarks, due to the fact that they run the same input over and over (not to mention issues with icache for loop bodies, which is quite hard to measure...).
   2. The extra code and complexity generated from this might be a bit much -- the numbers need to be really great to justify adding 12x-unrolled (8+4) ASCII word check (on top of the 2x-unrolled version we already have).

The added loops definitely hurt performance (slightly, although only in the micro-benchmarks of course) for short strings, since the added checks (loop conditions) eat up runtime without providing benefits. It might make sense to have different versions for "short" and "long" strings, which separate at some length threshold that is TBD. This would be even worse for code size, so a reasonable balance has to be found.

I also want to try writing a benchmark that measures validating a long list of short strings individually, where each string is validated only once (cold) but there are enough strings to drown out the noise.
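
Something along these lines, perhaps (a sketch assuming criterion and purely synthetic inputs, just to illustrate the idea):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Validate a large set of distinct short strings per iteration, so the
// branch predictor and caches cannot lock onto a single repeated input.
fn bench_many_short(c: &mut Criterion) {
    let inputs: Vec<Vec<u8>> = (0..10_000)
        .map(|i| format!("short ASCII string #{i}").into_bytes())
        .collect();
    c.bench_function("std/many-short-ascii", |b| {
        b.iter(|| {
            for input in &inputs {
                black_box(std::str::from_utf8(black_box(input.as_slice())).is_ok());
            }
        })
    });
}

criterion_group!(benches, bench_many_short);
criterion_main!(benches);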

   Note that code size in libcore is something we've been more concerned about recently (the Linux kernel in particular has been nagging us about it, but it's always been a lurking issue). I'm unsure we want to add a version that's complex and heavily unrolled (generally we can frequently improve performance on large inputs by unrolling more aggressively, but there are also many reasons to be cautious with going too far with it).
   That said, re:code size, I haven't looked at the asm. It's possible that cleans up to something pretty small, in which case that aspect is not a concern really.

This is a reasonable concern as well; I will have to check and compare the generated code sizes. I am not sure how large a (consistent) runtime benefit would have to be to offset a larger instruction cache footprint, but I'd wager that for such a fundamental function even a small improvement would be worth it. The emphasis, of course, is on consistent here.

@thomcc
Member

thomcc commented May 9, 2023

From what I can tell so far it's not wrong, it just reports errors differently.

The error length we return is the number of bytes that make up a maximal prefix of a valid UTF-8 code unit sequence. This is pretty much what everybody uses for this at this point, and is also how many bytes are replaced by the replacement character in things like from_utf8_lossy. You're correct that it's not the only value that could be chosen (see https://unicode.org/review/pr-121.html), but it's not a thing we can change now, even if we wanted to (we do not; it's actually the best choice for various reasons, even if it's a bit of a pain to compute).
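
Concretely (using today's std, just to illustrate the two cases a validator has to distinguish):

fn main() {
    // Truncated 3-byte sequence at the end of the input ('€' is E2 82 AC):
    // error_len() is None, because more input could still complete it.
    let truncated = [b'a', 0xE2, 0x82];
    let err = std::str::from_utf8(&truncated).unwrap_err();
    assert_eq!(err.valid_up_to(), 1);
    assert_eq!(err.error_len(), None);

    // A lead byte followed by a non-continuation byte: only the maximal
    // prefix of a valid sequence (here just the 0xE2) is reported, and that
    // is the span from_utf8_lossy replaces with U+FFFD.
    let invalid = [b'a', 0xE2, b'b'];
    let err = std::str::from_utf8(&invalid).unwrap_err();
    assert_eq!(err.valid_up_to(), 1);
    assert_eq!(err.error_len(), Some(1));
}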

@Dylan-DPC
Member

@oliver-giersch any updates? thanks

@Dylan-DPC
Member

Closing this as inactive. Feel free to reopen this PR or create a new PR if you get the time to work on this. Thanks

@Dylan-DPC Dylan-DPC closed this Oct 25, 2023
@Dylan-DPC Dylan-DPC added S-inactive Status: Inactive and waiting on the author. This is often applied to closed PRs. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Oct 25, 2023