Faster UTF-8 string validation #111367
Conversation
(rustbot has picked a reviewer for you, use r? to override)
Hey! It looks like you've submitted a new PR for the library teams!
Be aware that the standard library is precompiled, so many operations which use UTF-8 validation in the standard library will simply not be vectorized to use AVX512 instructions. Instead, they will use SSE2 instructions, or whatever else is in the target definition.

@bors try @rust-timer queue
Awaiting bors try build completion.

@rustbot label: +S-waiting-on-perf
⌛ Trying commit eab17e7 with merge 91111e127332600ac24a69ec23df526b5579a3ef...
Oh dear. Well, I suppose this may be a very short bors try.
I hope you don't mind me lightening your load slightly, Josh. 🙏
💔 Test failed - checks-actions
library/core/src/str/validations.rs
Outdated
```rust
} else {
    // byte is < 128 (ASCII), but pointer is not word-aligned, skip
    // until the loop reaches the next word-aligned block
    for _ in 0..offset {
```
This can just use a `while` like the rest of these cases, as we've removed `~const` traits in the rest of std.
Suggested change:

```diff
-for _ in 0..offset {
+let mut i = 0;
+while i < offset {
+    i += 1;
```
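For context on why the suggestion avoids `for`: iterator-based `for` loops are not usable in a stable `const fn`, so const code falls back to manual `while` loops. A minimal standalone sketch (the function name is invented for illustration, not from the PR):

```rust
// `for _ in 0..offset` desugars to `Iterator::next` calls, which are not
// allowed in a stable `const fn`; a manual `while` loop works instead.
// `skip_bytes` is a hypothetical name, not from the PR.
const fn skip_bytes(offset: usize) -> usize {
    let mut i = 0;
    while i < offset {
        // body that would process one byte per iteration
        i += 1;
    }
    i
}

const SKIPPED: usize = skip_bytes(7);

fn main() {
    assert_eq!(SKIPPED, 7);
}
```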
Yeah, sorry for missing that one. Fixed in the latest commit.
Note that there's another PR open: #107760
Have you tried benching a simpler version that doesn't use the 4-word and 8-word blocks, or indeed just skips "words" altogether, and instead uses 16-byte alignment unconditionally? We have several 32-bit targets that still have the capability to use 128-bit hardware vectors and still prefer vector alignment when doing so. This kind of ASCII-specific fast-path should be able to do more than a 2x speedup, though it might be hard to accomplish when working mostly with autovectorization.
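To illustrate the reviewer's suggestion, here is a sketch (not the PR's code) of an ASCII check over fixed 16-byte chunks, independent of the target's word size; a chunk is pure ASCII iff no byte has its high bit set:

```rust
// Sketch: fixed 16-byte chunks regardless of target word size, as suggested.
// A tight loop over a constant-size chunk is a good autovectorization target
// even on 32-bit targets that have 128-bit vector units.
fn is_ascii_chunks16(bytes: &[u8]) -> bool {
    let mut chunks = bytes.chunks_exact(16);
    for chunk in &mut chunks {
        // All-ASCII iff no byte in the chunk has the 0x80 bit set.
        if chunk.iter().any(|&b| b & 0x80 != 0) {
            return false;
        }
    }
    // Handle the sub-16-byte tail bytewise.
    chunks.remainder().iter().all(|&b| b & 0x80 == 0)
}

fn main() {
    assert!(is_ascii_chunks16(b"the quick brown fox jumps over the lazy dog"));
    assert!(!is_ascii_chunks16("naïve".as_bytes()));
}
```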
```rust
3 => {
    let Some([_, b1, b2]) = subarray(buf, curr) else {
        err!(None);
    };
```
If you want to take the previous suggestion.
Suggested change:

```diff
-3 => {
-    let Some([_, b1, b2]) = subarray(buf, curr) else {
-        err!(None);
-    };
+Some([_, b1, b2]) => {
```
(Sorry I screwed up and accidentally applied the initial suggestion instead of commenting)
This would not compile, since a single call to the (const-generic) `subarray` cannot return arrays of different sizes.
Oh, you're right.
...I'm not sure just slicing wouldn't be better, but fair.
```rust
let mut vector = [0; N];

let mut i = 0;
while i < N {
    vector[i] = block[i] & NONASCII_MASK;
    i += 1;
}

i = 0;
while i < N {
    if vector[i] > 0 {
        return true;
    }
    i += 1;
}

false
```
Suggested change:

```diff
-let mut vector = [0; N];
-let mut i = 0;
-while i < N {
-    vector[i] = block[i] & NONASCII_MASK;
-    i += 1;
-}
-i = 0;
-while i < N {
-    if vector[i] > 0 {
-        return true;
-    }
-    i += 1;
-}
-false
+let mut i = 0;
+while i < N {
+    if block[i] & NONASCII_MASK > 0 {
+        return true;
+    }
+    i += 1;
+}
+false
```
Just curious if doing this in one block is sufficient or if I missed something.
Yes, it is functionally sufficient, but I have found that the "simple" version does not get auto-vectorized, at least on x86-64, and runs much slower. On aarch64, the generated code actually ran faster, cf. my initial comment. Of course, it is not a very nice solution to conform to the particular optimizing abilities of the current version of a code-generation backend, but the alternative would be writing explicit SIMD code.
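For reference, the fragments quoted in this thread can be assembled into one compilable unit. The mask constant is reconstructed here under the assumption that it marks the high bit of every byte in a word; this is a sketch, not the PR's exact code:

```rust
// High bit of every byte in a word; a non-zero result after masking means
// the word contains at least one non-ASCII byte.
const NONASCII_MASK: usize = usize::from_ne_bytes([0x80; std::mem::size_of::<usize>()]);

// Two-pass form: first mask all words into a temporary array, then scan it.
// As discussed above, this split is what lets LLVM autovectorize on x86-64.
const fn has_non_ascii_byte<const N: usize>(block: &[usize; N]) -> bool {
    let mut vector = [0; N];
    let mut i = 0;
    while i < N {
        vector[i] = block[i] & NONASCII_MASK;
        i += 1;
    }
    i = 0;
    while i < N {
        if vector[i] > 0 {
            return true;
        }
        i += 1;
    }
    false
}

fn main() {
    // A block of ASCII 'a' bytes has no high bits set.
    let ascii_word = usize::from_ne_bytes([b'a'; std::mem::size_of::<usize>()]);
    assert!(!has_non_ascii_byte(&[ascii_word; 4]));
    // A single 0x80 byte anywhere trips the mask.
    assert!(has_non_ascii_byte(&[0, 0, 0x80, 0]));
}
```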
Note that the standard library is currently always compiled in advance¹ for the target's baseline set of target features (which is SSE2 for almost all the x86_64 targets).

Anyway, it's honestly really hard to parse the benchmark numbers as they are now. Can you please provide the full result set (for a single x86_64 machine, perhaps linking to a gist containing the aarch64 results) in one place? The criterion terminal output quoted in plaintext is probably fine.
So, with the caveat that there's no hard-and-fast rule here (on its own, any regression percentage could be acceptable given what it's measuring and how much of a speedup it comes with)... a 17.5% performance hit for any case is almost certainly going to be a dealbreaker.

Basically, for very tiny inputs which are noisy, it'd be okay to regress a couple percent, and totally artificial benches can be fine to regress if they're longer (deliberately pathological worst-cases, for example). But for realistic inputs which are long enough not to be dominated by noise... I don't think regressing them at all is really on the table, even if it comes with other optimizations. Maybe if it's a huge win...

That might sound harsh, but keep in mind that due to cache and branch-prediction effects, your optimization will almost always perform better in a benchmark than in real life. The benchmarks measure a best-case scenario (one that is unrealistically good), so any regression will probably be worse than it looks, and any optimization won't be as good.

That said, this isn't a hard-and-fast rule; it's definitely case by case (for example, if you could somehow magically make ASCII 100x faster in a wide range of common scenarios, then we're very likely to take some regressions elsewhere). Feel free to submit stuff you aren't sure about.
Unfortunately, as they are now, these benchmarks are not nearly sufficient to justify this change. (I'll try to find where my repo of benchmarks for this very function went; I can't believe I didn't at least put it on github... ⚰️)

Currently, the benchmarks contain (may be off by one or two on the byte counts, not the point):
Notably missing here are more sizes (2kb is not large, you should go up to maybe 10MB or so) and way more kinds of mixed-ASCII inputs (these are important and really easy to regress when optimizing for ASCII), as well as entirely non-ASCII inputs. I also ideally would like stuff around different alignments (the current code optimizes for this), but let's start with this.

Also note that this is before considering the implementation details; these are pretty much the baseline I think we need to justify an ASCII-specific optimization to such a performance-critical function. (Note that I'm pretty sorry that we don't just have good benchmarks you can easily run. This is A Lot to lay on a contributor... I'll try to dig up the ones I wrote for #107760.) The sizes are pretty simple² to add, but mixed-ASCII can be really hard.

More mixed-ASCII benches

It's very likely that this changes the performance of real-world mixed-ASCII inputs, which are extremely common. It's hard to generate accurate examples of this randomly, so using real-world data is kinda a must.
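As a concrete starting point, here is a minimal hand-rolled harness sketch (not criterion, and the inputs are synthetic stand-ins for the real-world corpora requested above) showing size-varied, clearly labeled cases:

```rust
use std::time::Instant;

// Minimal sketch of a labeled benchmark matrix: operation/input-kind/size.
// Real measurements should use criterion and real-world mixed-ASCII corpora;
// the sizes, labels, and iteration count here are illustrative only.
fn bench(label: &str, input: &[u8]) {
    let start = Instant::now();
    let mut ok = true;
    for _ in 0..16 {
        ok &= std::str::from_utf8(std::hint::black_box(input)).is_ok();
    }
    println!("{label}: {:?} (all valid: {ok})", start.elapsed());
}

fn main() {
    for size in [16usize, 2 * 1024, 1024 * 1024] {
        let ascii = vec![b'a'; size];
        bench(&format!("from_utf8/ascii/{size}B"), &ascii);

        // Mostly ASCII with occasional two-byte chars; pop trailing bytes
        // until we land on a char boundary so the input stays valid.
        let mut mixed: Vec<u8> = "abcdefg é ".bytes().cycle().take(size).collect();
        while std::str::from_utf8(&mixed).is_err() {
            mixed.pop();
        }
        bench(&format!("from_utf8/mixed/{}B", mixed.len()), &mixed);
    }
}
```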
This is not a complete list, and it's okay to decide to do your own thing instead; it's just an example of what I'd recommend, not a hard requirement for what you must measure (if you can show it doesn't regress cases like these in another way, that's cool too).

Finally, since this is a lot of things to measure, please label the benchmarks clearly.

Anyway, I have a lot of experience in this area (I even have my own optimization PR³, but have not been working on it recently, so don't take it as a blocker), and a lot of opinions on it, so I'll take this review.

r? @thomcc

Footnotes
(force-pushed from 4c3559d to 07060be)
First, thanks for the very detailed reply!
Got it, I will exclude any
Ok, I will redo the benchmarks and provide plain text results.
I should have qualified this better, all regressions are indeed in the very small string benchmarks (<32 bytes) and all on the Raspberry Pi, which seemed to be inherently more noisy for some reason. Specifically, I refer to this run, where the absolute difference in average runtime is ~7.5ns.
This is indeed valid criticism, although I have to check what criterion's methodology for running benchmarks is and if there is perhaps a way to force "cold" runs exclusively.
These are all very reasonable suggestions and I will try to address them with an updated benchmark suite.
I ran your benchmarks myself (only once each, stripped warnings about outliers, may not be representative).
It's also slightly different code in there (you should ensure that work you've done in std is synced back to your benchmark repo before running benchmarks you share here). Some other thoughts:
It's not intended as criticism; it's just a reason we have to be careful around regressions that are significant.

There's no criterion option for this -- the only benchmark suite I've seen that tries to account for branch mispredictions is the one in highwayhash, and that's part of the core of its design, not really something that can be handled by a flag (since it's more complex than forcing cold runs). In general this approach is hard to apply to something like UTF-8 validation.

Concretely, I said it just to note that you should be concerned about most regressions (since we have reason to be picky about them), and about constructs that are likely to look better in a benchmark than they perform in real life.
```rust
/// Returns `true` if any one block is not a valid ASCII byte.
#[inline(always)]
const fn has_non_ascii_byte<const N: usize>(block: &[usize; N]) -> bool {
```
This is quite gross, TBH...
Agreed, the side repo is a bit messy right now, I will try to clean this up and make sure master is in sync with the PR.
Also fair.
From what I can tell so far it's not wrong, it just reports errors differently. The upstream implementation will try to validate further bytes even if there are not enough bytes left in the string for the expected multi-byte sequence. This was not intended, though, and I will have to see how to deal with this. Intuitively, failing fast seems to make more sense, because it is less work, but it is also a breaking change and thereby might not be desirable.
The added loops definitely hurt performance (slightly, although only in the micro-benchmarks of course) for short strings, since the added checks (loop conditions) eat up runtime without providing benefits. It might make sense to have different versions for "short" and "long" strings, which separate at some length threshold that is TBD. This would be even worse for code size, so a reasonable balance has to be found. I also want to try writing a benchmark that measures validating a long list of short strings individually and run this only once (cold) but sufficiently often to drown out noise.
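The short/long split described here could look like the following sketch. The `SHORT_LIMIT` value is an invented placeholder, not a tuned threshold, and only the ASCII check is shown:

```rust
const SHORT_LIMIT: usize = 32; // invented placeholder, not a tuned value
const WORD: usize = std::mem::size_of::<usize>();

fn all_ascii(bytes: &[u8]) -> bool {
    if bytes.len() < SHORT_LIMIT {
        // Short input: plain byte loop, no alignment/setup overhead.
        return bytes.iter().all(|&b| b < 0x80);
    }
    // Long input: word-at-a-time check with a bytewise tail.
    let mask = usize::from_ne_bytes([0x80; WORD]);
    let mut chunks = bytes.chunks_exact(WORD);
    for chunk in &mut chunks {
        if usize::from_ne_bytes(chunk.try_into().unwrap()) & mask != 0 {
            return false;
        }
    }
    chunks.remainder().iter().all(|&b| b < 0x80)
}

fn main() {
    assert!(all_ascii(b"short"));
    assert!(all_ascii(&[b'x'; 1000]));
    assert!(!all_ascii("long string with a non-ASCII char somewhere: ü".as_bytes()));
}
```

The trade-off mentioned above applies directly: the two paths roughly double the code size of the function, so the threshold would need benchmarks of its own to justify.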
This is a reasonable concern as well, I will have to check and compare the generated code sizes. I am not sure how large a (consistent) runtime benefit would have to be to offset a larger instruction cache footprint, but I'd wager that for such a fundamental function even a small improvement would be worth it. But the emphasis, of course, is on consistent here.
The error length we return is the number of bytes that make up a maximal prefix of a valid UTF-8 code unit sequence. This is pretty much what everybody uses for this at this point, and is also how many bytes are replaced by the replacement character in lossy conversions.
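This convention can be observed directly on stable `std::str::from_utf8`; a small sketch:

```rust
fn main() {
    // 0xE2 0x82 starts a 3-byte sequence (e.g. '€' is E2 82 AC) but the
    // input ends early: error_len() is None ("more input might fix it").
    let truncated = [0x61, 0xE2, 0x82];
    let err = std::str::from_utf8(&truncated).unwrap_err();
    assert_eq!(err.valid_up_to(), 1); // only the leading 'a' is valid
    assert_eq!(err.error_len(), None);

    // Same prefix followed by an invalid continuation byte 0x28 ('('):
    // E2 82 is the maximal prefix of a valid sequence, so error_len is 2.
    let invalid = [0x61, 0xE2, 0x82, 0x28];
    let err = std::str::from_utf8(&invalid).unwrap_err();
    assert_eq!(err.valid_up_to(), 1);
    assert_eq!(err.error_len(), Some(2));

    // Lossy conversion replaces exactly those 2 bytes with one U+FFFD.
    assert_eq!(String::from_utf8_lossy(&invalid), "a\u{FFFD}(");
}
```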
@oliver-giersch any updates? thanks
Closing this as inactive. Feel free to reopen this PR or create a new PR if you get the time to work on this. Thanks
This PR rewrites much of the UTF-8 validation code, optimizing it in particular for validating ASCII strings. This does not mean there is an artificial bias towards pure ASCII strings, just that this path is particularly optimized. These optimizations come from 2 primary directions:
With this approach, up to `2 * mem::size_of::<usize>() - 1` bytes have to be checked twice. In the new version, the same block is checked again using word-wise operations, so that fewer bytes have to be checked individually. This is particularly important due to the larger block sizes.

The code for checking detected non-ASCII bytes/byte-sequences remains relatively unaltered, so no great differences are to be expected for non-ASCII strings.

I evaluated the changes using 6 benchmarks, validating different string sizes ranging from 27 to ~200k bytes, the results for which can be inspected here. The best performance improvement is ~2x faster, the worst is ~17.5% slower (for short strings on a Raspberry Pi 4, with a relatively small regression in absolute terms).
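The overlapped double-checking at the block boundary can be sketched like this (standalone, not the PR's actual code; only the ASCII check is shown, and the function name is invented):

```rust
// Sketch of overlapped tail checking: rather than validating the final
// sub-word tail byte-by-byte, re-read the last word of the buffer, so up
// to size_of::<usize>() - 1 already-validated bytes are checked twice.
fn tail_is_ascii(bytes: &[u8]) -> bool {
    const WORD: usize = std::mem::size_of::<usize>();
    if bytes.len() < WORD {
        // Too short for a whole word: fall back to a byte loop.
        return bytes.iter().all(|&b| b < 0x80);
    }
    // Last WORD bytes, overlapping the previously validated region.
    let tail: [u8; WORD] = bytes[bytes.len() - WORD..].try_into().unwrap();
    usize::from_ne_bytes(tail) & usize::from_ne_bytes([0x80; WORD]) == 0
}

fn main() {
    assert!(tail_is_ascii(b"hello world"));
    assert!(!tail_is_ascii("hello wörld".as_bytes()));
}
```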
Since UTF-8 validation is a pretty fundamental operation in Rust, these changes should ideally be evaluated on more platforms, since I could only run these benchmarks on systems I had access to, which were 3 different `x86-64` systems and one `aarch64`. Any help here would definitely be appreciated; the benchmarking code can be found here, in a separate repo.

Note: I found that the performance on `aarch64` could be further improved by using a specialized version of the `has_non_ascii_byte` function like this:

I eventually decided against including this in the PR, since it is fairly arbitrary, probably dependent on which optimization passes are implemented differently by LLVM for different architectures, and I have no idea how this would affect other platforms I have not myself evaluated.