Add native UTF-8 Validation using fast shift based DFA #47880
Conversation
Right now this isn't working; I am working to resolve the test failures.
Just a further update: it looks like the state table from the reference I had is wrong. I am working on rebuilding it; it should be ready tomorrow.
This is now passing all tests on my system. I don't know why Buildkite is failing; could someone relaunch the tests for me?
GitHub is broken :)
Is the PR the right place to document the build process for the state machine table?
I'd put the table information in a comment in the code.
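For reference while reading the thread, here is one way such a state table can be built. This is an illustrative sketch only, not the table-generation code from this PR; the state names, offsets, and function name are assumptions, but the byte ranges follow RFC 3629. Each of the nine DFA states is assigned a 6-bit field offset, and for every input byte all nine next-state offsets are packed into a single 64-bit row.

```julia
# Nine states, each identified by a bit offset that is a multiple of 6, so the
# nine 6-bit next-state fields for one input byte pack into a single UInt64 row.
const ACCEPT, CS1, CS2, CS3, P3A, P3B, P4A, P4B, REJECT =
    0, 6, 12, 18, 24, 30, 36, 42, 48

function make_transition_table()
    table = zeros(UInt64, 256)
    for byte in 0x00:0xff, state in 0:6:48
        next = REJECT                    # anything not listed below is invalid
        if state == ACCEPT               # expecting a lead byte
            next = byte <= 0x7f                                 ? ACCEPT :
                   0xc2 <= byte <= 0xdf                         ? CS1 :
                   byte == 0xe0                                 ? P3A :
                   0xe1 <= byte <= 0xec || 0xee <= byte <= 0xef ? CS2 :
                   byte == 0xed                                 ? P3B :
                   byte == 0xf0                                 ? P4A :
                   0xf1 <= byte <= 0xf3                         ? CS3 :
                   byte == 0xf4                                 ? P4B : REJECT
        elseif state == CS1              # one continuation byte left
            next = 0x80 <= byte <= 0xbf ? ACCEPT : REJECT
        elseif state == CS2              # two continuation bytes left
            next = 0x80 <= byte <= 0xbf ? CS1 : REJECT
        elseif state == CS3              # three continuation bytes left
            next = 0x80 <= byte <= 0xbf ? CS2 : REJECT
        elseif state == P3A              # after 0xE0: excludes overlong 3-byte forms
            next = 0xa0 <= byte <= 0xbf ? CS1 : REJECT
        elseif state == P3B              # after 0xED: excludes surrogates
            next = 0x80 <= byte <= 0x9f ? CS1 : REJECT
        elseif state == P4A              # after 0xF0: excludes overlong 4-byte forms
            next = 0x90 <= byte <= 0xbf ? CS2 : REJECT
        elseif state == P4B              # after 0xF4: excludes values above U+10FFFF
            next = 0x80 <= byte <= 0x8f ? CS2 : REJECT
        end                              # state == REJECT stays REJECT (absorbing)
        table[byte + 1] |= UInt64(next) << state
    end
    return table
end
```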
I think that, with the final documentation of the methodology, it is ready to go.
@stevengj & @StefanKarpinski (sorry to drag you into this, but I think you did a lot of the initial strings work) Initial testing seemed to indicate that
Happy to be pulled in! It certainly seems cleaner to use
The implementation has changed to using
@stevengj, @StefanKarpinski & @oscardssmith I have reversed the order of operations on the shift DFA, which leads to a state that is never dirty (i.e. you would have to  Also the use of  With this PR the processing rate is about 0.5 bytes/cycle for short strings (word to line length) and 1.5 bytes/cycle for longer strings (file length). This is on par with (short) or better than (long) the C implementation. There are two PRs I plan to follow up with:
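For readers unfamiliar with the technique, here is a minimal sketch of the shift-DFA transition being discussed (assumed names, not the code in this PR, and without the chunking or operation-reordering details). It reuses the table and offsets from the sketch above; each input byte costs one table load and one shift.

```julia
# Shift-DFA step: the current state *is* a bit offset into the 64-bit row, so the
# next state is simply the row shifted right by the current state. The `& 63` is
# effectively free on x86-64, where variable shift counts are masked to 6 bits.
@inline dfa_step(state, byte, table) = table[byte + 1] >> (state & 63)

function utf8_dfa_isvalid(bytes::AbstractVector{UInt8}, table::Vector{UInt64})
    state = UInt64(ACCEPT)           # ACCEPT == 0 in the table sketch above
    @inbounds for b in bytes
        state = dfa_step(state, b, table)
    end
    return (state & 63) == ACCEPT    # valid iff we end on a character boundary
end

# e.g. utf8_dfa_isvalid(codeunits("héllo"), make_transition_table()) == true
#      utf8_dfa_isvalid([0xed, 0xa0, 0x80], make_transition_table()) == false  # surrogate
```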
Let me know if there is anything else that needs to be done to this.
Everything seems ready to go.
This is great work. Now that the implementation has moved to Julia, I think we need much more comprehensive tests of the functionality. Previously we could just assume that
Now that everything goes through the DFA, would you agree that if the tests validate the DFA, we can assume
Tests have been added to validate that the DFA state machine returns states as expected, per the Unicode spec.
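One shape such a test can take (an illustrative sketch only, not the tests added by this PR; `reference_isvalid` is a hypothetical helper transcribed from the RFC 3629 well-formed byte ranges) is to exhaustively compare `isvalid(String, ...)` against an independent reference validator for all short byte sequences:

```julia
using Test

# Hypothetical reference validator transcribed from the RFC 3629 byte-range table.
function reference_isvalid(v::Vector{UInt8})
    i, n = 1, length(v)
    while i <= n
        b = v[i]
        if b <= 0x7f
            i += 1
            continue
        end
        # (sequence length, allowed range of the *first* continuation byte)
        spec = 0xc2 <= b <= 0xdf ? (2, 0x80, 0xbf) :
               b == 0xe0         ? (3, 0xa0, 0xbf) :
               0xe1 <= b <= 0xec ? (3, 0x80, 0xbf) :
               b == 0xed         ? (3, 0x80, 0x9f) :
               0xee <= b <= 0xef ? (3, 0x80, 0xbf) :
               b == 0xf0         ? (4, 0x90, 0xbf) :
               0xf1 <= b <= 0xf3 ? (4, 0x80, 0xbf) :
               b == 0xf4         ? (4, 0x80, 0x8f) : nothing
        spec === nothing && return false
        len, lo, hi = spec
        i + len - 1 <= n || return false
        lo <= v[i + 1] <= hi || return false
        all(j -> 0x80 <= v[j] <= 0xbf, (i + 2):(i + len - 1)) || return false
        i += len
    end
    return true
end

@testset "DFA-based validation agrees with the reference" begin
    for a in 0x00:0xff
        @test isvalid(String, [a]) == reference_isvalid([a])
        for b in 0x00:0xff               # every 2-byte sequence
            @test isvalid(String, [a, b]) == reference_isvalid([a, b])
        end
    end
end
```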
Force-pushed from 935c46d to 8006d60
@mkitti The rebase is done at this point, but thank you. Does anyone know why these tests are failing?
I'm guessing the test failures are unrelated. Could be a problem on master or maybe CI. @oscardssmith do we want to wait until master can pass tests, or is this mergeable now?
They look unrelated. I'm happy to merge as-is if no one objects in the next day or two.
@mkitti & @oscardssmith this is a little off topic, but when I rebase I normally just merge master into my local master and rebase onto that. I have had bad luck recently in picking masters that seem to fail tests. So the question is: is there a tagged master which is always passing all tests?
Not really. In theory, master is supposed to pass tests, but in practice that sometimes doesn't happen.
Master does seem to be passing again... should we rerun the failing tests?
Co-authored-by: Steven G. Johnson <stevenj@mit.edu>
Is the REPL failure related to this? I am sure it is text heavy, but I don't see any of the changed functions in this PR returning a Union.
I'm slightly scared by this, so I'm re-running CI.
Looks like it cleared.
Exciting times! I was considering how to review this, but it's well tested and I think we'll notice if string stuff is broken.
* Working Native UTF-8 Validation

---------

Co-authored-by: Oscar Smith <oscardssmith@gmail.com>
Co-authored-by: Steven G. Johnson <stevenj@mit.edu>
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

Take 2 of rust-lang#107760 (cc @thomcc)

### Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in rust-lang#107760,

> For prior art: shift-DFAs are now used for UTF-8 validation in [PostgreSQL](https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1753), and seems to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level[1](rust-lang#107760 (comment)).

### Rationales

1. Performance: This algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.
2. Generality: It does not use SIMD instructions and does not rely on the branch predictor for good performance, so it works well as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

### Implementation details

I use the ordinary UTF-8 language definition from [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629#section-4) and directly translate it into a 9-state DFA. The compressed state is 64-bit, resulting in a table of `[u64; 256]`, or 2KiB of rodata.

The main algorithm consists of the following parts:

1. Main loop: take a chunk of `MAIN_CHUNK_SIZE = 16` bytes on each iteration, execute the DFA on the chunk, and check whether the state is in ERROR once per chunk.
2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII and stop at the first chunk containing any non-ASCII bytes. I choose `ASCII_CHUNK_SIZE = 16` to align with the current implementation, which checks 16 bytes at a time for non-ASCII, to encourage LLVM to auto-vectorize it.
3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep it simple, if any error is encountered in the main loop, the erroneous chunk is discarded and we `break` into this path to find the precise error location. That is, the erroneous chunk, if it exists, will be traversed twice, in exchange for a tighter and more efficient hot loop.

There are also some small tricks being used:

1. Since we have i686-linux in Tier 1 support, and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. It shows a 200%+ speed-up compared to the 64-bit-shift version.
2. We still need to get the UTF-8 encoded length from the first byte in `utf8_char_width`. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables. It does introduce an extra 32-bit shift. I believe it's almost free, but I have not benchmarked it yet.

### Benchmarks

I made an [out-of-tree implementation repository](https://github.com/oxalica/shift-dfa-utf8) for easier testing and benching. It also tests various `MAIN_CHUNK_SIZE` (m) and `ASCII_CHUNK_SIZE` (a) configurations.

Bench data are taken from the first 4KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article [William Shakespeare in en](https://en.wikipedia.org/wiki/William_Shakespeare), [es](https://es.wikipedia.org/wiki/William_Shakespeare) and [zh](https://zh.wikipedia.org/wiki/%E5%A8%81%E5%BB%89%C2%B7%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A).

In short: with m=16, a=16, shift-DFA gives -43% on en, +53% on es, +133% on zh; with m=16, a=32, it gives -9% on en, +26% on es, +33% on zh. This is expected: the larger the ASCII bypass chunk, the better it performs on ASCII but the worse on mixed content like es, because the taken branch keeps flipping. To me, the difference between 27GB/s and 47GB/s on en is minor in absolute time (144.61ns - 79.86ns = 64.75ns), compared to 476.05ns - 392.44ns = 83.61ns on es. So I currently chose m=16, a=16 in the PR.

On x86_64-linux, Ryzen 7 5700G @ 3.775GHz:

| Algorithm | Input language | Throughput / (GiB/s) |
|-------------------|----------------|----------------------|
| std | en | 47.768 +-0.301 |
| shift-dfa-m16-a16 | en | 27.337 +-0.002 |
| shift-dfa-m16-a32 | en | 43.627 +-0.006 |
| std | es | 6.339 +-0.010 |
| shift-dfa-m16-a16 | es | 9.721 +-0.014 |
| shift-dfa-m16-a32 | es | 8.013 +-0.009 |
| std | zh | 1.463 +-0.000 |
| shift-dfa-m16-a16 | zh | 3.401 +-0.002 |
| shift-dfa-m16-a32 | zh | 3.407 +-0.001 |

### Unresolved

- [ ] Benchmark on aarch64-darwin, another Tier 1 target. I don't have a machine to play with.
- [ ] Decide the chunk size parameters. I'm currently picking m=16, a=16.
- [ ] Should we also replace the implementation of [lossy conversion](https://github.com/oxalica/rust/blob/c0639b8cad126d886ddd88964f729dd33fb90e67/library/core/src/str/lossy.rs#L194) by calling the new validation function? It has very similar code doing almost the same thing.
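For illustration, the chunked main loop and ASCII bypass described above might look roughly like the following. This is a sketch written in Julia to match the rest of this thread, not the Rust code; it reuses the table and offsets from the earlier sketches, the chunk size simply mirrors m=16, and error-location reporting is omitted.

```julia
const MAIN_CHUNK_SIZE = 16    # assumed; mirrors the m = 16 configuration above

# Reuses `table` from the earlier sketch, with 0 as the ACCEPT offset and 48 as
# the absorbing REJECT offset.
function validate_utf8_chunked(bytes::Vector{UInt8}, table::Vector{UInt64})
    state, i, n = UInt64(0), 1, length(bytes)
    @inbounds while i + MAIN_CHUNK_SIZE - 1 <= n
        chunk = i:(i + MAIN_CHUNK_SIZE - 1)
        if (state & 63) == 0 && all(b -> b < 0x80, view(bytes, chunk))
            i += MAIN_CHUNK_SIZE          # ASCII bypass: not mid-sequence and the
            continue                      # whole chunk is plain ASCII
        end
        for j in chunk                    # run the DFA over the chunk ...
            state = table[bytes[j] + 1] >> (state & 63)
        end
        (state & 63) == 48 && return false   # ... and check ERROR once per chunk,
        i += MAIN_CHUNK_SIZE                 # which suffices since REJECT is absorbing
    end
    @inbounds for j in i:n                # trailing bytes, one DFA step at a time
        state = table[bytes[j] + 1] >> (state & 63)
        (state & 63) == 48 && return false
    end
    return (state & 63) == 0              # must end on a character boundary
end
```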
This is based on the discussion in #41533. It is a Julia implementation of a shift-based DFA, implemented with inspiration from golang/go#47120.
***Edit: the benchmarks have been updated using the code found in ndinsmore/StringBenchmarks.***
Throughput improvement, small strings -> large strings: benchmark comparison of master vs. this PR.
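As a rough idea of how such throughput numbers can be measured (a sketch using BenchmarkTools on ASCII data; this is not the ndinsmore/StringBenchmarks code, and the string sizes are arbitrary):

```julia
using BenchmarkTools, Random

# Throughput of UTF-8 validation on an n-byte ASCII string, in GB/s.
function validation_throughput(n)
    bytes = Vector{UInt8}(randstring(n))
    t = @belapsed isvalid(String, $bytes)     # minimum seconds per call
    return n / t / 1e9
end

for n in (16, 256, 4096, 1 << 20)             # small strings -> large strings
    println(n, " bytes: ", round(validation_throughput(n); digits = 2), " GB/s")
end
```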