Add native UTF-8 Validation using fast shift based DFA #47880


Merged: 38 commits from ndinsmore:native_utf8_validation into JuliaLang:master on Apr 12, 2023

Conversation

@ndinsmore (Contributor) commented Dec 12, 2022

This is based on the discussion in #41533. It is a Julia implementation of a shift-based DFA, implemented with inspiration from golang/go#47120.
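
For anyone new to the technique, here is a minimal, self-contained sketch of a shift-based DFA in Julia. It validates only a toy subset of UTF-8 (ASCII plus two-byte sequences), so the states and table are far smaller than what this PR builds; the point is the transition step, where a table lookup and a shift replace all branching:

```julia
# Toy shift-based DFA for a *subset* of UTF-8 (ASCII and two-byte
# sequences only); illustrative, not this PR's actual table. Each
# state is a bit offset into a UInt64, so `table[b] >> state` yields
# the next state without branches.
const S_ACCEPT = UInt64(0)   # between characters
const S_CONT1  = UInt64(6)   # expecting one continuation byte
const S_ERR    = UInt64(12)  # invalid input; absorbing state

# Pack the next state for each current state at that state's bit offset.
row(from_accept, from_cont1) =
    (from_accept << S_ACCEPT) | (from_cont1 << S_CONT1) | (S_ERR << S_ERR)

const DFA_TABLE = let t = fill(row(S_ERR, S_ERR), 256)
    for b in 0x00:0x7f; t[b + 1] = row(S_ACCEPT, S_ERR) end  # ASCII
    for b in 0x80:0xbf; t[b + 1] = row(S_ERR, S_ACCEPT) end  # continuation
    for b in 0xc2:0xdf; t[b + 1] = row(S_CONT1, S_ERR)  end  # 2-byte lead
    t
end

function subset_utf8_valid(s::String)
    state = S_ACCEPT
    for b in codeunits(s)
        # One load and one shift per byte; `& 63` keeps the shift count
        # in range, since the high bits of `state` are "dirty".
        state = DFA_TABLE[b + 1] >> (state & 63)
    end
    return (state & 63) == S_ACCEPT
end

subset_utf8_valid("héllo")  # true ('é' is a valid two-byte sequence)
subset_utf8_valid("\xc3")   # false (truncated sequence)
```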

*** Edit: the benchmarks have been updated using the code found in ndinsmore/StringBenchmarks ***

Throughput improvement: small strings -> large strings

  • ASCII: 1.4x -> 20x
  • single nonASCII: 1.1x -> 10x
  • Unicode: 1.2x -> 2.4x

Master:

"isvalid" => 4-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "length 2:64" => Trial(1.873 ms)
	  "length 512:4096" => Trial(636.962 μs)
	  "length 64:512" => Trial(733.949 μs)
	  "length 4096:32768" => Trial(585.777 μs)
  "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "files" => Trial(4.258 ms)
	  "lines" => Trial(7.950 ms)
	  "files SubString" => Trial(4.094 ms)
	  "lines SubString" => Trial(7.714 ms)
  "Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "length 2:64" => Trial(1.991 ms)
	  "length 512:4096" => Trial(1.588 ms)
	  "length 64:512" => Trial(1.658 ms)
	  "length 4096:32768" => Trial(1.561 ms)
  "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	  tags: []
	  "length 2:64" => Trial(1.123 ms)
	  "length 512:4096" => Trial(703.426 μs)
	  "length 64:512" => Trial(750.386 μs)
	  "length 4096:32768" => Trial(730.985 μs)

This PR:

  "isvalid" =>4-element BenchmarkTools.BenchmarkGroup:
    tags: []
    "single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "length 2:64" => Trial(1.757 ms)
	    "length 512:4096" => Trial(164.487 μs)
	    "length 64:512" => Trial(659.963 μs)
	    "length 4096:32768" => Trial(62.674 μs)
    "julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "files" => Trial(1.025 ms)
	    "lines" => Trial(5.355 ms)
	    "files SubString" => Trial(870.075 μs)
	    "lines SubString" => Trial(5.262 ms)
    "Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "length 2:64" => Trial(1.610 ms)
	    "length 512:4096" => Trial(671.044 μs)
	    "length 64:512" => Trial(840.627 μs)
	    "length 4096:32768" => Trial(654.662 μs)
    "ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
	    tags: []
	    "length 2:64" => Trial(794.437 μs)
	    "length 512:4096" => Trial(55.955 μs)
	    "length 64:512" => Trial(177.169 μs)
	    "length 4096:32768" => Trial(37.669 μs)

@ndinsmore (Contributor Author):

Right now this isn't working; I am working to resolve the test failures.

@ndinsmore (Contributor Author):

Just a further update: it looks like the state table from the reference I had is wrong. I am working on rebuilding it; it should be ready tomorrow.

@ndinsmore (Contributor Author):

This is now passing all tests on my system. I don't know why Buildkite is failing; could someone relaunch the tests for me?

@gbaraldi (Member):

GitHub is broken :)

@ndinsmore (Contributor Author):

Is the PR the right place to document the build process for the state machine table?

@oscardssmith (Member):

I'd put the table information in a comment in the code.

@ndinsmore (Contributor Author) commented Dec 13, 2022

I think with the final documentation of the methodology it is ready to go.

@brenhinkeller added the strings, performance, and unicode labels on Dec 18, 2022
@ndinsmore (Contributor Author):

@stevengj & @StefanKarpinski (sorry to drag you into this, but I think you did a lot of the initial strings work)
Some of the requested changes (which are 100% correct) beg the question of whether this should really be using the codeunits interface rather than unsafe_wrap.

Initial testing seemed to indicate that CodeUnits is slower, but I wonder if it would be wise to do it the correct way here and then work on speeding up the CodeUnits array.

@StefanKarpinski (Member):

Happy to be pulled in! It certainly seems cleaner to use codeunits. @JeffBezanson, any idea why that approach would be less efficient?

@ndinsmore (Contributor Author):

The implementation has now changed to use CodeUnits instead of unsafe_wrap. Once GC.@preserve had been added to the unsafe_wrap calls, the benchmarks showed that CodeUnits was faster, and on par with the benchmarks from before the GC changes were made.
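
For reference, a sketch of the two access patterns being compared (the counting loop is just a stand-in for the validation work):

```julia
s = "validate me"

# (a) codeunits: a safe, zero-copy AbstractVector{UInt8} view that
# keeps `s` rooted for as long as the view is reachable.
n1 = count(b -> b >= 0x80, codeunits(s))

# (b) unsafe_wrap: the resulting Vector does NOT keep `s` alive, so
# the string must be explicitly protected with GC.@preserve.
n2 = GC.@preserve s begin
    v = unsafe_wrap(Vector{UInt8}, pointer(s), ncodeunits(s))
    count(b -> b >= 0x80, v)
end
```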

@ndinsmore (Contributor Author):

@stevengj, @StefanKarpinski & @oscardssmith
There have been a good number of changes since the initial PR, but I think it is ready to go.

I have reversed the order of operations on the shift DFA, which leads to a state that is never dirty (i.e. without this you would have to `state & UInt64(63)` before use), and this provides a performance improvement.
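
To make the two orderings concrete, here is a small sketch (illustrative function names, not necessarily the PR's exact expressions):

```julia
# Two equivalent transition steps for a shift-based DFA (illustrative).

# Masking before the shift: the stored `state` keeps junk from the
# other packed states in its high bits ("dirty"), so every other use
# must apply `state & 63` first.
step_dirty(table, state, b) = table[b + 1] >> (state & 63)

# Masking after the shift: the stored `state` always lies in 0:63
# ("never dirty"), so later uses need no mask.
step_clean(table, state, b) = (table[b + 1] >> state) & 63
```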

Also the use of codeunits has dramatically cleaned up the code and made the implementation more maintainable.

With this PR the processing rate is about 0.5 bytes/cycle for short strings (word to line length) and 1.5 bytes/cycle for longer strings (file length). This is on par with (short) or better than (long) the C implementation.

There are two PRs I plan to follow up with:

  1. Broader use of the DFA throughout strings.jl, which should simplify the implementation of length, iterate, nextind, and basically anything else that parses through the UTF-8 bytes.
  2. A heuristic approach to short vs. long strings. The current implementation can likely improve short strings by 50%, but for long strings it is fairly easy to get to 10 bytes/cycle for UTF-8, and even better, 50 bytes/cycle, for ASCII-only strings (a rough sketch of such an ASCII fast path follows this list).
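
As a rough illustration of the ASCII fast path alluded to in item 2, here is a sketch; the chunk size and function name are assumptions for illustration, not this PR's code:

```julia
# Sketch of an ASCII fast path: skip whole chunks of bytes, stopping at
# the first chunk that contains a non-ASCII byte. The fixed-size
# reduction is easy for LLVM to auto-vectorize.
const CHUNK = 16  # illustrative chunk size

function skip_ascii(bytes::AbstractVector{UInt8}, i::Int)
    while i + CHUNK - 1 <= length(bytes)
        any(b -> b >= 0x80, view(bytes, i:i+CHUNK-1)) && break
        i += CHUNK
    end
    return i  # index at which the DFA should resume
end
```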

Let me know if there is anything else that needs to be done to this.

@ndinsmore (Contributor Author):

Everything seems ready to go.

@StefanKarpinski added the needs tests label on Jan 27, 2023
@StefanKarpinski (Member):

This is great work. Now that the implementation has moved to Julia, I think we need much more comprehensive tests of the functionality. Previously we could just assume that byte_string_classify was correct and simply verify that we use it correctly—basically try one ASCII, one non-ASCII UTF-8 and one invalid string. Now that the implementation is here, we need to have comprehensive tests for it, ideally ones that test all the edge cases of this algorithm, which you're in the best position at this point to understand.

@ndinsmore (Contributor Author):

Now that everything goes through the DFA, would you agree that if the tests validate the DFA, we can assume byte_string_classify is validated?

@ndinsmore (Contributor Author):

Tests have been added to validate that the DFA state machine returns states as expected, per the Unicode spec.
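
For example, these are the kinds of boundary sequences such tests need to cover (a sketch of the idea, not the PR's actual test set):

```julia
using Test
# Boundary sequences from the UTF-8 definition (RFC 3629):
@test isvalid("\x7f")               # highest 1-byte code point
@test isvalid("\xc2\x80")           # lowest valid 2-byte sequence
@test !isvalid("\xc1\xbf")          # overlong 2-byte encoding
@test isvalid("\xed\x9f\xbf")       # U+D7FF, just below the surrogates
@test !isvalid("\xed\xa0\x80")      # U+D800, a surrogate code point
@test isvalid("\xf4\x8f\xbf\xbf")   # U+10FFFF, the maximum scalar value
@test !isvalid("\xf4\x90\x80\x80")  # beyond U+10FFFF
@test !isvalid("\xc2")              # truncated sequence
```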

@ndinsmore force-pushed the native_utf8_validation branch from 935c46d to 8006d60 on April 6, 2023 13:40
@ndinsmore (Contributor Author):

@mkitti The rebase is done at this point, but thank you.

Does anyone know why these tests are failing?

@mkitti (Contributor) commented Apr 6, 2023

I'm guessing the test failures are unrelated. It could be a problem on master or maybe CI.

@oscardssmith do we want to wait until master can pass tests, or is this mergeable now?

@oscardssmith (Member):

They look unrelated. I'm happy to merge as is if no one objects in the next day or two.

@ndinsmore (Contributor Author):

@mkitti & @oscardssmith this is a little off topic, but when I rebase I normally just merge master to my local master and rebase to that. I have had bad luck recently in picking masters that seem to fail tests. So the question is there a tagged master which is always passing all tests?

@oscardssmith (Member):

Not really. In theory, master is supposed to pass tests, but in practice that sometimes doesn't happen.

@mkitti (Contributor) commented Apr 6, 2023

Master does seem to be passing again... should we rerun the failing tests?

@ndinsmore (Contributor Author):

Is the REPL failure related to this? I am sure it is text-heavy, but I don't see any of the changed functions in this PR returning a Union.

@oscardssmith (Member):

I'm slightly scared by this so I'm re-running CI.

@ndinsmore (Contributor Author):

Looks like it cleared

@oscardssmith merged commit b554e8f into JuliaLang:master on Apr 12, 2023
@StefanKarpinski (Member):

Exciting times! Was considering how to review this, but it's well tested and I think we'll notice if string stuff is broken.

Xnartharax pushed a commit to Xnartharax/julia that referenced this pull request Apr 19, 2023
* Working Native UTF-8 Validation

---------

Co-authored-by: Oscar Smith <oscardssmith@gmail.com>
Co-authored-by: Steven G. Johnson <stevenj@mit.edu>
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 7, 2025
Rewrite UTF-8 validation in shift-based DFA for 53%~133% performance increase on non-ASCII strings

Take 2 of rust-lang#107760 (cc `@thomcc`)

### Background

About the technique: https://gist.github.com/pervognsen/218ea17743e1442e59bb60d29b1aa725

As stated in rust-lang#107760,
> For prior art: shift-DFAs are now used for UTF-8 validation in [PostgreSQL](https://github.com/postgres/postgres/blob/aa6954104644334c53838f181053b9f7aa13f58c/src/common/wchar.c#L1753), and seems to be in progress or under consideration for use in JuliaLang/julia#47880 and perhaps golang/go#47120. Of these, PG's impl is the most similar to this one, at least at a high level[1](rust-lang#107760 (comment)).

### Rationales

1. Performance: This algorithm gives a large performance increase when validating strings with many non-ASCII codepoints, which is the normal case for almost all non-English content.

2. Generality: It does not use SIMD instructions and does not rely on the branch predictor to get good performance, so it serves well as a general, default, architecture-agnostic implementation. There is still a bypass for ASCII-only strings to benefit from auto-vectorization, if the target supports it.

### Implementation details

I use the ordinary UTF-8 language definition from [RFC 3629](https://datatracker.ietf.org/doc/html/rfc3629#section-4) and directly translate it into a 9-state DFA. The compressed state is 64-bit, resulting in a table of `[u64; 256]`, or 2KiB of rodata.

The main algorithm consists of the following parts:
1. Main loop: taking a chunk of `MAIN_CHUNK_SIZE = 16` bytes on each iteration, execute the DFA on the chunk, and check whether the state is in ERROR once per chunk.
2. ASCII bypass: in each chunk iteration, if the current state is ACCEPT, we know we are not in the middle of an encoded sequence, so we can skip a large block of trivial ASCII and stop at the first chunk containing any non-ASCII bytes. I choose `ASCII_CHUNK_SIZE = 16` to align with the current implementation: taking 16 bytes at a time to check for non-ASCII, to encourage LLVM to auto-vectorize it.
3. Trailing chunk and error reporting: execute the DFA step by step, stop on error as soon as possible, and calculate the error/valid location. To keep things simple, if any error is encountered in the main loop, it discards the erroneous chunk and `break`s into this path to find the precise error location. That is, the erroneous chunk, if it exists, is traversed twice, in exchange for a tighter and more efficient hot loop.

There are also some small tricks being used:
1. Since we have i686-linux in Tier 1 support, and its 64-bit shift (SHRD) is quite slow in our latency-sensitive hot loop, I arrange the state storage so that the state transition can be done with a 32-bit shift and a conditional move. This shows a 200%+ speedup compared to the 64-bit-shift version.
2. We still need to get the UTF-8 encoded length from the first byte in `utf8_char_width`. I merge the previous lookup table into the unused high bits of the DFA transition table, so we don't need two tables. This did introduce an extra 32-bit shift. I believe it's almost free, but I have not benchmarked it yet.

### Benchmarks

I made an [out-of-tree implementation repository](https://github.com/oxalica/shift-dfa-utf8) for easier testing and benchmarking. It also tests various `MAIN_CHUNK_SIZE` (m) and `ASCII_CHUNK_SIZE` (a) configurations. Bench data are taken from the first 4 KiB (from the first paragraph, plain text not HTML, cut at a char boundary) of the Wikipedia article on [William Shakespeare in en](https://en.wikipedia.org/wiki/William_Shakespeare), [es](https://es.wikipedia.org/wiki/William_Shakespeare), and [zh](https://zh.wikipedia.org/wiki/%E5%A8%81%E5%BB%89%C2%B7%E8%8E%8E%E5%A3%AB%E6%AF%94%E4%BA%9A).

In short: with m=16, a=16, shift-DFA gives -43% on en, +53% on es, +133% on zh; with m=16, a=32, it gives -9% on en, +26% on es, +33% on zh. This is expected: the larger the ASCII bypass chunk is, the better it performs on ASCII but the worse on mixed content like es, because the bypass branch keeps flipping back and forth.

To me, the difference between 27 GB/s and 47 GB/s on en is minimal in absolute time (144.61 ns - 79.86 ns = 64.75 ns), compared to 476.05 ns - 392.44 ns = 83.61 ns on es. So I currently choose m=16, a=16 in the PR.

On x86_64-linux, Ryzen 7 5700G @ 3.775 GHz:

| Algorithm         | Input language | Throughput / (GiB/s)  |
|-------------------|----------------|-----------------------|
| std               | en             | 47.768 +-0.301        |
| shift-dfa-m16-a16 | en             | 27.337 +-0.002        |
| shift-dfa-m16-a32 | en             | 43.627 +-0.006        |
| std               | es             |  6.339 +-0.010        |
| shift-dfa-m16-a16 | es             |  9.721 +-0.014        |
| shift-dfa-m16-a32 | es             |  8.013 +-0.009        |
| std               | zh             |  1.463 +-0.000        |
| shift-dfa-m16-a16 | zh             |  3.401 +-0.002        |
| shift-dfa-m16-a32 | zh             |  3.407 +-0.001        |

### Unresolved

- [ ] Benchmark on aarch64-darwin, another tier 1 target.
  I don't have a machine to play with.

- [ ] Decide the chunk size parameters. I'm currently picking m=16, a=16.

- [ ] Should we also replace the implementation of [lossy conversion](https://github.com/oxalica/rust/blob/c0639b8cad126d886ddd88964f729dd33fb90e67/library/core/src/str/lossy.rs#L194) with a call to the new validation function?
  It has very similar code doing almost the same thing.
bors added a commit to rust-lang-ci/rust that referenced this pull request Feb 9, 2025