Convert String processing to DFA for improved performance #48887
base: master
Conversation
Some typos in comments
Very cool!! |
#48568 is now merged so hack away! |
I think that there is still a conversation to be had, started with @StefanKarpinski, as to whether IUTF-8 is the right name for this string specification. I would like to have some consensus before putting it in master. Also, should how to interpret the state machine be covered more thoroughly in the comments? |
@jakobnissen & @stevengj you both seem to do a lot of work in the broader string ecosystem. Do you have any benchmarks that you could run with this to see if there is a similar downstream benefit? |
I don't have any benchmarks for this, sorry. It looks like it's |
I have changed the labeling of IUTF8 to GUTF8, which is short for Generalized UTF-8. This comes after the realization that the state machine is executing the Generalized UTF-8 states; IUTF8 arises when you do the processing of invalid bytes, as Julia does. |
@oscardssmith & @StefanKarpinski |
In principle this seems good to me, but I haven't actually reviewed it in detail. |
@oscardssmith I rebased this because I wanted to include the work you did in #48996, but now it is failing two tests that I don't fully understand how to fix.
|
The failing tests are the ones I added in #48996. The problem is that our compiler completely gives up on proving termination in the presence of loops. |
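A minimal sketch (not part of this PR) of how one might check whether termination is proven for a simple loop over an integer range; Base.infer_effects is an internal API, so the exact output can vary between Julia versions:
# Hypothetical example: a trivial loop over an integer range.
function count_up(n::Int)
    s = 0
    for i in 1:n
        s += i
    end
    return s
end
# The `terminates` field of the returned effects indicates whether the
# compiler could prove that this loop finishes.
Base.infer_effects(count_up, (Int,))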
@nanosoldier |
Let's run a PkgEval and merge if it comes back green. |
I wonder how hard it would be to add a special case for a loop over an integer range. |
Your package evaluation job has completed - possible new issues were detected. |
The StrBase error should probably be looked at. |
Yeah, it is a bug here somewhere. With this PR:
It's pretty amazing there is no test for this. |
Cleanup
Update base/strings/string.jl (Co-authored-by: Mosè Giordano <giordano@users.noreply.github.com>)
Update base/strings/string.jl (Co-authored-by: Mosè Giordano <giordano@users.noreply.github.com>)
Iterate bug fix
Change condition
After the rebase, the results still show benefits with minor regressions. I think that a lot of the regressions can be fixed with some further changes.
7-element BenchmarkTools.BenchmarkGroup:
tags: []
"isascii" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files SubString" => TrialJudgement(-0.49% => invariant)
"files" => TrialJudgement(+16.69% => regression)
"lines" => TrialJudgement(-0.11% => invariant)
"lines SubString" => TrialJudgement(+0.39% => invariant)
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-7.08% => invariant)
"length 2:64" => TrialJudgement(-9.20% => invariant)
"length 4096:32768" => TrialJudgement(-7.40% => invariant)
"length 512:4096" => TrialJudgement(-4.67% => invariant)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-6.88% => invariant)
"length 2:64" => TrialJudgement(+10.15% => invariant)
"length 4096:32768" => TrialJudgement(+1.76% => invariant)
"length 512:4096" => TrialJudgement(+2.03% => invariant)
"Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-23.46% => improvement)
"length 2:64" => TrialJudgement(-14.29% => invariant)
"length 4096:32768" => TrialJudgement(-1.69% => invariant)
"length 512:4096" => TrialJudgement(-2.55% => invariant)
"isvalid" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files SubString" => TrialJudgement(+0.10% => invariant)
"files" => TrialJudgement(-0.81% => invariant)
"lines" => TrialJudgement(+0.40% => invariant)
"lines SubString" => TrialJudgement(+3.14% => invariant)
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-0.09% => invariant)
"length 2:64" => TrialJudgement(-0.22% => invariant)
"length 4096:32768" => TrialJudgement(+0.43% => invariant)
"length 512:4096" => TrialJudgement(-0.22% => invariant)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-0.20% => invariant)
"length 2:64" => TrialJudgement(+0.80% => invariant)
"length 4096:32768" => TrialJudgement(-0.45% => invariant)
"length 512:4096" => TrialJudgement(+1.06% => invariant)
"Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-0.36% => invariant)
"length 2:64" => TrialJudgement(-1.38% => invariant)
"length 4096:32768" => TrialJudgement(+0.02% => invariant)
"length 512:4096" => TrialJudgement(+0.00% => invariant)
"length" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files SubString" => TrialJudgement(-78.61% => improvement)
"files" => TrialJudgement(-85.06% => improvement)
"lines" => TrialJudgement(-35.68% => improvement)
"lines SubString" => TrialJudgement(-12.63% => invariant)
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-22.90% => improvement)
"length 2:64" => TrialJudgement(+33.49% => regression)
"length 4096:32768" => TrialJudgement(-95.13% => improvement)
"length 512:4096" => TrialJudgement(-84.73% => improvement)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-93.42% => improvement)
"length 2:64" => TrialJudgement(-56.31% => improvement)
"length 4096:32768" => TrialJudgement(-97.34% => improvement)
"length 512:4096" => TrialJudgement(-96.81% => improvement)
"Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(+135.01% => regression)
"length 2:64" => TrialJudgement(+83.94% => regression)
"length 4096:32768" => TrialJudgement(+124.23% => regression)
"length 512:4096" => TrialJudgement(+117.76% => regression)
"thisind" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(+3.37% => invariant)
"unicode" => TrialJudgement(+30.63% => regression)
"malformed" => TrialJudgement(-25.35% => improvement)
"nextind" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(+0.51% => invariant)
"unicode" => TrialJudgement(-15.73% => improvement)
"malformed" => TrialJudgement(-40.26% => improvement)
"iterate" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(-5.16% => invariant)
"unicode" => TrialJudgement(+47.32% => regression)
"malformed" => TrialJudgement(-35.32% => improvement)
"getindex" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"4-byte" => TrialJudgement(+6.87% => invariant)
"3-byte" => TrialJudgement(+0.00% => invariant)
"1-byte" => TrialJudgement(+0.00% => invariant)
"2-byte" => TrialJudgement(+0.51% => invariant)
|
I say we do it! |
There are a few steps I would like to take with this code before merging.
|
Using Memory for lookup tables is not faster than Vector, as the memory indirection seems to be compiled away, but Tuples may be faster (they were the last time I benchmarked a different application with a length-256 UInt8 LUT). |
Using Memory seems potentially better for something like this because it's just less overhead both in terms of the data structure and in terms of not requiring any analysis to prove that the length or location doesn't change. Tuples might have the advantage of being immutable but we could maybe have an immutable variant of Memory. |
Or just use a Tuple? IIRC that's what has worked best in these type of lookup tables (cf #37802) |
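For context, an illustrative sketch (not the PR's actual table) of a length-256 UInt8 lookup table stored as a Vector versus a Tuple; the names and contents below are made up:
# Hypothetical 256-entry lookup tables; the contents are arbitrary.
const LUT_VEC = UInt8[i % 4 for i in 0:255]          # heap-allocated, mutable container
const LUT_TUP = Tuple(UInt8(i % 4) for i in 0:255)   # immutable, fully known to the compiler
# Convert the byte to Int before indexing so 0xff does not wrap around.
lookup_vec(b::UInt8) = @inbounds LUT_VEC[Int(b) + 1]
lookup_tup(b::UInt8) = @inbounds LUT_TUP[Int(b) + 1]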
The switch to tuples showed no real benefit, but if it is the recommended option I will stick with it.
Vector vs Tuple Results
7-element BenchmarkTools.BenchmarkGroup:
tags: []
"isascii" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files SubString" => TrialJudgement(-0.11% => invariant)
"files" => TrialJudgement(+7.60% => invariant)
"lines" => TrialJudgement(-2.13% => invariant)
"lines SubString" => TrialJudgement(-4.75% => invariant)
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(+8.68% => invariant)
"length 2:64" => TrialJudgement(+0.88% => invariant)
"length 4096:32768" => TrialJudgement(+1.59% => invariant)
"length 512:4096" => TrialJudgement(+1.14% => invariant)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(+7.86% => invariant)
"length 2:64" => TrialJudgement(+15.09% => regression)
"length 4096:32768" => TrialJudgement(+0.45% => invariant)
"length 512:4096" => TrialJudgement(+0.01% => invariant)
"Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(+41.63% => regression)
"length 2:64" => TrialJudgement(-1.33% => invariant)
"length 4096:32768" => TrialJudgement(-4.62% => invariant)
"length 512:4096" => TrialJudgement(+4.07% => invariant)
"isvalid" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files SubString" => TrialJudgement(+15.72% => regression)
"files" => TrialJudgement(+1.23% => invariant)
"lines" => TrialJudgement(+7.68% => invariant)
"lines SubString" => TrialJudgement(+3.67% => invariant)
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-0.04% => invariant)
"length 2:64" => TrialJudgement(+7.08% => invariant)
"length 4096:32768" => TrialJudgement(+4.18% => invariant)
"length 512:4096" => TrialJudgement(+1.35% => invariant)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(+5.09% => invariant)
"length 2:64" => TrialJudgement(+9.21% => invariant)
"length 4096:32768" => TrialJudgement(+0.45% => invariant)
"length 512:4096" => TrialJudgement(+1.43% => invariant)
"Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(+6.89% => invariant)
"length 2:64" => TrialJudgement(+5.75% => invariant)
"length 4096:32768" => TrialJudgement(+7.14% => invariant)
"length 512:4096" => TrialJudgement(+6.86% => invariant)
"length" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"julia 1.9 source" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"files SubString" => TrialJudgement(+3.84% => invariant)
"files" => TrialJudgement(+3.90% => invariant)
"lines" => TrialJudgement(+3.35% => invariant)
"lines SubString" => TrialJudgement(+4.22% => invariant)
"single nonASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-2.74% => invariant)
"length 2:64" => TrialJudgement(-1.64% => invariant)
"length 4096:32768" => TrialJudgement(-1.19% => invariant)
"length 512:4096" => TrialJudgement(-2.03% => invariant)
"ASCII" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(+2.46% => invariant)
"length 2:64" => TrialJudgement(+1.80% => invariant)
"length 4096:32768" => TrialJudgement(+0.87% => invariant)
"length 512:4096" => TrialJudgement(+0.36% => invariant)
"Unicode" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"length 64:512" => TrialJudgement(-1.00% => invariant)
"length 2:64" => TrialJudgement(-0.61% => invariant)
"length 4096:32768" => TrialJudgement(+8.12% => invariant)
"length 512:4096" => TrialJudgement(+14.24% => invariant)
"thisind" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(+5.74% => invariant)
"unicode" => TrialJudgement(+5.37% => invariant)
"malformed" => TrialJudgement(-0.62% => invariant)
"nextind" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(+7.18% => invariant)
"unicode" => TrialJudgement(+7.91% => invariant)
"malformed" => TrialJudgement(+7.23% => invariant)
"iterate" => 3-element BenchmarkTools.BenchmarkGroup:
tags: []
"ascii" => TrialJudgement(+8.22% => invariant)
"unicode" => TrialJudgement(+7.63% => invariant)
"malformed" => TrialJudgement(+8.73% => invariant)
"getindex" => 4-element BenchmarkTools.BenchmarkGroup:
tags: []
"4-byte" => TrialJudgement(+4.35% => invariant)
"3-byte" => TrialJudgement(+12.28% => invariant)
"1-byte" => TrialJudgement(+0.85% => invariant)
"2-byte" => TrialJudgement(+0.01% => invariant) <\details> |
If they're the same, speed-wise, then Memory might be better for the compiler, since it avoids creating unnecessary specializations. But 🤷 no big deal either way. |
What specialization? |
That depends on when the tuples are created and what is done with them after. The code suggests that it would generate at least But the use of tuples for top-level constants in Julia does lead to a lot of code. E.g. there are >600 specializations for |
All the code was converted to tuples in the constant creation by changing the return value to |
As I am only testing these on an M3 MacBook Pro, it might be prudent to have an x86 user run the benchmarks as well. The repo for the benchmarks is ndinsmore/StringBenchmarks.
Using master:
using BenchmarkTools
using StringBenchmarks
results = run(StringBenchmarks.SUITE)
BenchmarkTools.save("baseline.json", results)
Using the iutf branch:
using BenchmarkTools
using StringBenchmarks
results = run(StringBenchmarks.SUITE)
baseline_results = BenchmarkTools.load("baseline.json")[1]
BenchmarkTools.judge(minimum(results), minimum(baseline_results)) |
Does anyone remember why the length function doesn't just count the continuation bytes with If we assume validation of strings is done at creation, this method combined with the speed of |
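A hedged sketch of the idea in the question above, assuming the input is already valid UTF-8 (which, as noted below, Julia does not guarantee): the character count equals the number of code units that are not continuation bytes.
# Continuation bytes match the bit pattern 0b10xxxxxx, i.e. (b & 0xc0) == 0x80.
utf8_length_if_valid(s::AbstractString) = count(b -> (b & 0xc0) != 0x80, codeunits(s))
utf8_length_if_valid("héllo")  # 5, matching length("héllo") for valid input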
Strings are not validated in Julia. |
Out of curiosity, when is it necessary to run the FSA in reverse? |
I wonder if a code-based (as opposed to table-based) FSA implementation was evaluated? The number of states is small, so this should be very manageable and result in better performance. It's what I do in ParseUnparseJSON.jl to tokenize JSON. |
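To illustrate the distinction, a simplified sketch of a code-based FSA (not ParseUnparseJSON.jl's or this PR's implementation; the real UTF-8 automaton has stricter continuation ranges for E0/ED/F0/F4 leads):
# Simplified UTF-8 byte classifier written as branches instead of a lookup table.
function next_state(state::Symbol, b::UInt8)
    if state === :start
        b < 0x80 && return :start                 # ASCII byte
        0xc2 <= b <= 0xdf && return :need1        # 2-byte lead
        0xe0 <= b <= 0xef && return :need2        # 3-byte lead (simplified)
        0xf0 <= b <= 0xf4 && return :need3        # 4-byte lead (simplified)
        return :invalid
    elseif 0x80 <= b <= 0xbf                      # continuation byte
        state === :need1 && return :start
        state === :need2 && return :need1
        state === :need3 && return :need2
    end
    return :invalid
end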
This PR changes most of the processing of the base String code over to a DFA state machine that can run both forward and reverse. This is similar to the DFA used in #47880.
It also introduces the idea that Julia's base Strings use a format we call inclusive/invalid UTF-8, or IUTF-8. IUTF-8 is a superset of UTF-8 and encodes how Julia will split up the code units of a string.
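For example (output hedged; the exact display can vary), Julia groups the code units of an ill-formed sequence into a single invalid Char rather than erroring, which is the splitting behavior IUTF-8 describes:
s = String([0x61, 0xf0, 0x9f, 0x62])  # 'a', a truncated 4-byte sequence, 'b'
collect(s)                             # ['a', '\xf0\x9f', 'b']
length(s)                              # 3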
This PR would be simplified by merging #48568 first, as it borrows methods that were used there. Furthermore, with that PR merged, there will be speedups in most of the important String methods.
Performance benchmarks using ndinsmore/StringBenchmarks show benefits of around 20% nearly across the board, with length showing a benefit of 20x for long ASCII strings. Updated to include benchmarks.
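A hedged example of the kind of measurement behind the 20x figure (the variable name and size are illustrative; results depend on the machine and Julia version):
using BenchmarkTools
ascii_str = repeat("a", 10_000)      # long ASCII string
@btime length($ascii_str)            # compare on master vs. this branch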