Lower regex MaxUnrollSize from 16 to 7 #126092
Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
Pull request overview
This PR adjusts the unrolling threshold used by both the regex compiler and the regex source generator when emitting fixed-count single-character repeaters, favoring vectorized implementations sooner based on benchmarked crossover points.
Changes:
- Lower `MaxUnrollSize` from 16 to 8 in the compiled regex engine (`RegexCompiler`).
- Lower `MaxUnrollSize` from 16 to 8 in the regex source generator emitter (`RegexGenerator.Emitter`).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs | Lowers the unroll-vs-vectorize threshold for compiled regex repeater emission. |
| src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs | Lowers the same threshold for source-generated regex emission to keep behavior aligned. |
Just curious, what would it take for `IndexOf` (or whatever you're calling for >= 8) to do the same fast scalar thing for all lengths, as an implementation detail? I.e. `if (span.Length < 8) { /* scalar */ } else { VectorizedPath(span); /* not inlined */ }`, with the JIT figuring out that `Slice(foo, 3)` or whatever can be expanded to a char-by-char comparison like the source gen is doing here. I guess there are a bunch of pieces: it would have to propagate the constant through to elide the slice, figure out that inlining is worthwhile, and somehow specialize it for a constant argument (waving hands, as this isn't my domain).
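The length-based dispatch described in this comment could look roughly like the following. This is a minimal sketch, not code from the PR or from the runtime: `DispatchSketch`, `ScalarCutoff`, `ContainsNonDigit`, and `ContainsNonDigitVectorized` are hypothetical names, and the cutoff of 8 is simply the number used in the comment. It assumes .NET 8 or later for `ContainsAnyExceptInRange`. Whether the JIT would actually fold the short path into per-character checks for a constant-length slice is the open question being asked; the sketch does not demonstrate that.

```csharp
using System;
using System.Runtime.CompilerServices;

static class DispatchSketch
{
    // Hypothetical cutoff, taken from the "< 8" in the comment above.
    const int ScalarCutoff = 8;

    // Returns true if the span contains any character outside '0'..'9'.
    public static bool ContainsNonDigit(ReadOnlySpan<char> span)
    {
        if (span.Length < ScalarCutoff)
        {
            // Short input: plain scalar loop with early exit on the first mismatch.
            foreach (char c in span)
            {
                if (!char.IsAsciiDigit(c))
                {
                    return true;
                }
            }
            return false;
        }

        // Long input: defer to the vectorized helper, kept out of line.
        return ContainsNonDigitVectorized(span);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static bool ContainsNonDigitVectorized(ReadOnlySpan<char> span) =>
        span.ContainsAnyExceptInRange('0', '9');
}
```

For example, `DispatchSketch.ContainsNonDigit("123")` takes the scalar path, while `DispatchSketch.ContainsNonDigit("1234567890")` goes through the vectorized helper.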
Force-pushed from ad17533 to 76f9012
/ba-g unrelated mono BadExits
For fixed-count single-character repeaters (e.g. `\d{N}`), the regex source generator and compiler choose between unrolling individual character checks and using vectorized operations like `ContainsAnyExcept`. `MaxUnrollSize` was the threshold controlling this decision, previously set to 16.
Benchmarking across multiple character class types (`\d`, `[^x]`, `[abc]`, `[a-zA-Z]`) shows the crossover point where vectorization wins is consistently between count 4 and 8.
At count 8+, vectorized operations win across all character class types. At count ≤4, the unrolled loop wins due to lower overhead and early exit on mismatch.
This PR lowers the threshold from 16 to 8 in both `RegexGenerator.Emitter.cs` (source generator) and `RegexCompiler.cs` (compiled engine), so that repeaters with counts 9–16 now use vectorized operations instead of unrolled scalar checks.
Note
This PR was generated with the assistance of GitHub Copilot.
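To make the two strategies concrete, here is a rough sketch of what each shape might look like for an ASCII-digit repeater such as `[0-9]{N}`. It is illustrative only and is not the code the source generator or compiler emits: `RepeaterSketch`, `ShouldUnroll`, and the two `Match…` helpers are hypothetical names, the `count <= MaxUnrollSize` comparison is an assumption inferred from the statement that counts 9–16 become vectorized at a threshold of 8, and `[0-9]` is used instead of `\d` because `\d` also matches non-ASCII digits. It assumes .NET 8 or later.

```csharp
using System;

static class RepeaterSketch
{
    // Hypothetical stand-in for the threshold discussed in this PR.
    const int MaxUnrollSize = 8;

    // Assumed decision rule: unroll at or below the threshold, vectorize above it.
    public static bool ShouldUnroll(int count) => count <= MaxUnrollSize;

    // Unrolled shape for [0-9]{4}: one scalar check per position,
    // short-circuiting at the first character that is not a digit.
    public static bool MatchFourDigitsUnrolled(ReadOnlySpan<char> slice) =>
        slice.Length >= 4 &&
        char.IsAsciiDigit(slice[0]) &&
        char.IsAsciiDigit(slice[1]) &&
        char.IsAsciiDigit(slice[2]) &&
        char.IsAsciiDigit(slice[3]);

    // Vectorized shape for [0-9]{count}: a single span operation over the
    // first `count` characters instead of `count` separate checks.
    public static bool MatchDigitsVectorized(ReadOnlySpan<char> slice, int count) =>
        slice.Length >= count &&
        !slice.Slice(0, count).ContainsAnyExceptInRange('0', '9');
}
```

The unrolled form pays one comparison per character but can stop at the first mismatch; the vectorized form has a fixed setup cost that only pays off once the count is large enough, which is the crossover the benchmarks above are measuring.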