Commit ba4c35d

Fix RegexCompiler regression on 32-bit for some set matching (#68655)
We added an optimization to regex for sets whose values all lie within 64 characters of each other (e.g. all hex digits): a ulong represents the set as a bitmap, and the membership check can be implemented in an entirely branchless manner. This results in a measurable win on 64-bit, e.g. upwards of 20% for some patterns. Unfortunately, it also results in a measurable regression on 32-bit. This PR fixes that for RegexCompiler by applying the optimization only when IntPtr.Size == 8. The source generator doesn't have the same luxury of knowing that the code is emitted and used on the same bitness. Emitting multiple implementations would result in very complicated code, and we generally prefer optimizing for 64-bit, so I've left the optimization in for the source generator.
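
To make the bitmap idea concrete, here is a minimal sketch of a branchless set check against a ulong. It is an illustration under stated assumptions, not the code RegexCompiler or the source generator actually emits: the class name `BitmapSetSketch` and the helpers `BuildBitmap` and `Contains` are hypothetical, and the bit layout (bit 63 stands for the lowest character) is chosen here to match the description in the diff comments.

```csharp
using System;

// Hypothetical illustration of a 64-bit bitmap set check; names are not from the PR.
public static class BitmapSetSketch
{
    // Builds a bitmap for the characters in `members`, all of which must lie in
    // [low, low + 64). The most significant bit (bit 63) corresponds to `low`.
    public static ulong BuildBitmap(char low, string members)
    {
        ulong bitmap = 0;
        foreach (char c in members)
        {
            int offset = c - low;
            if (offset < 0 || offset >= 64)
            {
                throw new ArgumentOutOfRangeException(nameof(members));
            }

            bitmap |= 1UL << (63 - offset);
        }

        return bitmap;
    }

    // Branchless membership test. The subtraction is done in 32 bits and then
    // zero-extended, so a character below `low` becomes a large positive value.
    public static bool Contains(ulong bitmap, char low, char ch)
    {
        ulong charMinusLow = (uint)(ch - low);

        // (charMinusLow - 64) wraps around and has its top bit set only when
        // charMinusLow < 64, which masks out characters outside the window.
        // The shift moves the character's bit into the top position.
        return (long)((bitmap << (int)charMinusLow) & (charMinusLow - 64)) < 0;
    }
}
```

For a set such as [0-9A-Fa-f], the bitmap would be built once with `BuildBitmap('0', "0123456789ABCDEFabcdef")` and each character checked with `Contains`. Every step works on 64-bit values, which is cheap where registers are 64 bits wide and comparatively costly where they are not.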
1 parent 74b7d55 commit ba4c35d

2 files changed: +7 -1 lines changed

src/libraries/System.Text.RegularExpressions/gen/RegexGenerator.Emitter.cs

Lines changed: 4 additions & 0 deletions
@@ -4331,6 +4331,10 @@ private static string MatchCharacterClass(RegexOptions options, string chExpr, s
 // Next, handle sets where the high - low + 1 range is <= 64. In that case, we can emit
 // a branchless lookup in a ulong that does not rely on loading any objects (e.g. the string-based
 // lookup we use later). This nicely handles common sets like [0-9A-Fa-f], [0-9a-f], [A-Za-z], etc.
+// Note that unlike RegexCompiler, the source generator doesn't know whether the code is going to be
+// run in a 32-bit or 64-bit process: in a 64-bit process, this is an optimization, but in a 32-bit process,
+// it's a deoptimization. In general we optimize for 64-bit perf, so this code remains; it complicates
+// the code too much to try to include both this and a fallback for the check.
 if (analysis.OnlyRanges && (analysis.UpperBoundExclusiveIfOnlyRanges - analysis.LowerBoundInclusiveIfOnlyRanges) <= 64)
 {
     additionalDeclarations.Add("ulong charMinusLow;");

src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexCompiler.cs

Lines changed: 3 additions & 1 deletion
@@ -5203,7 +5203,9 @@ private void EmitMatchCharacterClass(string charClass)
 // Next, handle sets where the high - low + 1 range is <= 64. In that case, we can emit
 // a branchless lookup in a ulong that does not rely on loading any objects (e.g. the string-based
 // lookup we use later). This nicely handles common sets like [0-9A-Fa-f], [0-9a-f], [A-Za-z], etc.
-if (analysis.OnlyRanges && (analysis.UpperBoundExclusiveIfOnlyRanges - analysis.LowerBoundInclusiveIfOnlyRanges) <= 64)
+// We skip this on 32-bit, as otherwise using 64-bit numbers in this manner is a deoptimization
+// when compared to the subsequent fallbacks.
+if (IntPtr.Size == 8 && analysis.OnlyRanges && (analysis.UpperBoundExclusiveIfOnlyRanges - analysis.LowerBoundInclusiveIfOnlyRanges) <= 64)
 {
     // Create the 64-bit value with 1s at indices corresponding to every character in the set,
     // where the bit is computed to be the char value minus the lower bound starting from
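
The `IntPtr.Size == 8` gate can be pictured with a small standalone sketch. Everything in it is an assumption made for illustration: the class `HexDigitMatcherSketch` and its methods are hypothetical, the comparison-based path is only a stand-in for the "subsequent fallbacks" (which this diff does not show), and the sketch checks the bitness at call time, whereas RegexCompiler makes the decision while emitting IL.

```csharp
using System;

// Hypothetical illustration of gating a 64-bit bitmap check on process bitness.
public static class HexDigitMatcherSketch
{
    // Bitmap for [0-9A-Fa-f] relative to '0'; bit 63 corresponds to '0'.
    private static readonly ulong s_bitmap = BuildBitmap();

    private static ulong BuildBitmap()
    {
        ulong bitmap = 0;
        foreach (char c in "0123456789ABCDEFabcdef")
        {
            bitmap |= 1UL << (63 - (c - '0'));
        }

        return bitmap;
    }

    // Uses the ulong bitmap only when pointers are 8 bytes wide (a 64-bit process).
    public static bool IsHexDigit(char ch) =>
        IntPtr.Size == 8 ? MatchWithBitmap(ch) : MatchWithComparisons(ch);

    // 64-bit path: branchless lookup in the bitmap.
    public static bool MatchWithBitmap(char ch)
    {
        ulong charMinusLow = (uint)(ch - '0');
        return (long)((s_bitmap << (int)charMinusLow) & (charMinusLow - 64)) < 0;
    }

    // Stand-in fallback that sticks to 32-bit arithmetic: two range checks,
    // with letters folded to lowercase via | 0x20.
    public static bool MatchWithComparisons(char ch) =>
        (uint)(ch - '0') <= 9 || (uint)((ch | 0x20) - 'a') <= 5;
}
```

Both paths are meant to return the same answer for every character, so the gate affects only which instructions run, not the match result.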
