ieviev/resharp-dotnet

RE#

A high-performance, automata-based regex engine with first-class support for intersection and complement operations.

RE# compiles patterns into deterministic automata. All matching is non-backtracking with guaranteed linear-time execution. RE# extends System.Text.RegularExpressions syntax with intersection (&), complement (~), and a universal wildcard (_), enabling patterns that are impossible or impractical to express with standard regex.

web playground | paper | blog post

Install

dotnet add package Resharp

Usage

// contains "cat", "dog", AND is 8-15 characters long
var re = new Resharp.Regex(@".*cat.*&.*dog.*&.{8,15}");

// instances are thread-safe, compile once and reuse
re.Matches("the cat and the dog");

Syntax extensions

RE# supports standard .NET regex syntax plus three extensions:

_ -- universal wildcard

Matches any character, including newlines (equivalent to [\s\S]).

& -- intersection

Both sides must match. The match is the intersection of the two languages.

_*cat_*&_*dog_*       contains both "cat" and "dog"
_*cat_*&_*dog_*&_{5,30}  ...and is 5-30 characters long
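To make "intersection of the two languages" concrete, here is a minimal sketch that emulates the pattern above with the standard System.Text.RegularExpressions engine, assuming we only care about whether a whole string belongs to the language. `InLanguage` and `CatAndDog` are hypothetical helpers, not part of RE#, and `[\s\S]` stands in for `_`. RE# itself handles this in a single pattern and can also locate matches inside larger inputs, which this sketch does not attempt.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the WHOLE input is in the language of `pattern`.
bool InLanguage(string input, string pattern) =>
    Regex.IsMatch(input, @"\A(?:" + pattern + @")\z");

// Emulates _*cat_*&_*dog_*&_{5,30} as three separate membership tests.
// A string is in the intersection only if it is in every operand's language.
bool CatAndDog(string s) =>
    InLanguage(s, @"[\s\S]*cat[\s\S]*") &&
    InLanguage(s, @"[\s\S]*dog[\s\S]*") &&
    InLanguage(s, @"[\s\S]{5,30}");

Console.WriteLine(CatAndDog("cat and dog")); // True
Console.WriteLine(CatAndDog("just a cat"));  // False
```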

~(...) -- complement

Matches everything the inner pattern does not match.

~(_*\d\d_*)     does not contain two consecutive digits
~(_*\n\n_*)     does not contain a double newline
Why `_*` and not `.*`? `_` was added to the syntax specifically so that complement reads intuitively. `.*` does not match newlines, so it means "any single line" rather than "any string".
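The difference is easy to check with the standard .NET engine, which treats `.` the same way under default options. This small sketch uses `[\s\S]` as a stand-in for `_`, since `_` is an RE# extension and is not understood by System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

string twoLines = "first line\nsecond line";

// With default options, `.` excludes '\n', so `.*` cannot span the whole input.
bool dotStar = Regex.IsMatch(twoLines, @"\A.*\z");

// [\s\S] accepts every character, which is how `_` is described above.
bool anyStar = Regex.IsMatch(twoLines, @"\A[\s\S]*\z");

Console.WriteLine(dotStar); // False
Console.WriteLine(anyStar); // True
```

This is why `~(.*\d\d.*)` would be a poor way to say "no two consecutive digits": the inner pattern only describes single-line strings, so its complement would include every multi-line string.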

Combining operators

F.*&~(_*Finn)                starts with 'F', does not end with "Finn"
~(_*\d\d_*)&[a-zA-Z\d]{8,}  8+ alphanumeric, no consecutive digits
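As a rough sketch of what these combined patterns mean, complement can be read as a negated whole-string membership test and intersection as a logical AND. The code below emulates both examples with the standard .NET engine. All helper names are hypothetical and `[\s\S]` replaces `_`. It only decides whether an entire string is in the language, and is not how RE# evaluates the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: true when the WHOLE input is in the language of `pattern`.
bool InLanguage(string input, string pattern) =>
    Regex.IsMatch(input, @"\A(?:" + pattern + @")\z");

// Emulates F.*&~(_*Finn): in the language of F.* and NOT in the language of _*Finn.
bool StartsWithFNotEndingFinn(string s) =>
    InLanguage(s, @"F.*") && !InLanguage(s, @"[\s\S]*Finn");

// Emulates ~(_*\d\d_*)&[a-zA-Z\d]{8,}
bool AlnumNoDoubleDigit(string s) =>
    !InLanguage(s, @"[\s\S]*\d\d[\s\S]*") && InLanguage(s, @"[a-zA-Z\d]{8,}");

Console.WriteLine(StartsWithFNotEndingFinn("Frodo")); // True
Console.WriteLine(StartsWithFNotEndingFinn("Finn"));  // False
Console.WriteLine(AlnumNoDoubleDigit("a1b2c3d4"));    // True
Console.WriteLine(AlnumNoDoubleDigit("abc12def"));    // False
```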

Performance

RE# uses several optimizations: start-set inference, literal prefix scanning, and optional full DFA precompilation. It shares many optimizations with .NET's regex engine (and RegexOptions.NonBacktracking in turn shares some RE# techniques and strengths, which is not widely known), but RE# is designed from the ground up and returns a different kind of match: leftmost-longest.
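The leftmost-longest distinction matters when alternatives overlap. The sketch below shows what .NET's default backtracking engine returns for `foo|foobar`, next to a hypothetical `LeftmostLongest` helper that spells out the leftmost-longest rule for plain literal alternatives. The helper is an illustration of the semantics only. It is not RE#'s algorithm and does not call RE#.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "foobar";

// The backtracking engine tries alternatives in order and keeps the first that succeeds.
string firstWins = Regex.Match(input, "foo|foobar").Value;

// Hypothetical helper: at the earliest position where any literal occurs,
// return the longest literal that occurs there.
string LeftmostLongest(string text, string[] literals)
{
    for (int start = 0; start < text.Length; start++)
    {
        string best = "";
        foreach (string lit in literals)
            if (lit.Length > best.Length &&
                text.Substring(start).StartsWith(lit, StringComparison.Ordinal))
                best = lit;
        if (best.Length > 0) return best;
    }
    return ""; // no alternative occurs anywhere
}

string longestWins = LeftmostLongest(input, new[] { "foo", "foobar" });

Console.WriteLine(firstWins);   // foo
Console.WriteLine(longestWins); // foobar
```

Code ported from System.Text.RegularExpressions that relies on alternative order may therefore see different match boundaries under leftmost-longest semantics.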

RE# particularly excels with large patterns and will often outperform .NET's regex engine (and all others) on complex patterns, especially those with a large set of alternatives, loops, or context awareness. RE# also supports lookarounds, which is unique among automata engines.

To illustrate, here is a small comparison of RE# with .NET's most widely used engines, compiled and source-generated. Wider comparisons are available in the paper.

On curated benchmarks from rebar (AMD Ryzen 7 5800X, .NET 10.0):

| Pattern | RE# | .NET Compiled | .NET SourceGenerated | Speedup |
|---|---|---|---|---|
| date extraction | 1,737 us | 273,822 us | 318,070 us | 158x |
| dictionary search | 105 us | 45,832 us | 26,410 us | 252x |

And on some extensions we added ourselves:

| Pattern | RE# | .NET Compiled | .NET SourceGenerated | Speedup |
|---|---|---|---|---|
| dictionary, case-insensitive | 576 us | 29,368 us | 21,146 us | 37x |
| unicode dictionary | 336 us | 62,053 us | 38,613 us | 115x |
| unicode dictionary, case-insensitive | 321 us | 484,135 us | 537,814 us | 1,508x |
| dictionary + context window | 621 us | 48,893 us | 55,383 us | 79x |
| dictionary + context window, unicode | 692 us | 24,105,091 us | 34,706,982 us | 34,833x |

Where is RegexOptions.NonBacktracking?

Conveniently left out for shock effect :^). NonBacktracking is actually much closer to RE#, but still behind. See the paper for a fairer comparison.

For critical paths, you can use ValueMatches for memory-pooled matching and ResharpOptions.HighThroughputDefaults for more aggressive optimization.

var re = new Resharp.Regex("pattern", ResharpOptions.HighThroughputDefaults);
using var slices = re.ValueMatches(chars);
foreach (var s in slices)
    Console.WriteLine($"match at {s.Index}..{s.Index + s.Length}");

Documentation

Examples

Runnable scripts in examples/:

| File | Description |
|---|---|
| basic-syntax.fsx | wildcards, intersection, complement |
| paragraph.fsx | paragraph extraction with complement and intersection |
| validation.fsx | date, IP, password validation with intersection |
| replace.fsx | string and function-based replacement |
| lookaround.fsx | lookahead, lookbehind, combined with intersection |
| high-throughput.fsx | zero-allocation matching for large inputs |
| Basic.cs | C# usage |

Have fun!
