Description
Background and motivation
As explained in #62447, we want to use more sophisticated string searching algorithms such as Aho-Corasick and the vectorized Teddy in regexes.
Since these algorithms are quite complex, to avoid duplicating their logic in source-generated Regex
code, I propose to add a class that performs such searching efficiently, with the more complex pre-processing being performed at construction-time, which will essentially serve the role of an optimized string.IndexOf(string[])
.
API Proposal
namespace System.Text;
public sealed class MultiStringMatcher
{
public ReadOnlySpan<string> Strings { get; }
public MultiStringMatcher(ReadOnlySpan<string> strings);
public MultiStringMatcher(params string[] strings);
public MultiStringMatcher(MultiStringMatcherOptions options, ReadOnlySpan<string> strings);
public MultiStringMatcher(MultiStringMatcherOptions options, params string[] strings);
public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
public (int Index, int StringNumber) Find(string text);
}
[Flags]
public enum MultiStringMatcherOptions
{
None = 0,
CaseInsensitive = 1
}
Creating a MultiStringMatcher
accepts an array or span of the strings it will recognize. The Find
method accepts a string or read-only span of characters, and returns the position of the longest leftfost match of one of the strings passed to the constructor, and which of the strings it found. If it does not find anything, it will return (-1, -1)
. These strings are also available for later inspection through the Strings
property.
The constructors are overloaded to accept a MultiStringMatcherOptions
enum. Currently it allows case-insensitive searching (can be easily done in Aho-Corasick by adding additional edges to the trie, but Teddy won't be used if enabled), and in the future it may be used to enable dynamically generated code for the Aho-Corasick state transitions, similar to RegexOptions.Compiled
.
If an empty array of strings is passed to the constructors, the resulting matcher's Find
method will always return (-1, -1)
. If the same string appears in the constructor many times, the Find
method could either return the index of the first, the index of the last, or throw at construction time (I'm not a fan of this option).
API Usage
var matcher = new MultiStringMatcher("foo", "bar", "baz");
Console.WriteLine(matcher.Find("foobar")); // Will print (0, 0)
Alternative Designs
An important design decision lies on whether we want this API to be general-purpose and usable by itself, or to provide only what source-generated Regex
es need. I imagine the API reviewers want to pursue the latter direction. In this case the API would be simplified to:
// The namespace changed.
namespace System.Text.RegularExpressions;
public sealed class MultiStringMatcher
{
// It does not cost much to provide this API but could be removed as well.
public ReadOnlySpan<string> Strings { get; }
// We remove the user-friendly superfluous overloads.
// If we support code-generated Aho-Corasick in the future, this overload will allow trimming it away.
public MultiStringMatcher(ReadOnlySpan<string> strings);
public MultiStringMatcher(ReadOnlySpan<string> strings, MultiStringMatcherOptions options);
// Regexes only need the index, not the specific string that was matched.
public int Find(ReadOnlySpan<char> text);
}
/// MultiStringMatcherOptions stays the same.
If we want this to be a general-purpose API, we have to also think of the following:
- Is
System.Text
really a good fit for this? It mostly has to do with text encoding and formatting. My other thought isSystem
but it's already pretty bloated. - How about supporting UTF-8? Teddy will particularly enjoy it since twice the characters could fit in the SIMD registers, but it would make supporting case-insensitive comparisons more complicated though.
And do we actually need case-insensitivity?
Risks
The idea that we would construct an object that performs expensive initialization in source-generated code seems a bit paradoxical, given that a purpose of source generators is to avoid expensive initialization. There's no good answer to this, we have to make sure it's worth the performance benefits and tune it appropriately. It would be great if Roslyn allowed generators adding files private for themselves, but that's a big if.