Skip to content

[API Proposal]: Efficiently matching multiple strings. #69682

Closed as not planned
Closed as not planned
@teo-tsirpanis

Description

@teo-tsirpanis

Background and motivation

As explained in #62447, we want to use more sophisticated string searching algorithms such as Aho-Corasick and the vectorized Teddy in regexes.

Since these algorithms are quite complex, to avoid duplicating their logic in source-generated Regex code, I propose to add a class that performs such searching efficiently, with the more complex pre-processing being performed at construction-time, which will essentially serve the role of an optimized string.IndexOf(string[]).

API Proposal

namespace System.Text;

public sealed class MultiStringMatcher
{
    public ReadOnlySpan<string> Strings { get; }

    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(params string[] strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, ReadOnlySpan<string> strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, params string[] strings);

    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
    public (int Index, int StringNumber) Find(string text);
}

[Flags]
public enum MultiStringMatcherOptions
{
    None = 0,
    CaseInsensitive = 1
}

Creating a MultiStringMatcher accepts an array or span of the strings it will recognize. The Find method accepts a string or read-only span of characters, and returns the position of the longest leftfost match of one of the strings passed to the constructor, and which of the strings it found. If it does not find anything, it will return (-1, -1). These strings are also available for later inspection through the Strings property.

The constructors are overloaded to accept a MultiStringMatcherOptions enum. Currently it allows case-insensitive searching (can be easily done in Aho-Corasick by adding additional edges to the trie, but Teddy won't be used if enabled), and in the future it may be used to enable dynamically generated code for the Aho-Corasick state transitions, similar to RegexOptions.Compiled.

If an empty array of strings is passed to the constructors, the resulting matcher's Find method will always return (-1, -1). If the same string appears in the constructor many times, the Find method could either return the index of the first, the index of the last, or throw at construction time (I'm not a fan of this option).

API Usage

var matcher = new MultiStringMatcher("foo", "bar", "baz");

Console.WriteLine(matcher.Find("foobar")); // Will print (0, 0)

Alternative Designs

An important design decision lies on whether we want this API to be general-purpose and usable by itself, or to provide only what source-generated Regexes need. I imagine the API reviewers want to pursue the latter direction. In this case the API would be simplified to:

// The namespace changed.
namespace System.Text.RegularExpressions;

public sealed class MultiStringMatcher
{
    // It does not cost much to provide this API but could be removed as well.
    public ReadOnlySpan<string> Strings { get; }

    // We remove the user-friendly superfluous overloads.
    // If we support code-generated Aho-Corasick in the future, this overload will allow trimming it away.
    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(ReadOnlySpan<string> strings, MultiStringMatcherOptions options);

    // Regexes only need the index, not the specific string that was matched.
    public int Find(ReadOnlySpan<char> text);
}

/// MultiStringMatcherOptions stays the same.

If we want this to be a general-purpose API, we have to also think of the following:

  • Is System.Text really a good fit for this? It mostly has to do with text encoding and formatting. My other thought is System but it's already pretty bloated.
  • How about supporting UTF-8? Teddy will particularly enjoy it since twice the characters could fit in the SIMD registers, but it would make supporting case-insensitive comparisons more complicated though.

And do we actually need case-insensitivity?

Risks

The idea that we would construct an object that performs expensive initialization in source-generated code seems a bit paradoxical, given that a purpose of source generators is to avoid expensive initialization. There's no good answer to this, we have to make sure it's worth the performance benefits and tune it appropriately. It would be great if Roslyn allowed generators adding files private for themselves, but that's a big if.

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions