[API Proposal]: Efficiently matching multiple strings.

### Background and motivation

As explained in #62447, we want to use more sophisticated string searching algorithms such as [Aho-Corasick](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) and the vectorized [Teddy](https://github.com/jneem/teddy) in regexes.

Since these algorithms are quite complex, to avoid duplicating their logic in source-generated `Regex` code, I propose to add a class that performs such searching efficiently, with the more complex pre-processing being performed at construction-time, which will essentially serve the role of an optimized `string.IndexOf(string[])`.

### API Proposal

```csharp
namespace System.Text;

public sealed class MultiStringMatcher
{
    public ReadOnlySpan<string> Strings { get; }

    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(params string[] strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, ReadOnlySpan<string> strings);
    public MultiStringMatcher(MultiStringMatcherOptions options, params string[] strings);

    public (int Index, int StringNumber) Find(ReadOnlySpan<char> text);
    public (int Index, int StringNumber) Find(string text);
}

[Flags]
public enum MultiStringMatcherOptions
{
    None = 0,
    CaseInsensitive = 1
}
```

Creating a `MultiStringMatcher` accepts an array or span of the strings it will recognize. The `Find` method accepts a string or read-only span of characters, and returns the position of the longest leftfost match of one of the strings passed to the constructor, and which of the strings it found. If it does not find anything, it will return `(-1, -1)`. These strings are also available for later inspection through the `Strings` property.

The constructors are overloaded to accept a `MultiStringMatcherOptions` enum. Currently it allows case-insensitive searching (can be easily done in Aho-Corasick by adding additional edges to the trie, but Teddy won't be used if enabled), and in the future it may be used to enable dynamically generated code for the Aho-Corasick state transitions, similar to `RegexOptions.Compiled`.

If an empty array of strings is passed to the constructors, the resulting matcher's `Find` method will always return `(-1, -1)`. If the same string appears in the constructor many times, the `Find` method could either return the index of the first, the index of the last, or throw at construction time (I'm not a fan of this option).

### API Usage

```csharp
var matcher = new MultiStringMatcher("foo", "bar", "baz");

Console.WriteLine(matcher.Find("foobar")); // Will print (0, 0)
```

### Alternative Designs

An important design decision lies on whether we want this API to **be general-purpose and usable by itself**, or to **provide only what source-generated `Regex`es need**. I imagine the API reviewers want to pursue the latter direction. In this case the API would be simplified to:

``` csharp
// The namespace changed.
namespace System.Text.RegularExpressions;

public sealed class MultiStringMatcher
{
    // It does not cost much to provide this API but could be removed as well.
    public ReadOnlySpan<string> Strings { get; }

    // We remove the user-friendly superfluous overloads.
    // If we support code-generated Aho-Corasick in the future, this overload will allow trimming it away.
    public MultiStringMatcher(ReadOnlySpan<string> strings);
    public MultiStringMatcher(ReadOnlySpan<string> strings, MultiStringMatcherOptions options);

    // Regexes only need the index, not the specific string that was matched.
    public int Find(ReadOnlySpan<char> text);
}

/// MultiStringMatcherOptions stays the same.
```

If we want this to be a general-purpose API, we have to also think of the following:

* Is `System.Text` really a good fit for this? It mostly has to do with text encoding and formatting. My other thought is `System` but it's already pretty bloated.
* How about supporting UTF-8? Teddy will particularly enjoy it since twice the characters could fit in the SIMD registers, but it would make supporting case-insensitive comparisons more complicated though.

And do we actually need case-insensitivity?

### Risks

The idea that we would construct an object that performs expensive initialization in source-generated code seems a bit paradoxical, given that a purpose of source generators is to avoid expensive initialization. There's no good answer to this, we have to make sure it's worth the performance benefits and tune it appropriately. It would be great if Roslyn allowed generators adding files private for themselves, but that's a big if.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[API Proposal]: Efficiently matching multiple strings. #69682

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[API Proposal]: Efficiently matching multiple strings. #69682

Description

Background and motivation

API Proposal

API Usage

Alternative Designs

Risks

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions