Skip to content

[API Proposal]: IndexOfAnyValues<string> #85573

Closed
@MihaZupan

Description

@MihaZupan

Background and motivation

Earlier in .NET 8, we added an IndexOfAnyValues<T> type that represents an immutable set of values optimized for efficient {Last}IndexOfAny{Except} searching: #68328 (comment).
You obtain an instance of IndexOfAnyValues by passing the full set of values to Create, which picks the most efficient algorithm for that set of values and capabilities of the current platform.

We've already exposed Create methods for searching for any byte or char in a given set.
This proposal would allow you to create IndexOfAnyValues<string> instances for searching for multiple substrings in a given text, instead of just individual characters.
Like with bytes and chars, Create for IndexOfAnyValues<string> can analyze the values in advance and pick the most optimal algorithm (e.g., Aho-Corasick, Rabin-Karp, Teddy, ...).

As we're working with strings, we're proposing adding a StringComparison parameter to Create.
This method would only work for Ordinal/OrdinalIgnoreCase. It's unlikely that we can meaningfully accelerate culture-aware multi-substring searching, and the proposed APIs assume that matches are of equal lengths.

Semantically, IndexOfAny would behave the same as doing an IndexOf of every value, and taking the minimum index.

While a *Last*IndexOfAny variant is possible, we are currently only proposing the left-to-right IndexOfAny.

Regex is likely to be the main consumer of this API.

cc: @stephentoub
cc: @tarekgh, @GrabYourPitchforks as we're working with strings

API Proposal

namespace System.Buffers;

public static class IndexOfAnyValues
{
    // Existing
    public static IndexOfAnyValues<byte> Create(ReadOnlySpan<byte> values);
    public static IndexOfAnyValues<char> Create(ReadOnlySpan<char> values);

    // Proposed
    public static IndexOfAnyValues<string> Create(ReadOnlySpan<string> values, StringComparison comparisonType);
}
namespace System;

public static class MemoryExtensions
{
    // Existing
    public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);
    public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);
    public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);
    public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values);

    // Proposed
    public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyValues<string> values);
}

API Usage

private static readonly IndexOfAnyValues<string> s_names = IndexOfAnyValues.Create(
    new[] { "Sherlock", "Holmes", "Watson" }, StringComparison.Ordinal);

public static int CountNames(ReadOnlySpan<char> text)
{
    int count = 0;

    while (!text.IsEmpty)
    {
        int matchOffset = text.IndexOfAny(s_names);

        if ((uint)matchOffset >= (uint)text.Length) break;

        int matchLength = text[matchOffset] == 'S' ? 8 : 6;

        text = text.Slice(matchOffset + matchLength);
        count++;
    }

    return count;
}

Alternative Designs

Emphasize that only Ordinal{IgnoreCase} works:

public static IndexOfAnyValues<string> CreateOrdinal(ReadOnlySpan<string> values, StringComparison comparisonType);
// or
public static IndexOfAnyValues<string> CreateOrdinal(ReadOnlySpan<string> values, bool ignoreCase);

An API that returns both a match offset and length (instead of just the offset).
So far we've rejected this variant as it adds a policy to IndexOfAny semantics -- are we returning the leftmost-first vs leftmost-longest match.

public static (int MatchOffset, int MatchLength) IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyValues<string> values);

Related issues: #69682, #62447

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions