Consider Span<char> overloads on Regex classes #23602

Closed
@danmoseley

Description

edit by @ViktorHofer, moved initial post down.

Spanifying Regex (a) removes unnecessary string allocations that tend to decrease perf and (b) allows different types of Memory to be processed.
API proposal and implementation: ViktorHofer/corefx#1

Proposed APIs

This diff contains the Span/Memory overloads and the MatchEvaluator overloads. See the discussion points below on whether we should introduce new ref struct types for Match, Group & Capture.

namespace System.Text.RegularExpressions
{
    public partial class Capture
    {
        internal Capture() { }
        public int Index { get { throw null; } }
        public int Length { get { throw null; } }
        public string Value { get { throw null; } }
+       public ReadOnlySpan<char> ValueSpan { get { throw null; } }
        public override string ToString() { throw null; }
    }
    public partial class Match : System.Text.RegularExpressions.Group
    {
        internal Match() { }
        public static System.Text.RegularExpressions.Match Empty { get { throw null; } }
        public virtual System.Text.RegularExpressions.GroupCollection Groups { get { throw null; } }
        public System.Text.RegularExpressions.Match NextMatch() { throw null; }
        public virtual string Result(string replacement) { throw null; }
+       public virtual bool TryResult(string replacement, Span<char> destination, out int charsWritten) { throw null; }
        public static System.Text.RegularExpressions.Match Synchronized(System.Text.RegularExpressions.Match inner) { throw null; }
    }
    public partial class Regex : System.Runtime.Serialization.ISerializable
    {
+       public ref struct SplitEnumerator
+       {
+           public ReadOnlySpan<char> Current { get { throw null; } }
+           public SplitEnumerator GetEnumerator() { throw null; }
+           public bool MoveNext() { throw null; }
+       }
        protected internal System.Collections.Hashtable caps;
        protected internal System.Collections.Hashtable capnames;
        protected internal int capsize;
        protected internal string[] capslist;
        protected internal System.Text.RegularExpressions.RegexRunnerFactory factory;
        public static readonly System.TimeSpan InfiniteMatchTimeout;
        protected internal System.TimeSpan internalMatchTimeout;
        protected internal string pattern;
        protected internal System.Text.RegularExpressions.RegexOptions roptions;
        protected Regex() { }
        protected Regex(System.Runtime.Serialization.SerializationInfo info, System.Runtime.Serialization.StreamingContext context) { }
        public Regex(string pattern) { }
        public Regex(string pattern, System.Text.RegularExpressions.RegexOptions options) { }
        public Regex(string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { }
        public static int CacheSize { get { throw null; } set { } }
        [System.CLSCompliant(false)]
        protected System.Collections.IDictionary Caps { get { throw null; } set { } }
        [System.CLSCompliant(false)]
        protected System.Collections.IDictionary CapNames { get { throw null; } set { } }
        public System.TimeSpan MatchTimeout { get { throw null; } }
        public System.Text.RegularExpressions.RegexOptions Options { get { throw null; } }
        public bool RightToLeft { get { throw null; } }
        public static string Escape(string str) { throw null; }
+       public static bool TryEscape(ReadOnlySpan<char> str, Span<char> destination, out int charsWritten) { throw null; }
        public string[] GetGroupNames() { throw null; }
        public int[] GetGroupNumbers() { throw null; }
        public string GroupNameFromNumber(int i) { throw null; }
        public int GroupNumberFromName(string name) { throw null; }
        protected void InitializeReferences() { }
        public bool IsMatch(string input) { throw null; }
+       public bool IsMatch(ReadOnlySpan<char> input) { throw null; }
        public bool IsMatch(string input, int startat) { throw null; }
        public static bool IsMatch(string input, string pattern) { throw null; }
        public static bool IsMatch(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static bool IsMatch(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static bool IsMatch(ReadOnlySpan<char> input, string pattern, System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public System.Text.RegularExpressions.Match Match(string input) { throw null; }
+       public System.Text.RegularExpressions.Match Match(ReadOnlyMemory<char> input) { throw null; }
        public System.Text.RegularExpressions.Match Match(string input, int startat) { throw null; }
        public System.Text.RegularExpressions.Match Match(string input, int beginning, int length) { throw null; }
        public static System.Text.RegularExpressions.Match Match(string input, string pattern) { throw null; }
        public static System.Text.RegularExpressions.Match Match(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static System.Text.RegularExpressions.Match Match(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static System.Text.RegularExpressions.Match Match(ReadOnlyMemory<char> input, string pattern, System.Text.RegularExpressions.RegexOptions options = RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public System.Text.RegularExpressions.MatchCollection Matches(string input) { throw null; }
+       public System.Text.RegularExpressions.MatchCollection Matches(ReadOnlyMemory<char> input) { throw null; }
        public System.Text.RegularExpressions.MatchCollection Matches(string input, int startat) { throw null; }
        public static System.Text.RegularExpressions.MatchCollection Matches(string input, string pattern) { throw null; }
        public static System.Text.RegularExpressions.MatchCollection Matches(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static System.Text.RegularExpressions.MatchCollection Matches(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static System.Text.RegularExpressions.MatchCollection Matches(ReadOnlyMemory<char> input, string pattern, System.Text.RegularExpressions.RegexOptions options = RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public string Replace(string input, string replacement) { throw null; }
        public string Replace(string input, string replacement, int count) { throw null; }
+       public bool TryReplace(ReadOnlySpan<char> input, string replacement, Span<char> destination, out int charsWritten, int count = -1) { throw null; }
        public string Replace(string input, string replacement, int count, int startat) { throw null; }
        public static string Replace(string input, string pattern, string replacement) { throw null; }
        public static string Replace(string input, string pattern, string replacement, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static string Replace(string input, string pattern, string replacement, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static bool TryReplace(ReadOnlySpan<char> input, string pattern, string replacement, Span<char> destination, out int charsWritten, System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public static string Replace(string input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator) { throw null; }
        public static string Replace(string input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static string Replace(string input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static bool TryReplace(ReadOnlySpan<char> input, string pattern, System.Text.RegularExpressions.MatchEvaluator evaluator, Span<char> destination, out int charsWritten, System.Text.RegularExpressions.RegexOptions options = System.Text.RegularExpressions.RegexOptions.None, System.TimeSpan? matchTimeout = null) { throw null; }
        public string Replace(string input, System.Text.RegularExpressions.MatchEvaluator evaluator) { throw null; }
        public string Replace(string input, System.Text.RegularExpressions.MatchEvaluator evaluator, int count) { throw null; }
+       public bool TryReplace(ReadOnlySpan<char> input, System.Text.RegularExpressions.MatchEvaluator evaluator, Span<char> destination, out int charsWritten, int count = -1) { throw null; }
        public string Replace(string input, System.Text.RegularExpressions.MatchEvaluator evaluator, int count, int startat) { throw null; }
        public string[] Split(string input) { throw null; }
        public string[] Split(string input, int count) { throw null; }
        public string[] Split(string input, int count, int startat) { throw null; }
+       public SplitEnumerator Split(ReadOnlySpan<char> input, int count = 0) { throw null; }
        public static string[] Split(string input, string pattern) { throw null; }
        public static string[] Split(string input, string pattern, System.Text.RegularExpressions.RegexOptions options) { throw null; }
        public static string[] Split(string input, string pattern, System.Text.RegularExpressions.RegexOptions options, System.TimeSpan matchTimeout) { throw null; }
+       public static SplitEnumerator Split(ReadOnlySpan<char> input, string pattern, RegexOptions options = RegexOptions.None, TimeSpan? matchTimeout = null) { throw null; }
        void System.Runtime.Serialization.ISerializable.GetObjectData(System.Runtime.Serialization.SerializationInfo si, System.Runtime.Serialization.StreamingContext context) { }
        public override string ToString() { throw null; }
        public static string Unescape(string str) { throw null; }
+       public static bool TryUnescape(ReadOnlySpan<char> str, Span<char> destination, out int charsWritten) { throw null; }
        protected bool UseOptionC() { throw null; }
        protected bool UseOptionR() { throw null; }
        protected internal static void ValidateMatchTimeout(System.TimeSpan matchTimeout) { }
    }
}

Discussion points

Ref struct for Match and siblings (Capture & Group).

I had a discussion with Jan offline and he pointed out that we might want to introduce a ref struct MatchValue type that is returned by APIs that take Span/Memory as an input.

The problem with just using the current class Match is that it gives you unsafe access to the Span. For example, you can send the Match object to another thread and start working on the Span there, while the current thread unwinds and frees the memory.
A ref struct MatchValue would avoid this issue.
(And it would also save the allocation.)

The issue with that is that we currently have the following hierarchy: Match --> Group --> Capture, and that Group and Match contain collections of Captures/Groups.

Yes, the flip from class to value type tends to be like this. E.g. when we introduced ValueTask to CoreFX, a bunch of parallel ValueSomething types went with it.
It is a topic for the API review discussion.
One option is to simply not have Span versions of the APIs that return these collections, or to have callbacks instead.
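
To make the lifetime argument concrete, here is a minimal sketch of what a ref struct MatchValue could look like. Only the name MatchValue comes from the discussion; the constructor and the members Index, Length and ValueSpan are assumptions made for illustration and are not part of the proposed API surface.

```csharp
using System;

// Hypothetical shape for the MatchValue type mentioned above.
// Every member here is an illustrative assumption, not proposed API.
public readonly ref struct MatchValue
{
    private readonly ReadOnlySpan<char> _input;

    public MatchValue(ReadOnlySpan<char> input, int index, int length)
    {
        _input = input;
        Index = index;
        Length = length;
    }

    public int Index { get; }
    public int Length { get; }

    // Slices the original input; no string is allocated.
    public ReadOnlySpan<char> ValueSpan => _input.Slice(Index, Length);
}

public static class MatchValueDemo
{
    public static string Describe()
    {
        var m = new MatchValue("hello world", 6, 5);

        // Because MatchValue is a ref struct, the compiler rejects attempts to
        // move it off the current stack frame, for example:
        //   object boxed = m;                                   // does not compile
        //   System.Threading.Tasks.Task.Run(() => m.Index);     // does not compile
        return m.ValueSpan.ToString();
    }
}
```

Since the value cannot be boxed, stored in a class field, or captured by a lambda, it cannot reach another thread while the underlying memory is released, which is the hazard described for the class-based Match.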

startat overload

For things like Regex.Replace, the startat argument means: copy everything up to startat to the destination Span, and then run the regular Replace overload that does not take startat on the rest. So this looks like a convenience method to me: it saves you from typing a tiny bit of code in rare cases to achieve the same effect.

Should we add these startat convenience overloads for Span as well?
If yes, this commit should be reverted: ViktorHofer/corefx@bf7d7f9
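
The "tiny bit of code" a caller would write without a startat overload could look roughly like the sketch below. The helper name TryReplaceFrom is hypothetical, and because the proposed Span-based TryReplace does not exist, the existing string-based Regex.Replace stands in for it. The sketch assumes a left-to-right pattern without anchors or lookbehind; for such patterns, matching against the sliced input can differ from passing startat.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StartAtSketch
{
    // Hypothetical helper: emulates a startat overload by copying the prefix
    // and running a startat-less replace on the remainder.
    public static bool TryReplaceFrom(
        Regex regex, ReadOnlySpan<char> input, string replacement,
        int startat, Span<char> destination, out int charsWritten)
    {
        charsWritten = 0;

        // Step 1: copy everything before startat unchanged.
        if (!input.Slice(0, startat).TryCopyTo(destination))
            return false;

        // Step 2: run the regular replace on the rest.
        // Stand-in for the proposed TryReplace: the existing string overload.
        string replaced = regex.Replace(input.Slice(startat).ToString(), replacement);
        if (!replaced.AsSpan().TryCopyTo(destination.Slice(startat)))
            return false;

        charsWritten = startat + replaced.Length;
        return true;
    }

    public static string Demo()
    {
        var regex = new Regex(@"\d");
        Span<char> buffer = stackalloc char[16];
        return TryReplaceFrom(regex, "a1b2c3", "#", 3, buffer, out int written)
            ? buffer.Slice(0, written).ToString()
            : string.Empty;
    }
}
```

For this simple pattern, Demo() produces the same string as the existing `regex.Replace("a1b2c3", "#", -1, 3)` overload.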

RegexSplitEnumerator RTL yield order

If you call the Span version of Regex.Split and pass RegexOptions.RightToLeft to it, the yield order of the enumerator will also be right to left, as we start looking for matches from right to left. The current implementation (which is not an enumerator!) reverses the captured strings before returning.
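
The difference in ordering can be illustrated with existing APIs only. The sketch below is not the proposed enumerator; LazySplitRightToLeft is a hypothetical helper that yields each piece as soon as its match is found, the way a lazy enumerator would, and compares that with the array returned by the current Regex.Split.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RtlOrderSketch
{
    // Hypothetical lazy split: yields pieces in the order matches are found.
    // With RegexOptions.RightToLeft, matching starts at the end of the input.
    public static IEnumerable<string> LazySplitRightToLeft(string input, string pattern)
    {
        var regex = new Regex(pattern, RegexOptions.RightToLeft);
        int end = input.Length;
        for (Match m = regex.Match(input); m.Success; m = m.NextMatch())
        {
            int pieceStart = m.Index + m.Length;
            yield return input.Substring(pieceStart, end - pieceStart);
            end = m.Index;
        }
        yield return input.Substring(0, end);
    }

    // Existing API: collects all pieces and reverses them before returning.
    public static string EagerOrder() =>
        string.Join("|", Regex.Split("a,b,c", ",", RegexOptions.RightToLeft));

    public static string LazyOrder() =>
        string.Join("|", LazySplitRightToLeft("a,b,c", ","));
}
```

Here EagerOrder() gives "a|b|c" while LazyOrder() gives "c|b|a". A Span-based enumerator cannot buffer and reverse without giving up its laziness, so callers would observe the second order.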

RegexSplitEnumerator GetEnumerator (see ref diff)

I'm not aware of any other cases in the BCL where we have a GetEnumerator method like this on a struct enumerator. I understand you want it to be able to directly foreach the results without introducing an enumerable struct to serve as the return type, but I'm not sure this is a pattern we want to introduce. You should be sure to highlight this as part of any API review discussion.
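
For reviewers unfamiliar with the pattern in question, the following is a minimal sketch of a ref struct enumerator that returns itself from GetEnumerator so it can be used directly in foreach. CharSplitEnumerator is a hypothetical stand-in that splits on a single character instead of a regex; it only illustrates the shape of the proposed SplitEnumerator, not its implementation.

```csharp
using System;

// Hypothetical stand-in for the proposed Regex.SplitEnumerator.
public ref struct CharSplitEnumerator
{
    private ReadOnlySpan<char> _remaining;
    private readonly char _separator;
    private bool _done;

    public CharSplitEnumerator(ReadOnlySpan<char> input, char separator)
    {
        _remaining = input;
        _separator = separator;
        _done = false;
        Current = default;
    }

    public ReadOnlySpan<char> Current { get; private set; }

    // The pattern under discussion: the enumerator doubles as the enumerable.
    public CharSplitEnumerator GetEnumerator() => this;

    public bool MoveNext()
    {
        if (_done) return false;

        int i = _remaining.IndexOf(_separator);
        if (i < 0)
        {
            // Last piece: everything that is left.
            Current = _remaining;
            _remaining = default;
            _done = true;
        }
        else
        {
            Current = _remaining.Slice(0, i);
            _remaining = _remaining.Slice(i + 1);
        }
        return true;
    }
}

public static class SplitEnumeratorDemo
{
    public static int TotalLength(string input)
    {
        int total = 0;
        // foreach binds to GetEnumerator/MoveNext/Current by pattern;
        // no interface and no heap allocation are involved.
        foreach (ReadOnlySpan<char> part in new CharSplitEnumerator(input, ','))
            total += part.Length;
        return total;
    }
}
```

The alternative raised in the review comment would be a separate enumerable ref struct as the return type of Split, whose GetEnumerator returns the enumerator; that costs one more public type but keeps the two roles apart.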
