[API Proposal]: Alternative span/memory/string splitting API #76186

Closed
@stephentoub

Background and motivation

#934 has been a long-standing issue about Split support for spans. It's evolved into an enumerator that wraps IndexOf and Slice into a slightly tidier package. While that might still be useful, it doesn't make many of the existing use cases for Split very simple, in particular ones where the consumer knows how many split values are expected, wants to extract the Nth value, etc.

Either instead of or in addition to an enumerator (if we also want that syntax), we can offer a SplitAsRanges method that operates over ReadOnlySpan<T> and stores the resulting ranges into a caller-provided destination. That approach then also works with Span<T>, {ReadOnly}Memory<T>, and String, doesn't allocate, automates the retrieval of N values, etc.

API Proposal

namespace System;

public static class MemoryExtensions
{
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, char separator, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separator, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<char> separators, StringSplitOptions options = StringSplitOptions.None);
+     public static int SplitAnyAsRanges(this ReadOnlySpan<char> source, Span<Range> destination, ReadOnlySpan<string> separators, StringSplitOptions options = StringSplitOptions.None);
}
  • Naming: we could just call these Split{Any}, and have the "ranges" aspect of it be implicit in taking a Span<Range> parameter.
  • Argument ordering: destination, separator vs separator, destination? I went with destination, separator so that the configuration-related data (separator and options) are next to each other, but that then does differ from string.Split, where the separator is first.
  • The methods all return the number of System.Range values written into destination. A caller that wants to retrieve N segments regardless of whether there are more can stackalloc a span / allocate an array of N Range instances. A caller that wants to retrieve N segments and guarantee there are no more than that can stackalloc a span / allocate an array of N+1 Range instances, and validate that the returned count was N.
  • System.Range is unmanaged and can be stackalloc'd.
  • The stored Range instances can be used to slice the original span/memory/string etc. to extract only those values that are needed, in either an allocating or non-allocating manner.
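
The N+1 validation pattern described in the bullets above can be sketched as follows. SplitAsRangesSketch is a hypothetical stand-in written with IndexOf and Slice, not the proposed implementation. It also assumes that once destination is full, the final range covers the rest of the input; the proposal does not specify that behavior.

```csharp
using System;

static class SplitSketch
{
    // Hypothetical stand-in for the proposed SplitAsRanges(char) overload.
    // Assumption: when destination fills up, splitting stops and the last
    // range covers the remainder of the input.
    public static int SplitAsRangesSketch(ReadOnlySpan<char> source, Span<Range> destination, char separator)
    {
        if (destination.IsEmpty)
        {
            return 0;
        }

        int count = 0;
        int start = 0;

        // Leave one slot free for the final segment.
        while (count < destination.Length - 1)
        {
            int index = source.Slice(start).IndexOf(separator);
            if (index < 0)
            {
                break;
            }

            destination[count++] = new Range(start, start + index);
            start += index + 1;
        }

        // The final segment runs to the end of the input.
        destination[count++] = new Range(start, source.Length);
        return count;
    }

    static void Main()
    {
        const string line = "a,bb,ccc";

        // N + 1 slots so that "exactly N = 3 segments" can be validated.
        Span<Range> parts = stackalloc Range[4];
        int found = SplitAsRangesSketch(line, parts, ',');

        Console.WriteLine(found);           // 3
        Console.WriteLine(line[parts[1]]);  // bb
    }
}
```

With only three slots, an input such as "a,bb,ccc,dddd" would also report three ranges, which is why the extra slot is needed to detect surplus segments.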

API Usage

Examples...

  1. https://github.com/dotnet/sdk/blob/2f7d6f1928526c29a417c72a9e23497359cbc76f/src/Cli/dotnet/commands/dotnet-workload/install/NetSdkMsiInstallerClient.cs#L159-L169

Instead of:

                        string[] dependentParts = dependent.Split(',');

                        if (dependentParts.Length != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependentParts[1]);

this code could be:

                        Span<Range> dependentParts = stackalloc Range[4];
                        ...
                        if (dependent.AsSpan().SplitAsRanges(dependentParts, ',') != 3)
                        {
                            Log?.LogMessage($"Skipping dependent: {dependent}");
                            continue;
                        }

                        try
                        {
                            SdkFeatureBand dependentFeatureBand = new SdkFeatureBand(dependent[dependentParts[1]]);
  2. https://github.com/dotnet/iot/blob/a914669b6b5928d246c0f88486fecc72867bcc76/src/devices/Card/Ndef/Record/GeoRecord.cs#L91-L105

Instead of:

            var strLatLong = Uri.Substring(4).Split(',');
            if (strLatLong.Length != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = Convert.ToDouble(strLatLong[0], CultureInfo.InvariantCulture);
                _longitude = Convert.ToDouble(strLatLong[1], CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

this could be:

            Span<Range> strLatLong = stackalloc Range[3];
            ReadOnlySpan<char> span = Uri.AsSpan(4);
            if (span.SplitAsRanges(strLatLong, ',') != 2)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }

            try
            {
                _latitude = double.Parse(span[strLatLong[0]], provider: CultureInfo.InvariantCulture);
                _longitude = double.Parse(span[strLatLong[1]], provider: CultureInfo.InvariantCulture);
            }
            catch (Exception ex) when (ex is FormatException || ex is OverflowException)
            {
                throw new ArgumentException($"Record is not a valid {nameof(GeoRecord)}, can't find a proper latitude and longitude in the payload");
            }
  3. Extracting the time zone ID (third field) from a tab-separated line:

Instead of:

                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        if (!string.IsNullOrEmpty(zoneTabFileLine) && zoneTabFileLine[0] != '#')
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"

                            int firstTabIndex = zoneTabFileLine.IndexOf('\t');
                            if (firstTabIndex >= 0)
                            {
                                int secondTabIndex = zoneTabFileLine.IndexOf('\t', firstTabIndex + 1);
                                if (secondTabIndex >= 0)
                                {
                                    string timeZoneId;
                                    int startIndex = secondTabIndex + 1;
                                    int thirdTabIndex = zoneTabFileLine.IndexOf('\t', startIndex);
                                    if (thirdTabIndex >= 0)
                                    {
                                        int length = thirdTabIndex - startIndex;
                                        timeZoneId = zoneTabFileLine.Substring(startIndex, length);
                                    }
                                    else
                                    {
                                        timeZoneId = zoneTabFileLine.Substring(startIndex);
                                    }

                                    if (!string.IsNullOrEmpty(timeZoneId))
                                    {
                                        timeZoneIds.Add(timeZoneId);
                                    }
                                }
                            }
                        }
                    }

this could be:

                    Span<Range> ranges = stackalloc Range[4];
                    while ((zoneTabFileLine = sr.ReadLine()) != null)
                    {
                        // skip comment lines; empty lines yield fewer than 3 segments and are rejected below
                        if (!zoneTabFileLine.StartsWith('#'))
                        {
                            // the format of the line is "country-code \t coordinates \t TimeZone Id \t comments"
                            int found = zoneTabFileLine.AsSpan().SplitAsRanges(ranges, '\t');
                            if (found >= 3)
                            {
                                // ranges[2] is the third field, the TimeZone Id
                                string timeZoneId = zoneTabFileLine[ranges[2]];
                                if (timeZoneId.Length != 0)
                                {
                                    timeZoneIds.Add(timeZoneId);
                                }
                            }
                        }
                    }

Alternative Designs

#934 (comment)

Risks

No response
