Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New ASCII APIs #75012

Merged
merged 51 commits into from
Dec 21, 2022
Merged

New ASCII APIs #75012

merged 51 commits into from
Dec 21, 2022

Conversation

adamsitnik
Copy link
Member

@adamsitnik adamsitnik commented Sep 2, 2022

fixes #28230

namespace System.Text;

public static class Ascii
{   
    public static bool IsValid(ReadOnlySpan<byte> value);
    public static bool IsValid(ReadOnlySpan<char> value);

    public static bool IsValid(byte value);
    public static bool IsValid(char value);

    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToUpper(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<byte> destination, out int bytesWritten);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToLower(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus ToLower(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static OperationStatus ToUpperInPlace(Span<byte> value, out int bytesWritten);
    public static OperationStatus ToUpperInPlace(Span<char> value, out int charsWritten);

    public static OperationStatus ToLowerInPlace(Span<byte> value, out int bytesWritten);
    public static OperationStatus ToLowerInPlace(Span<char> value, out int charsWritten);

    public static OperationStatus ToUtf16(ReadOnlySpan<byte> source, Span<char> destination, out int charsWritten);
    public static OperationStatus FromUtf16(ReadOnlySpan<char> source, Span<byte> destination, out int bytesWritten);

    public static Range Trim(ReadOnlySpan<byte> value);
    public static Range Trim(ReadOnlySpan<char> value);
    public static Range TrimStart(ReadOnlySpan<byte> value);
    public static Range TrimStart(ReadOnlySpan<char> value);
    public static Range TrimEnd(ReadOnlySpan<byte> value);
    public static Range TrimEnd(ReadOnlySpan<char> value);
}

@davidfowl
Copy link
Member

cc @BrennanConroy

# Conflicts:
#	src/libraries/System.Private.CoreLib/src/System/Globalization/TextInfo.cs
#	src/libraries/System.Private.Uri/src/System/DomainNameHelper.cs
#	src/libraries/System.Private.Uri/src/System/UriHelper.cs
@adamsitnik
Copy link
Member Author

@stephentoub the PR is ready, is there any chance you could re-review it?

Comment on lines +602 to +603
private struct ToUpperConversion { }
private struct ToLowerConversion { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be used static abstract interfaces instead?

}
}
OperationStatus conversionStatus = Ascii.FromUtf16(new ReadOnlySpan<char>(pManaged, length), new Span<byte>(pNative, length), out _);
Debug.Assert(conversionStatus == OperationStatus.Done);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR, but as a potential follow-up, if it'll be common for pNative to be non-NULL and if data suggests it'll be long enough on average, it might be worth trying a fast-path that just does FromUtf16 without first doing IsValid.

@@ -188,11 +189,12 @@ private protected sealed override unsafe int GetByteCountFast(char* pChars, int

if (!(fallback is EncoderReplacementFallback replacementFallback
&& replacementFallback.MaxCharCount == 1
&& replacementFallback.DefaultString[0] <= 0x7F))
&& Ascii.IsValid(replacementFallback.DefaultString[0])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How are we thinking about char.IsAscii(char) vs Ascii.IsValid(char)?
https://source.dot.net/#System.Private.CoreLib/src/libraries/System.Private.CoreLib/src/System/Char.cs,33d30f343eda0003,references
Do we need the new char/byte methods at all? Should we have Array.IsValid(int value) instead of both of the char/byte overloads?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ascii.IsValid(char) was added mostly for completeness and having everything in once place.

Should we have Array.IsValid(int value)

I assume you meant Ascii.IsValid(int)? On the one hand it would be more flexible (users could pass integers as inputs too), on the other I am not sure about codegen differences and whether everyone knows that they can pass byte/char to int-accepting method.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ascii.IsValid(char) was added mostly for completeness and having everything in once place.

So which should we be using? Should we change all use of char.IsAscii to be Ascii.IsValid?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please no :) Personally I find char.IsAscii to be more readable.

Copy link
Member

@stephentoub stephentoub Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then should we replace all Ascii.IsValid(char) with char.IsAscii?

We "just" added char.IsAscii in .NET 6. Now in .NET 8 we're adding an identical Ascii.IsValid(char). I'm trying to rationalize why we have both. I get the consistency argument, but I don't think it's worth consistency just to have a method no one uses.

Hence my question about whether we should just have the one Ascii.IsValid(int), or maybe both IsValid(int) and IsValid(uint). Would there be performance benefits / cons to that? We do have cases today where we check {u}ints for < 0x80, e.g.

private static bool Basic(uint cp) =>
// Is it in ASCII range?
cp < 0x80;

public static bool IsAsciiCodePoint(uint value) => value <= 0x7Fu;

uint firstByte = pInputBuffer[0];
if (firstByte <= 0x7Fu)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamsitnik, opinions?

where TTo : unmanaged, IBinaryInteger<TTo>
where TCasing : struct
{
if (MemoryMarshal.AsBytes(source).Overlaps(MemoryMarshal.AsBytes(destination)))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are the casts to bytes necessary?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFROM and TTO can be different (char and byte for example) and in theory someone could do sth like this:

Assert.Throws<InvalidOperationException>(() => Ascii.ToLower(byteBuffer, MemoryMarshal.Cast<byte, char>(byteBuffer), out _));
Assert.Throws<InvalidOperationException>(() => Ascii.ToLower(byteBuffer, MemoryMarshal.Cast<byte, char>(byteBuffer).Slice(1, 3), out _));

Since everything can be casted to bytes I decided to use it and cover all possible scenarios, but to be honest I did it mostly to cover all possible unit testing scenarios rather than expecting someone to actually do it.

Comment on lines +56 to +57
uint elementValue = uint.CreateTruncating(value[start]);
if ((elementValue > 0x20) || ((TrimMask & (1u << ((int)elementValue - 1))) == 0))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could make this branch-free with:

        uint c = (ushort)(uint.CreateTruncating(value) - '\t');
        if ((int)((0xF8000100U << (short)c) & (c - 32)) >= 0)

See

// Next, handle sets where the high - low + 1 range is <= 32. In that case, we can emit
// a branchless lookup in a uint that does not rely on loading any objects (e.g. the string-based
// lookup we use later). This nicely handles common sets like [\t\r\n ].
if (analysis.OnlyRanges && (analysis.UpperBoundExclusiveIfOnlyRanges - analysis.LowerBoundInclusiveIfOnlyRanges) <= 32)
{
additionalDeclarations.Add("uint charMinusLowUInt32;");
// Create the 32-bit value with 1s at indices corresponding to every character in the set,
// where the bit is computed to be the char value minus the lower bound starting from
// most significant bit downwards.
bool negatedClass = RegexCharClass.IsNegated(charClass);
uint bitmap = 0;
for (int i = analysis.LowerBoundInclusiveIfOnlyRanges; i < analysis.UpperBoundExclusiveIfOnlyRanges; i++)
{
if (RegexCharClass.CharInClass((char)i, charClass) ^ negatedClass)
{
bitmap |= 1u << (31 - (i - analysis.LowerBoundInclusiveIfOnlyRanges));
}
}
// To determine whether a character is in the set, we subtract the lowest char; this subtraction happens before the result is
// zero-extended to uint, meaning that `charMinusLowUInt32` will always have upper 16 bits equal to 0.
// We then left shift the constant with this offset, and apply a bitmask that has the highest
// bit set (the sign bit) if and only if `chExpr` is in the [low, low + 32) range.
// Then we only need to check whether this final result is less than 0: this will only be
// the case if both `charMinusLowUInt32` was in fact the index of a set bit in the constant, and also
// `chExpr` was in the allowed range (this ensures that false positive bit shifts are ignored).
negate ^= negatedClass;
return $"((int)((0x{bitmap:X}U << (short)(charMinusLowUInt32 = (ushort)({chExpr} - {Literal((char)analysis.LowerBoundInclusiveIfOnlyRanges)}))) & (charMinusLowUInt32 - 32)) {(negate ? ">=" : "<")} 0)";
}
for an explanation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible I would prefer to avoid adding further optimizations now, merge the API, add benchmarks to dotnet/performance and then let others tune it.

Copy link
Member

@jeffhandley jeffhandley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this mostly through the lens of how the new methods are used--each usage is a nice improvement in maintainability. I also appreciate the copious comments through these implementations and tests.

@dakersnar
Copy link
Contributor

dakersnar commented Jan 5, 2023

@lewing
Copy link
Member

lewing commented Jan 10, 2023

mono interpreter regression dotnet/perf-autofiling-issues#11147
there is likely a wasm regression as well but it won't show up until the benchmarks are fixed there

@EgorBo
Copy link
Member

EgorBo commented Jan 31, 2023

It seems this PR still has regressions opened, e.g. these 5 (we don't have a lot of coverage for ToLower so only 5) regressed if you open "full history" tab: dotnet/perf-autofiling-issues#10226

image

My guess that it's because the SIMD part to change case was a bit more efficient in #78262, e.g.:

static Vector128<sbyte> ToLower(Vector128<sbyte> src)
{
    var lowInd = Vector128.Create((sbyte)63) + src;
    var combInd = Vector128.LessThan(Vector128.Create((sbyte)-103), lowInd);
    return Vector128.AndNot(Vector128.Create((sbyte)0x20), combInd) + src;
}

@adamsitnik
Copy link
Member Author

@EgorBo we have an issue that is tracking it: #80245

My guess that it's because the SIMD part to change case was a bit more efficient in

The new methods support 4 different combinations of input and output: byte->byte, byte->char, char->char and char->byte. The generic code is most likely not as optimal as it was for char->char implementation we had:

TFrom SourceSignedMinValue = TFrom.CreateTruncating(1 << (8 * sizeof(TFrom) - 1));
Vector128<TFrom> subtractionVector = Vector128.Create(conversionIsToUpper ? (SourceSignedMinValue + TFrom.CreateTruncating('a')) : (SourceSignedMinValue + TFrom.CreateTruncating('A')));
Vector128<TFrom> comparisionVector = Vector128.Create(SourceSignedMinValue + TFrom.CreateTruncating(26 /* A..Z or a..z */));
Vector128<TFrom> caseConversionVector = Vector128.Create(TFrom.CreateTruncating(0x20)); // works both directions
Vector128<TFrom> matches = SignedLessThan((srcVector - subtractionVector), comparisionVector);
srcVector ^= (matches & caseConversionVector);
// Now write to the destination.
ChangeWidthAndWriteTo(srcVector, pDest, 0);

@ghost ghost locked as resolved and limited conversation to collaborators Mar 2, 2023
@jeffhandley jeffhandley added the blog-candidate Completed PRs that are candidate topics for blog post coverage label Mar 14, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Buffers blog-candidate Completed PRs that are candidate topics for blog post coverage new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create APIs to deal with processing ASCII text (as bytes)