This repository was archived by the owner on Jan 23, 2023. It is now read-only.
Add optimized UTF-8 validation and transcoding apis, hook them up to UTF8Encoding #21948
Merged
GrabYourPitchforks merged 32 commits into dotnet:master from GrabYourPitchforks:utf8_validation_apis on Apr 12, 2019
Commits (all by GrabYourPitchforks):
abd7add Add optimized UTF-8 validation and transcoding logic
21721c2 Merge remote-tracking branch 'origin/master' into utf8_validation_apis
7198253 Merge commit 'aab338efb65808acc1ff9b7f95bd3dd5f0a6a3be' into utf8_val…
effeb6e Hook up new UTF-8 logic through UTF8Encoding
4caf96a Improve perf of "is ASCII?" inner loop in UTF-8 validation.
3f939ee Remove SSE41.X64 optimization from AsciiUtility
2be465e Merge remote-tracking branch 'origin/master' into utf8_validation_apis_3
fbb7246 Merge remote-tracking branch 'origin/master' into utf8_validation_apis_3
15a361a Merge remote-tracking branch 'origin/master' into utf8_validation_apis_3
5995f81 Clarify that vector read is unaligned
c4e94df Simplify vectorized logic; remove unnecessary adjustment
5db40d7 Merge remote-tracking branch 'origin/master' into utf8_validation_apis_3
f4519f5 PR feedback: GetElement(0) -> Sse2.StoreLow
9baf5be PR feedback
2a3fe36 PR feedback: Enable SSE2 in Utf16Utility code
2d47f1b Expand masks in Utf8Utility, fix const in fallback path
356ae17 Temporarily disable failing CoreFX tests
5257614 Fix incorrect Debug.Assert statements
6cc74a9 Add comments tracking JIT workarounds.
60b8c4f Rename DWORD -> UInt32 throughout API surface
bae40fa Re-flow Utf8Utility.Helpers
eda0c04 Merge remote-tracking branch 'origin/master' into utf8_validation_apis_3
3e09d17 PR feedback: Fix typos
56fa69d PR feedback: CountNumberOfLeadingAsciiBytesFrom24BitInteger
8f2860d PR feedback: Remove redundant endianess checks
ce13100 PR feedback: Validate nint definitions
c3aa431 PR feedback: Clarify charIsNonAscii vector usage
d8e2589 PR feedback: document tempUtf8CodeUnitCountAdjustment usage
7558272 Fix compilation failure in Utf16Utility
affc24a PR feedback: Clarify 3-byte sequence processing
0adca8b Add missing check to 3-byte processing logic
e307d14 Clarify comment in 3-byte processing
108 changes: 108 additions & 0 deletions
src/System.Private.CoreLib/shared/System/Text/ASCIIUtility.Helpers.cs
@@ -0,0 +1,108 @@
// Licensed to the .NET Foundation under one or more agreements.
// The .NET Foundation licenses this file to you under the MIT license.
// See the LICENSE file in the project root for more information.

using System.Diagnostics;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics.X86;

namespace System.Text
{
    internal static partial class ASCIIUtility
    {
        /// <summary>
        /// A mask which selects only the high bit of each byte of the given <see cref="uint"/>.
        /// </summary>
        private const uint UInt32HighBitsOnlyMask = 0x80808080u;

        /// <summary>
        /// A mask which selects only the high bit of each byte of the given <see cref="ulong"/>.
        /// </summary>
        private const ulong UInt64HighBitsOnlyMask = 0x80808080_80808080ul;

        /// <summary>
        /// Returns <see langword="true"/> iff all bytes in <paramref name="value"/> are ASCII.
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static bool AllBytesInUInt32AreAscii(uint value)
        {
            // If the high bit of any byte is set, that byte is non-ASCII.

            return (value & UInt32HighBitsOnlyMask) == 0;
        }

        /// <summary>
        /// Given a DWORD which represents a four-byte buffer read in machine endianness, and which
        /// the caller has asserted contains a non-ASCII byte *somewhere* in the data, counts the
        /// number of consecutive ASCII bytes starting from the beginning of the buffer. Returns
        /// a value 0 - 3, inclusive. (The caller is responsible for ensuring that the buffer doesn't
        /// contain all-ASCII data.)
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        internal static uint CountNumberOfLeadingAsciiBytesFromUInt32WithSomeNonAsciiData(uint value)
        {
            Debug.Assert(!AllBytesInUInt32AreAscii(value), "Caller shouldn't provide an all-ASCII value.");

            // Use BMI1 directly rather than going through BitOperations. We only see a perf gain here
            // if we're able to emit a real tzcnt instruction; the software fallback used by BitOperations
            // is too slow for our purposes since we can provide our own faster, specialized software fallback.

            if (Bmi1.IsSupported)
            {
                Debug.Assert(BitConverter.IsLittleEndian);
                return Bmi1.TrailingZeroCount(value & UInt32HighBitsOnlyMask) >> 3;
            }

            // Couldn't emit tzcnt, use specialized software fallback.
            // The 'allBytesUpToNowAreAscii' DWORD uses bit twiddling to hold a 1 or a 0 depending
            // on whether all processed bytes were ASCII. Then we accumulate all of the
            // results to calculate how many consecutive ASCII bytes are present.

            value = ~value;

            if (BitConverter.IsLittleEndian)
            {
                // Read first byte
                value >>= 7;
                uint allBytesUpToNowAreAscii = value & 1;
                uint numAsciiBytes = allBytesUpToNowAreAscii;

                // Read second byte
                value >>= 8;
                allBytesUpToNowAreAscii &= value;
                numAsciiBytes += allBytesUpToNowAreAscii;

                // Read third byte
                value >>= 8;
                allBytesUpToNowAreAscii &= value;
                numAsciiBytes += allBytesUpToNowAreAscii;

                return numAsciiBytes;
            }
            else
            {
                // BinaryPrimitives.ReverseEndianness is only implemented as an intrinsic on
                // little-endian platforms, so using it in this big-endian path would be too
                // expensive. Instead we'll just change how we perform the shifts.

                // Read first byte
                value = BitOperations.RotateLeft(value, 1);
                uint allBytesUpToNowAreAscii = value & 1;
                uint numAsciiBytes = allBytesUpToNowAreAscii;

                // Read second byte
                value = BitOperations.RotateLeft(value, 8);
                allBytesUpToNowAreAscii &= value;
                numAsciiBytes += allBytesUpToNowAreAscii;

                // Read third byte
                value = BitOperations.RotateLeft(value, 8);
                allBytesUpToNowAreAscii &= value;
                numAsciiBytes += allBytesUpToNowAreAscii;

                return numAsciiBytes;
            }
        }
    }
}
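
The two counting strategies in the diff above can be compared side by side in a small standalone sketch. This is not code from the PR: the class name `LeadingAsciiSketch` and its method names are hypothetical, `BitOperations.TrailingZeroCount` stands in for the `Bmi1.TrailingZeroCount` intrinsic so the sketch runs on any hardware, and only the little-endian interpretation (lowest byte of the `uint` is the first byte of the buffer) is covered.

```csharp
using System;
using System.Numerics;

// Hypothetical illustration only; names are not taken from the PR.
public static class LeadingAsciiSketch
{
    // Selects the high bit of each of the four bytes.
    private const uint HighBitsOnlyMask = 0x80808080u;

    public static bool AllBytesAreAscii(uint value) => (value & HighBitsOnlyMask) == 0;

    // tzcnt-style approach: the index of the lowest set high bit is 7, 15, 23 or 31,
    // so dividing by 8 gives the number of ASCII bytes that precede it.
    // Assumption: BitOperations.TrailingZeroCount replaces the BMI1 intrinsic here.
    public static uint CountViaTrailingZeroCount(uint value)
    {
        if (AllBytesAreAscii(value))
            throw new ArgumentException("Value must contain a non-ASCII byte.", nameof(value));

        return (uint)BitOperations.TrailingZeroCount(value & HighBitsOnlyMask) >> 3;
    }

    // Shift-based approach, mirroring the little-endian software fallback.
    public static uint CountViaShifts(uint value)
    {
        if (AllBytesAreAscii(value))
            throw new ArgumentException("Value must contain a non-ASCII byte.", nameof(value));

        // After inversion, a set high bit means "this byte is ASCII".
        value = ~value;

        // First byte: move its (inverted) high bit into bit 0.
        value >>= 7;
        uint allBytesUpToNowAreAscii = value & 1;
        uint numAsciiBytes = allBytesUpToNowAreAscii;

        // Second byte: the flag stays 1 only if this byte is ASCII too.
        value >>= 8;
        allBytesUpToNowAreAscii &= value;
        numAsciiBytes += allBytesUpToNowAreAscii;

        // Third byte. The fourth is never inspected because the caller
        // guarantees that at least one byte is non-ASCII.
        value >>= 8;
        allBytesUpToNowAreAscii &= value;
        numAsciiBytes += allBytesUpToNowAreAscii;

        return numAsciiBytes;
    }
}
```

For instance, `0x00C34141u` represents the little-endian buffer `41 41 C3 00`, so both methods should return 2. The shift-based version never branches on individual bytes, which is presumably why it serves as the fallback when a hardware `tzcnt` is unavailable.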