Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorized common String.Split() paths #38001

Merged
merged 28 commits into from
Apr 7, 2021
Merged
Changes from 8 commits
Commits
Show all changes
28 commits
Select commit Hold shift + click to select a range
52ad5ac
Vectorized String.Split()
bbartels Jun 17, 2020
10417a3
Merge branch 'master' of https://github.com/dotnet/runtime into Vecto…
bbartels Jun 17, 2020
140c2ce
Fixed variable name
bbartels Jun 17, 2020
58918a1
Update src/libraries/System.Private.CoreLib/src/System/String.Manipul…
bbartels Jun 17, 2020
be469ff
Update src/libraries/System.Private.CoreLib/src/System/String.Manipul…
bbartels Jun 17, 2020
1969c67
Merge branch 'master' of https://github.com/dotnet/runtime into Vecto…
bbartels Jun 17, 2020
79b32be
Merge branch 'VectorizedSplit' of https://github.com/bbartels/runtime…
bbartels Jun 17, 2020
44db429
Applied Review Feedback
bbartels Jun 17, 2020
995719b
Update src/libraries/System.Private.CoreLib/src/System/String.Manipul…
bbartels Jun 18, 2020
65cba0c
Applied Review Feedback
bbartels Jun 18, 2020
b437551
Merge branch 'master' of https://github.com/dotnet/runtime into Vecto…
bbartels Jun 18, 2020
0cdcfd7
Merge branch 'VectorizedSplit' of https://github.com/bbartels/runtime…
bbartels Jun 18, 2020
3d2a645
Built branchless version with help of @gfoidl
bbartels Jun 18, 2020
d9a8640
Update src/libraries/System.Private.CoreLib/src/System/String.Manipul…
bbartels Jun 18, 2020
dc8926e
Merge branch 'master' of https://github.com/dotnet/runtime into Vecto…
bbartels Jun 18, 2020
1b73a5d
Merge branch 'VectorizedSplit' of https://github.com/bbartels/runtime…
bbartels Jun 18, 2020
2b0b8f8
Removed nullable separator parameters
bbartels Jun 18, 2020
cc4ae1f
Refactored MakeSeparatorList
bbartels Jun 18, 2020
11c968b
Fixed mistakenly removed comments
bbartels Jun 18, 2020
707871a
Removed dependency on BMI2 PEXT instruction
bbartels Jun 19, 2020
19e57f0
Fixed mistaken use of Vector<ushort>.Count
bbartels Jun 19, 2020
82329c6
Lowered string.Split() vectorization dependency from Avx2 to SSE41
bbartels Jun 24, 2020
7d1fe48
Merge branch 'master' into VectorizedSplit
bbartels Jan 25, 2021
aa7454a
Added Sse.IsSupported check
bbartels Jan 25, 2021
e0f8337
Updated IsSupported check to match highest used ISA
bbartels Mar 22, 2021
14eaf11
Merge branch 'main' of https://github.com/dotnet/runtime into Vectori…
bbartels Mar 26, 2021
10da409
Merge branch 'VectorizedSplit' of https://github.com/bbartels/runtime…
bbartels Mar 26, 2021
d55cf92
Fixed possible cause for failing tests
bbartels Mar 26, 2021
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,9 @@
using System.Diagnostics;
using System.Globalization;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Text;
using Internal.Runtime.CompilerServices;

Expand Down Expand Up @@ -1573,6 +1576,13 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
// Special-case the common cases of 1, 2, and 3 separators, with manual comparisons against each separator.
case 1:
sep0 = separators[0];

if (Avx2.IsSupported && 16 <= Length)
{
MakeSeparatorListVectorized(ref sepListBuilder, sep0);
return;
}

for (int i = 0; i < Length; i++)
{
if (this[i] == sep0)
Expand All @@ -1584,6 +1594,13 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
case 2:
sep0 = separators[0];
sep1 = separators[1];

if (Avx2.IsSupported && 16 <= Length)
{
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep1);
return;
}

for (int i = 0; i < Length; i++)
{
char c = this[i];
Expand All @@ -1597,6 +1614,13 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
sep0 = separators[0];
sep1 = separators[1];
sep2 = separators[2];

if (Avx2.IsSupported && 16 <= Length)
bbartels marked this conversation as resolved.
Show resolved Hide resolved
{
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep1, sep2);
return;
}

for (int i = 0; i < Length; i++)
{
char c = this[i];
Expand Down Expand Up @@ -1630,6 +1654,72 @@ private void MakeSeparatorList(ReadOnlySpan<char> separators, ref ValueListBuild
}
}

private void MakeSeparatorListVectorized(ref ValueListBuilder<int> sepListBuilder, char c, char? c2 = null, char? c3 = null)
{
// Constant that defines indices of characters within an AVX-Register
const ulong indicesConstant = 0xFEDCBA9876543210;
bbartels marked this conversation as resolved.
Show resolved Hide resolved
ReadOnlySpan<byte> shuffleConstantData = new byte[] {
bbartels marked this conversation as resolved.
Show resolved Hide resolved
bbartels marked this conversation as resolved.
Show resolved Hide resolved
0x02, 0x06, 0x0A, 0x0E, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
0xFF, 0xFF, 0xFF, 0xFF, 0x02, 0x06, 0x0A, 0x0E, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
};
Vector256<byte> shuffleConstant = Unsafe.ReadUnaligned<Vector256<byte>>(ref MemoryMarshal.GetReference(shuffleConstantData));

Vector256<ushort> v1 = Vector256.Create(c);
Vector256<ushort>? v2 = c2 is char sep2 ? Vector256.Create(sep2) : (Vector256<ushort>?)null;
Vector256<ushort>? v3 = c3 is char sep3 ? Vector256.Create(sep3) : (Vector256<ushort>?)null;

ref char c0 = ref MemoryMarshal.GetReference(this.AsSpan());
int cond = Length - (Length % Vector256<ushort>.Count);
bbartels marked this conversation as resolved.
Show resolved Hide resolved
int i = 0;

for (; i < cond; i += Vector256<ushort>.Count)
{
Vector256<ushort> charVector = ReadVector(ref c0, i);
Vector256<ushort> cmp = Avx2.CompareEqual(charVector, v1);

if (v2 is Vector256<ushort> vecSep2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought on eliminating these branches (didn't validate it, don't know if it's worth it).

Instead of checking v2 is Vector in each iteration, one could set v2 to a dummy-value, so that iif c2 is null the result of Avx2.Or(Avx2.CompareEqual(charVector, vecSep2), cmp) won't be changed.

A such dummy-value is Vector256<byte>.Zero.
With this value it won't work if the input is all zero though, but otherwise the two branches (in the loop) are eliminated resulting in smaller loop-bodies.

For v3 analogous.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good idea! Instead of setting it to Vector256<byte>.Zero I could try with the inverse of the other separator char, this way it would even work when c == '\0'. I'll give it a benchmark and see if it makes a difference!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, upon benchmarking this seems to be significantly slower (-20%) in cases, where there is only a single separator char (which I feel like is probably the most common scenario). I'm leaving this open for now if someone has some other ideas about eliminating the branches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

significantly slower (-20%) in cases, where there is only a single separator char

TBH I can't believe this. Can you show the code?
The branch-predictor does a good job, but such an decrease with branchless operations that can be executed in parallel (instruction level parallelism) is not what I expected.

the most common scenario

I believe so too. Maybe it is worth it to special case this (when the branchless variant proves to be definitely slower 😉).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I am misunderstanding something, were you suggesting the following:

if (v2 is Vector256<ushort> vecSep2)
{
    cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep2), cmp);
}

if (v3 is Vector256<ushort> vecSep3)
{
    cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep3), cmp);
}

to

cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep2), cmp);
cmp = Avx2.Or(Avx2.CompareEqual(vi, vecSep3), cmp);

where vecSep2/3 are defaulted to something that wouldn't interfere with the existing cmp value if only one separator was defined?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good (expecte the stack-spills I'm seeing / investigating).

If \0 isn't a concern, so the default values of the args c{2,3} could be set to \0 to avoid the branches at creating the vectors. This could a have a positive effect on smaller inputs.

With your idea to setting it to v1 you're on the safe side though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having nullable arguments, just set the dummy-values in


as there it's known. You know what I mean?

Copy link
Contributor Author

@bbartels bbartels Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As in calling it like MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep0, sep0) ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep.

// ...
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep0, sep0);
// ...
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep1, sep0);
// ...
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep1, sep2);

Then the checks in MakeSeparatorListVectorized for null (default arg) can be eliminated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done 2b0b8f8

{
cmp = Avx2.Or(Avx2.CompareEqual(charVector, vecSep2), cmp);
}

if (v3 is Vector256<ushort> vecSep3)
{
cmp = Avx2.Or(Avx2.CompareEqual(charVector, vecSep3), cmp);
}

if (Avx.TestZ(cmp, cmp)) { continue; }

Vector256<byte> mask = Avx2.ShiftLeftLogical(cmp.AsUInt64(), 4).AsByte();
mask = Avx2.Shuffle(mask, shuffleConstant);

Vector128<byte> res = Sse2.Or(mask.GetLower(), mask.GetUpper());
ulong extractedBits = Bmi2.X64.ParallelBitExtract(indicesConstant, Sse2.X64.ConvertToUInt64(res.AsUInt64()));

while (true)
{
sepListBuilder.Append(((byte)(extractedBits & 0xF)) + i);
Copy link
Member

@gfoidl gfoidl Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This causes the vectors to be spilled to the stack.

asm from loop
G_M15376_IG06:
       8BFA                 mov      edi, edx
       C4C17E6F247F         vmovdqu  ymm4, ymmword ptr[r15+2*rdi]
       C5FD114D90           vmovupd  ymmword ptr[rbp-70H], ymm1
       C5DD75E9             vpcmpeqw ymm5, ymm4, ymm1
       C5FD119570FFFFFF     vmovupd  ymmword ptr[rbp-90H], ymm2
       C5DD75F2             vpcmpeqw ymm6, ymm4, ymm2
       C5CDEBED             vpor     ymm5, ymm6, ymm5
       C5FD119D50FFFFFF     vmovupd  ymmword ptr[rbp-B0H], ymm3
       C5DD75E3             vpcmpeqw ymm4, ymm4, ymm3
       C5DDEBED             vpor     ymm5, ymm4, ymm5
       C4E27D17ED           vptest   ymm5, ymm5
       0F84B7010000         je       G_M15376_IG22
						;; bbWeight=4    PerfScore 57.67
G_M15376_IG07:
       C5DD73F504           vpsllq   ymm4, ymm5, 4
       C5FD1145B0           vmovupd  ymmword ptr[rbp-50H], ymm0
       C4E25D00E0           vpshufb  ymm4, ymm4, ymm0
       C4E37D39E501         vextracti128 ymm5, ymm4, 1
       C5D9EBE5             vpor     xmm4, xmm4, xmm5
       C4E1F97EE1           vmovd    rcx, xmm4
       49B81032547698BADCFE mov      r8, 0xFEDCBA9876543210
       4933C8               xor      rcx, r8
						;; bbWeight=2    PerfScore 17.67
G_M15376_IG08:
       48898D40FFFFFF       mov      qword ptr [rbp-C0H], rcx
       448BC1               mov      r8d, ecx
       4183E00F             and      r8d, 15
       899548FFFFFF         mov      dword ptr [rbp-B8H], edx
       4403C2               add      r8d, edx
       4489853CFFFFFF       mov      dword ptr [rbp-C4H], r8d
       448B4B08             mov      r9d, dword ptr [rbx+8]
       4C8D5310             lea      r10, bword ptr [rbx+16]
       4C899528FFFFFF       mov      bword ptr [rbp-D8H], r10
       498BFA               mov      rdi, r10
       44898D38FFFFFF       mov      dword ptr [rbp-C8H], r9d
       443B4F08             cmp      r9d, dword ptr [rdi+8]
       7C08                 jl       SHORT G_M15376_IG10
						;; bbWeight=16    PerfScore 184.00
G_M15376_IG09:
       488BFB               mov      rdi, rbx
       E870F1FFFF           call     System.Collections.Generic.ValueListBuilder`1[Int32][System.Int32]:Grow():this
						;; bbWeight=4    PerfScore 5.00
G_M15376_IG10:
       4C8B9528FFFFFF       mov      r10, bword ptr [rbp-D8H]
       448B8D38FFFFFF       mov      r9d, dword ptr [rbp-C8H]
       453B4A08             cmp      r9d, dword ptr [r10+8]
       0F8334010000         jae      G_M15376_IG23
       498B3A               mov      rdi, bword ptr [r10]
       4D63D1               movsxd   r10, r9d
       448B853CFFFFFF       mov      r8d, dword ptr [rbp-C4H]
       46890497             mov      dword ptr [rdi+4*r10], r8d
       41FFC1               inc      r9d
       44894B08             mov      dword ptr [rbx+8], r9d
       488B8D40FFFFFF       mov      rcx, qword ptr [rbp-C0H]
       48C1E904             shr      rcx, 4
       4885C9               test     rcx, rcx
       0F85EC000000         jne      G_M15376_IG21
						;; bbWeight=16    PerfScore 236.00
G_M15376_IG11:
       8B9548FFFFFF         mov      edx, dword ptr [rbp-B8H]
       83C210               add      edx, 16
       8BB54CFFFFFF         mov      esi, dword ptr [rbp-B4H]
       3BD6                 cmp      edx, esi
       89B54CFFFFFF         mov      dword ptr [rbp-B4H], esi
       C5FD1045B0           vmovupd  ymm0, ymmword ptr[rbp-50H]
       C5FD104D90           vmovupd  ymm1, ymmword ptr[rbp-70H]
       C5FD109570FFFFFF     vmovupd  ymm2, ymmword ptr[rbp-90H]
       C5FD109D50FFFFFF     vmovupd  ymm3, ymmword ptr[rbp-B0H]
       0F82D9FEFFFF         jb       G_M15376_IG06

It's not exactely the loop-code from this PR, it's with some minor modifications from my attempts to remove the spills. The code from the PR has the same spills, anyway.

I don't understand why the JIT spills the vectors. They could (and should) be kept in the registers, as the ValueListBuilder<int> doesn't need any registers. Or is this from the builder's internal usage of span and zeroing behavior, but even if so, the JIT shouldn't spill the vectors...

Things I tried to tweak the JIT into not spilling:

  • Moving this line into a non-inlineable local function didn't change anything.

Copy link
Contributor Author

@bbartels bbartels Jun 18, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter though? The vectors don't seem to be used after L1687 anyways?
EDIT: Or is this about shuffleConstant, v1, v2, v3?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or is this about shuffleConstant, v1, v2, v3?

No.

Does it matter though?

Perf-wise: yes.

  • stack-spills are (here) unnecessary memory accesses
  • the machine code will be quite a lot smaller without the spills -- especially important as it's within a loop

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing the need for PEXT and just iterated over each potential index within the extracted vector. For some reason I've been seeing some decent performance gains, even in scenarios with low separator frequency (which to me would point to perf regressions over PEXT). Could this inadvertently have fixed the stack-spills? Either way I can't quite understand the perf uplift.

extractedBits >>= 4;
if (extractedBits == 0) { break; }
}
}

for (; i < Length; i++)
{
char curr = this[i];
bbartels marked this conversation as resolved.
Show resolved Hide resolved
if (curr == c || (c2 != null && curr == c2) || (c3 != null && curr == c3))
bbartels marked this conversation as resolved.
Show resolved Hide resolved
{
sepListBuilder.Append(i);
}
}

static Vector256<ushort> ReadVector(ref char c0, int offset)
{
ref char ci = ref Unsafe.Add(ref c0, offset);
bbartels marked this conversation as resolved.
Show resolved Hide resolved
ref byte b = ref Unsafe.As<char, byte>(ref ci);
return Unsafe.ReadUnaligned<Vector256<ushort>>(ref b);
}
}

/// <summary>
/// Uses ValueListBuilder to create list that holds indexes of separators in string.
/// </summary>
Expand Down