Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorized common String.Split() paths #38001
Vectorized common String.Split() paths #38001
Changes from 8 commits
52ad5ac
10417a3
140c2ce
58918a1
be469ff
1969c67
79b32be
44db429
995719b
65cba0c
b437551
0cdcfd7
3d2a645
d9a8640
dc8926e
1b73a5d
2b0b8f8
cc4ae1f
11c968b
707871a
19e57f0
82329c6
7d1fe48
aa7454a
e0f8337
14eaf11
10da409
d55cf92
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought on eliminating these branches (didn't validate it, don't know if it's worth it).
Instead of checking
v2 is Vector
in each iteration, one could setv2
to a dummy-value, so that iifc2
is null the result ofAvx2.Or(Avx2.CompareEqual(charVector, vecSep2), cmp)
won't be changed.A such dummy-value is
Vector256<byte>.Zero
.With this value it won't work if the input is all zero though, but otherwise the two branches (in the loop) are eliminated resulting in smaller loop-bodies.
For
v3
analogous.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good idea! Instead of setting it to
Vector256<byte>.Zero
I could try with the inverse of the other separator char, this way it would even work whenc == '\0'
. I'll give it a benchmark and see if it makes a difference!There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, upon benchmarking this seems to be significantly slower (-20%) in cases, where there is only a single separator char (which I feel like is probably the most common scenario). I'm leaving this open for now if someone has some other ideas about eliminating the branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I can't believe this. Can you show the code?
The branch-predictor does a good job, but such an decrease with branchless operations that can be executed in parallel (instruction level parallelism) is not what I expected.
I believe so too. Maybe it is worth it to special case this (when the branchless variant proves to be definitely slower 😉).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I am misunderstanding something, were you suggesting the following:
to
where vecSep2/3 are defaulted to something that wouldn't interfere with the existing cmp value if only one separator was defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite good (expecte the stack-spills I'm seeing / investigating).
If
\0
isn't a concern, so the default values of the argsc{2,3}
could be set to\0
to avoid the branches at creating the vectors. This could a have a positive effect on smaller inputs.With your idea to setting it to
v1
you're on the safe side though.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having nullable arguments, just set the dummy-values in
runtime/src/libraries/System.Private.CoreLib/src/System/String.Manipulation.cs
Lines 1577 to 1578 in 140c2ce
as there it's known. You know what I mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in calling it like
MakeSeparatorListVectorized(ref sepListBuilder, sep0, sep0, sep0)
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep.
Then the checks in
MakeSeparatorListVectorized
fornull
(default arg) can be eliminated.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done 2b0b8f8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This causes the vectors to be spilled to the stack.
asm from loop
It's not exactely the loop-code from this PR, it's with some minor modifications from my attempts to remove the spills. The code from the PR has the same spills, anyway.
I don't understand why the JIT spills the vectors. They could (and should) be kept in the registers, as the
ValueListBuilder<int>
doesn't need any registers. Or is this from the builder's internal usage of span and zeroing behavior, but even if so, the JIT shouldn't spill the vectors...Things I tried to tweak the JIT into not spilling:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it matter though? The vectors don't seem to be used after L1687 anyways?
EDIT: Or is this about shuffleConstant, v1, v2, v3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No.
Perf-wise: yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up removing the need for PEXT and just iterated over each potential index within the extracted vector. For some reason I've been seeing some decent performance gains, even in scenarios with low separator frequency (which to me would point to perf regressions over PEXT). Could this inadvertently have fixed the stack-spills? Either way I can't quite understand the perf uplift.