Conversation
@jkotas PTAL
src/mscorlib/src/System/Buffer.cs
Outdated
PInvoke:
#if AMD64 && !PLATFORM_UNIX
FCopy:
_MemoryCopy(dest, src, len);
This should go through PInvoke like before so that it does not starve GC for large copies.
Sure, will do. Buffer::BlockCopy (coreclr/src/vm/comutilnative.cpp, line 1438 in da44463):
JIT_MemCpy(dstPtr, srcPtr, count);
Yes, we have places like the use of memcpy in Buffer::BlockCopy that can starve GC and lead to long GC pause times. We are trying to not introduce more of them, and fix the ones that are there.
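As a rough sketch of the pattern being discussed (illustrative only; CopyThreshold and _MemoryCopy appear in the diff above, but this is not the actual Buffer.cs code):

// Large copies exit through the P/Invoke: the thread switches to
// preemptive mode for the native call, so the GC can suspend it and
// proceed while the (potentially long) copy runs.
if (len > CopyThreshold)
{
    _MemoryCopy(dest, src, len);   // native memmove via P/Invoke
    return;
}
// Small copies are done inline in managed code; they finish quickly
// enough not to delay a pending GC noticeably.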
src/mscorlib/src/System/Buffer.cs
Outdated
// check can produce false positives for lengths greater than Int32.MaxInt. But, this is fine because we want to FCALL/PInvoke
// for large lengths anyways.

if ((len > CopyThreshold) || ((nuint)dest < (nuint)srcEnd))
The (nuint)dest < (nuint)srcEnd check does not look right. It means we will hit the slow path any time the destination is lower than the source.
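A concrete illustration of the false positive (made-up addresses; ulong stands in for the nuint alias used by Buffer.cs):

ulong dest = 0x1000, src = 0x2000, len = 0x10;
ulong srcEnd = src + len;           // 0x2010
bool slowPath = dest < srcEnd;      // true: a false positive, since
// [0x1000, 0x1010) and [0x2000, 0x2010) do not overlap; any
// destination below the source takes the slow path this way.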
Why is it profitable to check for the CopyThreshold here, and not later? It adds overhead for small copies.
Had re-purposed the if ((nuint)dest - (nuint)src < len) goto PInvoke overlap check from the original. Shall think of an alternative for this.
private struct Block16 { }

[StructLayout(LayoutKind.Sequential, Size = 64)]
private struct Block64 { }
Does it work well / will it work well on ARM64?
Shall special-case the usage of custom structs to AMD64 @jamesqo
x86 CoreCLR switched to RyuJIT, so there shouldn't be any differences between X86 and X64 here; both should use Block64.
I don't know what ARM64 does but it would make sense for it to use the best available method for moving a 64 bytes block, that should be the same or better than moving 4 bytes at a time.
It may be better to handle this by adding some specific define at the top of the file so it's easy to change:
#if AMD64 || X86
#define HAS_BLOCK64
#endif
BTW: Have you done the perf testing with crossgened corelib (corelib is always crossgened in shipping configs)? I do not think the 64-byte blocks will make a difference for crossgened corelib because it is limited to SSE2; it should not hurt, though.
@jkotas Yes, all of the results posted are with crossgened corelib
@vkvenkat Great work optimizing this. I didn't have much time to look over the code, but one thing I noticed: why are there Block16 / Block64 structs but no Block32?
Lol, I just realized you made that comment a year ago. Still asking whether the code is still there, though.
cc @jamesqo
Nice changes! I have my own PR to modify MemoryCopy that I haven't gotten around to finishing yet: #9143. If this is merged, I will gladly close mine in favor of yours, however.
src/mscorlib/src/System/Buffer.cs
Outdated
#endif // AMD64

// FCALL/PInvoke to the native version when the copy length exceeds the threshold or when the buffers are overlapping. This
// check can produce false positives for lengths greater than Int32.MaxInt. But, this is fine because we want to FCALL/PInvoke
Nit: int.MaxValue
src/mscorlib/src/System/Buffer.cs
Outdated
if (len > 64) goto MCPY05;

MCPY00:
// len > 16 && len <= 64
Nit: Could be a Debug.Assert instead of a comment (same for other comments like this).
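For example, a sketch of the suggestion (requires using System.Diagnostics):

MCPY00:
Debug.Assert(len > 16 && len <= 64);   // replaces the "// len > 16 && len <= 64" comment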
src/mscorlib/src/System/Buffer.cs
Outdated
MCPY04:
// len < 4
if ((len & len) == 0) return;
len & len is always going to be len.
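Since x & x == x for any integer, the test fires exactly when len is zero, so the equivalent, clearer form would presumably be:

MCPY04:
// len < 4
if (len == 0) return;   // same behavior as ((len & len) == 0)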
src/mscorlib/src/System/Buffer.cs
Outdated
return;

MCPY02:
if ((len & 24) == 0) goto MCPY03;
Interesting that replacing the switch table with all of these branches results in better perf.
The switch statement uses a jump table in memory, and memory is slow. Second, it pollutes the data cache.
src/mscorlib/src/System/Buffer.cs
Outdated
*(int*)(destEnd - 8) = *(int*)(srcEnd - 8);
*(int*)(destEnd - 4) = *(int*)(srcEnd - 4);
#endif
return;
There are now 2/4 writes being incurred here instead of 1/2 for len == 8. I scanned your test code and it seems to indicate that for random numbers in [0, 8] the change is a net win, but have you tried testing consistently with len == 8?
Seeing a 10% perf improvement with our version for a million copies with len == 8 on x64 Windows. It looks like eliminating the switch table led to this, despite the extra write.
src/mscorlib/src/System/Buffer.cs
Outdated
// len >= 4 && len < 8
*(int*)dest = *(int*)src;
*(int*)(destEnd - 4) = *(int*)(srcEnd - 4);
return;
Same here, we're incurring an extra write for len == 4.
Saw an 8% perf drop for len == 4 for a million copies.
src/mscorlib/src/System/Buffer.cs
Outdated
#if BIT64
*(long*)(dest + i) = *(long*)(src + i);
*(long*)(dest + i + 8) = *(long*)(src + i + 8);
*(Block64*)dest = *(Block64*)src;
SIMD is not enabled for ARM64 yet, I think. I think all of the Block* usages should be under AMD64. Actually, I believe RyuJIT is the default JIT on x86 now, so AMD64 || (BIT32 && !ARM)? /cc @jkotas
*(int*)(dest + 48) = *(int*)(src + 48);
*(int*)(dest + 52) = *(int*)(src + 52);
*(int*)(dest + 56) = *(int*)(src + 56);
*(int*)(dest + 60) = *(int*)(src + 60);
Is it beneficial to loop unroll this much? In my experiments last year I recall 64 being slightly worse than doing chunks of 48.
Yes, 64 performs better than 48. Seeing a regression with 48 for all three copy-size buckets involved here.
src/mscorlib/src/System/Buffer.cs
Outdated
n--;
if (n != 0) goto MCPY06;

len = len & 0x3f;
Nit: len &= 0x3f
Nit: Also might be better to just write %= 64 to make it more understandable for others.
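The two spellings are equivalent for non-negative lengths; a quick sketch:

// Masking with 0x3f keeps the low 6 bits, which is the remainder mod 64:
// 200 % 64 = 8, and 200 & 0x3f = 0b11001000 & 0b00111111 = 0b00001000 = 8.
len &= 0x3f;   // same result as len %= 64 here, since len is never negative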
/cc @benaadams @Anderman
test Windows_NT_X64 perf
test Ubuntu14.04 perf
*(int*)(dest + 4) = *(int*)(src + 4);
*(int*)(dest + 8) = *(int*)(src + 8);
*(int*)(dest + 12) = *(int*)(src + 12);
*(int*)dest = *(int*)src;
These should be done as long for BIT64.
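That is, something along these lines (a sketch of the suggestion, mirroring the BIT64 pattern used elsewhere in the file):

#if BIT64
*(long*)dest = *(long*)src;               // two 8-byte moves
*(long*)(dest + 8) = *(long*)(src + 8);   // cover the same 16 bytes
#else
*(int*)dest = *(int*)src;
*(int*)(dest + 4) = *(int*)(src + 4);
*(int*)(dest + 8) = *(int*)(src + 8);
*(int*)(dest + 12) = *(int*)(src + 12);
#endif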
src/mscorlib/src/System/Buffer.cs
Outdated
switch (len)
// PInvoke to the native version when the buffers are overlapping.
if ((((nuint)dest <= (nuint)src) && ((nuint)destEnd > (nuint)src)) ||
Why are you not using the original condition if ((nuint)dest - (nuint)src < len) goto PInvoke;?
src + len needs to be re-computed in quite a few places, so I created srcEnd to avoid this. Had optimized the original condition to if ((nuint)dest < (nuint)srcEnd) and used it in the PR submitted last week. Revised it in favor of a more thorough overlap check as suggested in one of your CR comments.
"Had optimized the original condition to if ((nuint)dest < (nuint)srcEnd)"

This optimization was incorrect. The original condition had a subtle dependency on unsigned integer overflow; rearranging the left and right sides broke it.

"Revised it in favor of a more thorough overlap check"

I do not think we need a more thorough overlap check here. We just need the original one that had a single compare and jump (unless there is some non-obvious reason why the one with two compares is faster...).
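For reference, a sketch of why the single compare-and-jump suffices when the copy proceeds from low to high addresses (illustrative method, not from the PR):

// If dest is above src by less than len, a forward copy would read
// source bytes after they have already been overwritten, so the copy
// must be handed to the native side. If dest is below src, dest - src
// wraps around to a huge unsigned value >= len, the check fails, and
// the forward copy proceeds, which is safe for that overlap direction.
static bool MustPInvokeForOverlap(ulong dest, ulong src, ulong len)
    => dest - src < len;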
*(int*)(dest + i + 4) = *(int*)(src + i + 4);
*(int*)(dest + i + 8) = *(int*)(src + i + 8);
*(int*)(dest + i + 12) = *(int*)(src + i + 12);
*(int*)dest = *(int*)src;
use long for BIT64
@dotnet-bot test Windows_NT x64 Release Priority 1 Build and Test please
@vkvenkat Several tests are consistently failing. Could you please take a look?
src/mscorlib/src/System/Buffer.cs
Outdated
#if BIT64
*(long*)dest = *(long*)src;
*(long*)(dest + 8) = *(long*)(src + 8);
if (len <= 48) goto MCPY01;
-- EDIT: I see that the overrun issue is handled by using 'destEnd' and 'srcEnd'.
-- Original comment:
Shouldn't this be goto MCPY02? Going to MCPY01 will result in copying 16 bytes unconditionally and returning, which will overrun the end of the dest buffer for values less than 48.
As you explained below, MCPY01 copies the last 16 bytes of src to dest, and an overrun doesn't occur here. MCPY01 uses destEnd & srcEnd as opposed to dest & src in other places.
src/mscorlib/src/System/Buffer.cs
Outdated
*(long*)dest = *(long*)src;
*(long*)(dest + 8) = *(long*)(src + 8);

MCPY01:
You should add a comment on each label describing what can be expected when we reach here:
MCPY01: // Unconditionally copy the last 16 bytes using destEnd and srcEnd and return.
We have debug asserts after labels, like Debug.Assert(len > 16 && len <= 64) after MCPY00. Do you prefer switching these to comments?
src/mscorlib/src/System/Buffer.cs
Outdated
*(int*)dest = *(int*)src;
*(int*)(dest + 4) = *(int*)(src + 4);
*(int*)(destEnd - 8) = *(int*)(srcEnd - 8);
*(int*)(destEnd - 4) = *(int*)(srcEnd - 4);
You are using an interesting trick here to prevent the normal overrun that would occur when performing all 8-byte writes when len is not a multiple of 8. You should explain how this trick works.
Also, MCPY01 and MCPY02 perform similar operations but have a very subtle difference in their input requirements.
I hope that the processor doesn't introduce a stall when len < 8 and we introduce an overlapping write.
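For context, a minimal sketch of the trick (hypothetical helper, not the actual Buffer.cs code): one write is anchored at the start and one at the end, so the writes may overlap each other but never run past the end of the buffer.

// Copies len bytes (8 <= len <= 16) with exactly two 8-byte writes.
// When len < 16 the two writes overlap in the middle, but the second
// write ends exactly at dest + len, so there is no overrun.
static unsafe void CopyTail(byte* dest, byte* src, int len)
{
    *(long*)dest = *(long*)src;
    *(long*)(dest + len - 8) = *(long*)(src + len - 8);
}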
Sure, will do.
Hmm, this trick may need a full upfront overlap check to work... that may explain the failure on Ubuntu.
@jkotas Thanks. Investigating the test failures. Shall push a fix for this.
Updated the overlap check to if (((nuint)dest - (nuint)src < len) || ((nuint)src - (nuint)dest < len)) goto PInvoke;. This should resolve the test failures.
@briansull Added comments after labels to explain copy logic.
@jkotas Addressed latest CR feedback, fixed test failures & updated results. PTAL
Looks great - thank you!
Improve span copy of pointers and structs containing pointers. Fixes #9161. PR dotnet#9786 fixes perf of span copy of types that don't contain references.
Port optimization in Buffer.Memmove from CoreCLR. This is just a copy of changes made in dotnet/coreclr#9786. * Code review feedback
I posted a comment for c6372c5 here, but I don't think that was the right place, so I apologize for repeating the remarks here... Just curious why all these improvements to [...] I know everyone wants there to be one central nexus for obtaining an optimal memmove, with proper alignment, chunking, tuning, and everything else done right; shouldn't [...]?

I also note that this commit removes all address alignment behavior. Now while it's true that the alignment code in ea9bee5 was a bit meek in only considering the destination address, at the possible expense of source alignment:

coreclr/src/mscorlib/src/System/Buffer.cs
Lines 474 to 503 in ea9bee5
...at least this was something, and a reasonably motivated approach, since misaligned store penalties are typically higher than misaligned read penalties. But in this commit there's neither. Did the experiments in #9786 empirically determine that physically aligning these operations is always a waste of time? For severe misalignment between source and target, it would be possible to write a split-access loop, albeit complex, that marshals quad-aligned reads to quad-aligned writes. Was this type of thing evaluated before taking the decision to abandon all efforts at alignment?

Finally, note that by insisting on alignment to the destination address exclusively, the alignment code (that was removed in this commit, see above) may have been ruining a naturally ideal source alignment, with the overall result of making things worse. This could have unfairly penalized test results for the existing code, when it would be a trivial matter to only do that adjustment when src/dest are jointly aligned to some appropriate level, e.g., wrap the above code with:
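A hypothetical guard of the kind being suggested (illustrative only; not the original snippet):

// Only perform the destination-alignment adjustment when src and dest
// share the same alignment modulo 8 (or whatever unit the copy loop
// uses), so that aligning one also aligns the other.
if ((((int)dest ^ (int)src) & 7) == 0)
{
    // ... the destination-alignment adjustment from ea9bee5 ...
}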
As far as I can tell, there has never been code in Buffer.cs that considers the source and destination physical alignment jointly, in order to then proceed sensibly on the basis of three broadly defined cases: [...]
Commit c6372c5 makes a change to the detection of overlap between the source and destination: the check in coreclr/src/mscorlib/src/System/Buffer.cs (line 262 in ea9bee5) becomes the one at line 271 in c6372c5.
I must admit I puzzled over the original code, since the unsigned subtraction and compare "incorrectly" fails to identify source/destination overlap in the case where the destination address precedes the source address in memory. Indeed, c6372c5 clarifies this and would also seem to fix a bug, but how did the original work anyway? Then I realized, of course, that the answer is that the original code is stealthily incorporating a strong assumption about, and dependency upon, the copy direction of the managed copy loop that follows.

The unfortunate upshot is that, while c6372c5 certainly makes the code clearer and more readable, it now excludes all overlap cases and thus no longer allows the performance benefit of fully-managed operation (i.e., non-P/Invoke) in the 50% of cases where the overlap happens to be compatible with the copy direction implemented in Buffer.cs. I blame the original code for not documenting its entirely non-obvious design and/or operation. Anyway, can we revisit this code to restore the optimization that was lost here? Namely, once again allow the managed code path to prevail (still subject to overall size considerations, etc.) instead of P/Invoke for the half of the detected overlaps which comport with the copy direction that's hard-coded into the managed implementation.
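To illustrate the half of the overlap cases being referred to (a sketch, not code from the PR): when the destination precedes the source, a low-to-high copy is correct even though the regions overlap.

// A forward (low-to-high) copy reads each source byte before any write
// can clobber it whenever dest <= src, so overlap of that orientation
// is safe without falling back to P/Invoke.
static unsafe void ForwardCopy(byte* dest, byte* src, int len)
{
    for (int i = 0; i < len; i++)
        dest[i] = src[i];
}
// e.g. shifting a buffer left by one: dest = p and src = p + 1 overlap,
// yet the forward loop produces the correct result.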
This PR optimizes Buffer.MemoryCopy for lengths up to a threshold. Performance measurements were made on a Skylake desktop with an Intel(R) Core(TM) i7-6700K CPU @ 4.00 GHz and 32 GB RAM, on Windows 10 Enterprise and Ubuntu 14.04. We used a micro-benchmark which exercises multiple buckets of pseudo-random copy sizes from 0 to 4 KB (details below).
Performance improvements with the optimization:
The charts below show the performance speedup:
The table below has absolute clock counts, in ticks, for a million copy operations:
Here's the pseudo-code for the micro-benchmark:
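A minimal sketch consistent with the description above (bucket boundaries, iteration count, and names are illustrative assumptions, not the original listing):

// Hypothetical micro-benchmark: times a million Buffer.MemoryCopy calls
// per bucket of pseudo-random copy sizes between 0 and 4 KB.
using System;
using System.Diagnostics;

unsafe class MemoryCopyBench
{
    static void Main()
    {
        var rand = new Random(12345);
        byte[] src = new byte[4096], dst = new byte[4096];
        (int lo, int hi)[] buckets = { (0, 64), (64, 256), (256, 1024), (1024, 4097) };
        fixed (byte* pSrc = src, pDst = dst)
        {
            foreach (var (lo, hi) in buckets)
            {
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < 1_000_000; i++)
                {
                    int len = rand.Next(lo, hi);   // pseudo-random size in bucket
                    Buffer.MemoryCopy(pSrc, pDst, dst.Length, len);
                }
                sw.Stop();
                Console.WriteLine($"[{lo}, {hi}): {sw.ElapsedTicks} ticks");
            }
        }
    }
}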