Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PGO: Profile Buffer.Memmove's length #96311

Merged
merged 22 commits into from
Jan 5, 2024
Merged

Conversation

EgorBo
Copy link
Member

@EgorBo EgorBo commented Dec 25, 2023

Closes #84192

This PR adds a new generic way to profile integer values for various needs and, as an example, uses it to profile Buffer.Memmove's size in order to unroll it for the most popular size, example:

Buffer.Memmove(dst, src, len); // non-constant len

to:

if (len == 42)
    Buffer.Memmove(dst, src, 42);  // will be unrolled & inlined
else
    Buffer.Memmove(dst, src, len); // fallback

if Instrumented Tier0 says that len is usually 42.

Codegen example:

[MethodImpl(MethodImplOptions.NoInlining)]
void Test(ReadOnlySpan<int> src, Span<int> dst) => src.CopyTo(dst);

Current Tier1 codegen:

; Assembly listing for method Program.Test
       sub      rsp, 40
       mov      rax, bword ptr [rcx]
       mov      r8d, dword ptr [rcx+0x08]
       mov      rcx, bword ptr [rdx]
       mov      edx, dword ptr [rdx+0x08]
       cmp      r8d, edx
       ja       SHORT G_M34050_IG04
       shl      r8, 2
       mov      rdx, rax
       call     [System.Buffer:Memmove(byref,byref,ulong)]
       nop      
       add      rsp, 40
       ret      
G_M34050_IG04:  ;; offset=0x0029
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3 

New codegen if this method is invoked for src being 10 integers (10*4=40 bytes) according to PGO data:

; Assembly listing for method Program.Test
       sub      rsp, 40
       vzeroupper 
       mov      rax, bword ptr [rcx]
       mov      ecx, dword ptr [rcx+0x08]
       mov      r10, bword ptr [rdx]
       mov      edx, dword ptr [rdx+0x08]
       cmp      ecx, edx
       ja       SHORT G_M34050_IG06
       mov      r8d, ecx
       shl      r8, 2
       cmp      r8, 40   ;; is size == 40 (PGO-driven) ?
       jne      SHORT G_M34050_IG05
       ;;
       ;; unroll to perform 2 SIMD moves for length = 40
       vmovdqu  ymm0, ymmword ptr [rax]
       vmovdqu  xmm1, xmmword ptr [rax+0x18]
       vmovdqu  ymmword ptr [r10], ymm0
       vmovdqu  xmmword ptr [r10+0x18], xmm1
       ;;
G_M34050_IG04:
       add      rsp, 40
       ret      
G_M34050_IG05:
       ;; Fallback (size != 40)
       mov      rcx, r10
       mov      rdx, rax
       call     [System.Buffer:Memmove(byref,byref,ulong)]
       jmp      SHORT G_M34050_IG04
G_M34050_IG06:
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3

Benchmark:

int[] _dst = new int[100];

public static IEnumerable<int[]> Data()
{
    yield return new int[2];  // 8 bytes
    yield return new int[10]; // 40 bytes
    yield return new int[30]; // 120 bytes
}

[Benchmark]
[ArgumentsSource(nameof(Data))]
public void CopyTo(int[] src) => src.AsSpan().CopyTo(_dst);
Method Toolchain src Mean
CopyTo \runtime-main...\corerun.exe Int32[2] 1.4895 ns
CopyTo \runtime-PR...\corerun.exe Int32[2] 0.4267 ns
CopyTo \runtime-main...\corerun.exe Int32[10] 1.7304 ns
CopyTo \runtime-PR...\corerun.exe Int32[10] 0.5909 ns
CopyTo \runtime-main...\corerun.exe Int32[30] 2.0815 ns
CopyTo \runtime-PR...\corerun.exe Int32[30] 1.0008 ns

Sadly, for now it requires R2R=0 to work properly, because:

  1. We use optimizations for InstrumentedTier for R2R'd code (and the corelib is always R2R'd)
  2. We don't instrument inlinees yet

I have some workarounds for it in mind for future PRs.

  • SpanHelpers.SequenceEqual (to unroll fast path)
  • String.Equals (to unroll fast path)
  • a / b (divisor's value)

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 25, 2023
@ghost ghost assigned EgorBo Dec 25, 2023
@ghost
Copy link

ghost commented Dec 25, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Closes #84192

This PR adds a new generic way to profile integer values for various needs and uses it to profile Buffer.Memmove's size in order to unroll it for the most popular size, example:

[MethodImpl(MethodImplOptions.NoInlining)]
void Test(ReadOnlySpan<int> src, Span<int> dst) => src.CopyTo(dst);

Current Tier1 codegen:

; Assembly listing for method Program.Test
       sub      rsp, 40
       mov      rax, bword ptr [rcx]
       mov      r8d, dword ptr [rcx+0x08]
       mov      rcx, bword ptr [rdx]
       mov      edx, dword ptr [rdx+0x08]
       cmp      r8d, edx
       ja       SHORT G_M34050_IG04
       shl      r8, 2
       mov      rdx, rax
       call     [System.Buffer:Memmove(byref,byref,ulong)]
       nop      
       add      rsp, 40
       ret      
G_M34050_IG04:  ;; offset=0x0029
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3 

New codegen if this method is mostly invoked for src being 10 integers (10*4=40 bytes):

; Assembly listing for method Program.Test
       sub      rsp, 40
       vzeroupper 
       mov      rax, bword ptr [rcx]
       mov      ecx, dword ptr [rcx+0x08]
       mov      r10, bword ptr [rdx]
       mov      edx, dword ptr [rdx+0x08]
       cmp      ecx, edx
       ja       SHORT G_M34050_IG06
       mov      r8d, ecx
       shl      r8, 2
       cmp      r8, 40
       jne      SHORT G_M34050_IG05
       ;;
       ;; unroll to perform 2 SIMD moves for length = 40
       vmovdqu  ymm0, ymmword ptr [rax]
       vmovdqu  xmm1, xmmword ptr [rax+0x18]
       vmovdqu  ymmword ptr [r10], ymm0
       vmovdqu  xmmword ptr [r10+0x18], xmm1
G_M34050_IG04:
       add      rsp, 40
       ret      
G_M34050_IG05:
       mov      rcx, r10
       mov      rdx, rax
       call     [System.Buffer:Memmove(byref,byref,ulong)]
       jmp      SHORT G_M34050_IG04
G_M34050_IG06:
       call     [System.ThrowHelper:ThrowArgumentException_DestinationTooShort()]
       int3

Open questions/items:

  • Can we re-use handle-histogram for it like I did in this PR or should we create a new entity (and change the PGO/MIBC format)?
  • Compressed constants for Static PGO MIBC ?

The current impl is a bit fragile due to lack of context-sensitive profiling but it's not a subject to change as part of this PR. The newly added infra can be used for other things we might want to profile in tier0.

Author: EgorBo
Assignees: EgorBo
Labels:

area-CodeGen-coreclr

Milestone: -

@EgorBo

This comment was marked as resolved.

This comment was marked as resolved.

@EgorBo
Copy link
Member Author

EgorBo commented Dec 25, 2023

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo, runtime-coreclr pgostress

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@EgorBo EgorBo marked this pull request as ready for review December 26, 2023 13:05
@EgorBo
Copy link
Member Author

EgorBo commented Dec 26, 2023

PTAL @AndyAyersMS @dotnet/jit-contrib when you have time

I didn't change the PGO format (I am re-using type-histogram). I am using ssize_t - it probably can be reduced down to int16_t here, however, value probes for now are expected to be rare.

@EgorBo
Copy link
Member Author

EgorBo commented Dec 26, 2023

/azp run runtime-coreclr pgo, runtime-coreclr libraries-pgo, runtime-coreclr pgostress

Copy link

Azure Pipelines successfully started running 3 pipeline(s).

Comment on lines 53 to 56
ValueHistogramIntCount = (DescriptorMin * 8) | FourByte | AlignPointer,
ValueHistogramLongCount = (DescriptorMin * 8) | EightByte,
ValueHistogram = (DescriptorMin * 9) | EightByte,
GetLikelyValue = (DescriptorMin * 10) | EightByte,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These need proper handling in dotnet-pgo and crossgen2. Also, GetLikelyValue should either be removed or should be computed by crossgen2.

cc @davidwrighton, this is once again going to break backwards compatibility of a .mibc produced by a new dotnet-pgo and consumed by an old crossgen2. Do we need to do anything special here? Also, do we need to bump R2R minor version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jakobbotsch I've removed the compressed GetLikelyValue for now, will implement it separately.

What kind of additional handling for dotnet-pgo and crossgen2 we talk about besides the compressed one?

I've validated that dotnet-pgo works properly, e.g.:

image

(dotnet-pgo dump mibc)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you plan to implement it separately, it makes sense to add it now (and implementing the minimum format-handling stuff for it) so we don't need to update the format twice.

What kind of additional handling for dotnet-pgo and crossgen2 we talk about besides the compressed one?

Seems like PgoFormat.Merge needs to be updated at least. Looks like the encoder/decoder (EncodePgoData/ParsePgoData) might be ok since this reuses EightByte. That's the one where backwards compatibility was broken when I added MethodHandle last time, so maybe this change doesn't break backwards compatibility.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I put my request changes in to cover this sort of thing. I think you might be OK, but what you need to do is validate that the new instrumentation doesn't cause problems with old dotnet-pgo/crossgen2. If it does, we can have another discussion, but the first thing to see is if it doesn't. The MIBC format is designed to allow new things to appear, but this only works in a subset of ways that it is possible to add new content to the file, and @jakobbotsch didn't follow those restrictions with the last PGO data format change he made. (Honestly, though, that was my fault, as I'd forgotten about them.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and when you do this validation, please write down what scenarios you tested. I'll probably want to add a few weird wrinkles to it all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated dotnet-pgo (merge and dump) and SPMI to be aware of value probes but it looked like they were fine as is too (didn't crash).

also, I've done some testing that @davidwrighton requested and published my results and steps via mail

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just two minor nits on comments/dump text.

src/coreclr/jit/fgprofile.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/fgprofile.cpp Show resolved Hide resolved
@BruceForstall
Copy link
Member

You need to change the JIT-EE GUID.

Do this need SPMI changes?

Copy link
Member

@davidwrighton davidwrighton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do some verification to find out if this sort of profiling data breaks old copies of dotnet-pgo. We have customers who do not always use the latest dotnet-pgo on their data, and it would be good to be compatible.

@ghost ghost added the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 3, 2024
@ghost ghost removed the needs-author-action An issue or pull request that requires more info or actions from the author. label Jan 4, 2024
EgorBo and others added 2 commits January 4, 2024 01:36
Co-authored-by: Andy Ayers <andya@microsoft.com>
@EgorBo
Copy link
Member Author

EgorBo commented Jan 4, 2024

You need to change the JIT-EE GUID.

Do this need SPMI changes?

Yep, made changes in b06db31 and updated the guid.

@EgorBo
Copy link
Member Author

EgorBo commented Jan 5, 2024

Please do some verification to find out if this sort of profiling data breaks old copies of dotnet-pgo. We have customers who do not always use the latest dotnet-pgo on their data, and it would be good to be compatible.

Addressed internally

@EgorBo EgorBo merged commit adb4c18 into dotnet:main Jan 5, 2024
115 checks passed
@EgorBo EgorBo deleted the generic-probing-memmove branch January 5, 2024 23:30
@github-actions github-actions bot locked and limited conversation to collaborators Feb 5, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PGO for Memmove unrolling
5 participants