Skip to content

Conversation

amanasifkhalid
Copy link
Contributor

@amanasifkhalid amanasifkhalid commented Mar 25, 2025

Part of #107749. The latter half of the JIT frontend contains numerous optimizations that rely on block weights to compute the profitability of their transformations, and may benefit from more consistent profile data. Morph and flow opts frequently redirect flow out of blocks and into other ones, and the corresponding tweaks to the profile are usually too local to reduce/increase flow along affected paths. This gives downstream phases inaccurate ideas about which blocks are newly cold/hot, and other profile transformations (such as fgExpandRarelyRunBlocks) can further propagate inaccuracies. Thus, re-running synthesis shortly after morph seems like an opportune and cheap place to fix the profile. Like the late profile synthesis run, we aren't interested in changing edge likelihoods here -- we just want to propagate changes in block weights through the flowgraph.

I also included some dead code cleanup I meant to do in a previous PR.

@Copilot Copilot AI review requested due to automatic review settings March 25, 2025 21:00
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review any files in this pull request.

Files not reviewed (3)
  • src/coreclr/jit/compiler.cpp: Language not supported
  • src/coreclr/jit/compiler.h: Language not supported
  • src/coreclr/jit/fgbasic.cpp: Language not supported

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 25, 2025
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@amanasifkhalid
Copy link
Contributor Author

Collections are a bit impoverished at the moment; I'll wait for the next collection run

@amanasifkhalid
Copy link
Contributor Author

Looking at aspnet, I see large size increases and decreases due to diffs in loop cloning. For example, here's the diffed loop stats for Microsoft.EntityFrameworkCore.Metadata.Conventions.Internal.ConventionDispatcher+ImmediateConventionScope:OnEntityTypeAdded(Microsoft.EntityFrameworkCore.Metadata.Builders.IConventionEntityTypeBuilder):Microsoft.EntityFrameworkCore.Metadata.Builders.IConventionEntityTypeBuilder:this:

LoopsFoundDuringOpts : 2
LoopsInverted : 0
- LoopsCloned : 1
+ LoopsCloned : 0
LoopsUnrolled : 0

And for System.Threading.SpinWait:SpinUntil(System.Func`1[ubyte],int):ubyte:

LoopsFoundDuringOpts : 1
LoopsInverted : 0
- LoopsCloned : 1
+ LoopsCloned : 0
LoopsUnrolled : 0

There are plenty of instances where we clone more as well, like in System.Linq.Enumerable:TryGetSingle[System.__Canon](System.Collections.Generic.IEnumerable`1[System.__Canon],System.Func`2[System.__Canon,ubyte],byref):System.__Canon:

LoopsFoundDuringOpts : 1
LoopsInverted : 2
- LoopsCloned : 0
+ LoopsCloned : 1
LoopsUnrolled : 0

(Note that lexical loop inversion found more loop candidates than the graph-based approach did.) Also System.Collections.Immutable.ImmutableSortedDictionary`2+Node[int,System.ValueTuple`2[System.Nullable`1[int],System.Nullable`1[int]]]:Search(int,System.Collections.Generic.IComparer`1[int]):System.Collections.Immutable.ImmutableSortedDictionary`2+Node[int,System.ValueTuple`2[System.Nullable`1[int],System.Nullable`1[int]]]:this:

LoopsFoundDuringOpts : 1
LoopsInverted : 0
- LoopsCloned : 0
+ LoopsCloned : 1
LoopsUnrolled : 0

Multiple collections have duplicate contexts with diffs in cloning, thus inflating the size diffs in both directions. We also have smaller diffs due to IV opts and CSEs being newly (un)profitable. For example, in System.RuntimeType+RuntimeTypeCache+MemberInfoCache`1[System.__Canon]:MergeWithGlobalListInOrder(System.__Canon[]):this:

- LoopsIVWidened : 1
- WidenedIVs : 1
- UnusedIVsRemoved : 0
- CseCount : 1
+ LoopsIVWidened : 2
+ WidenedIVs : 2
+ UnusedIVsRemoved : 1
+ CseCount : 3

Because synthesis does not mark handler blocks as rarely run, I'm seeing optimizations that use BasicBlock::isRunRarely as a cutoff -- like TLS expansion -- kicking in more often. I'm also seeing fgOptimizeBranch kicking in more/less often due to changes in profitability. All of this seems expected, and we aren't allowing synthesis to change edge likelihoods here, so this pass isn't destructive to the profile.

@amanasifkhalid
Copy link
Contributor Author

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs show a TP cost of up to 0.5% (some of this is probably from some opts like loop cloning kicking in more often now). I plan to disable fgExpandRarelyRunBlocks when we have PGO data as a follow-up to this, so we can expect to get some TP back. Thanks!

@AndyAyersMS
Copy link
Member

Can you look at System.Buffers.SharedArrayPool1[System.__Canon]:Trim():ubyte:this (Tier1)` ( benchmarks-pgo) and see if the changes there look reasonable?

The GDV-based cloning heuristics are profile sensitive, and I'd like to be sure we think the new decision here is the right one.


// Re-establish profile consistency, now that inlining and morph have run.
//
DoPhase(this, PHASE_REPAIR_PROFILE, &Compiler::fgRepairProfile);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally use unique phase IDs for each phase, so can you add a new one for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@amanasifkhalid
Copy link
Contributor Author

amanasifkhalid commented Mar 27, 2025

Can you look at

Sure, thanks for pointing this particular example out: I noticed that for the post-importation synthesis calls, if the profile is still inconsistent after recomputing the block weights, we defer the post-phase checks. The importer has a check for re-enabling these checks under the assumption that bad IL can create flow that trips up synthesis, so any code that passes importation should have a reconcilable profile. If later synthesis calls fail to make the profile consistent, the deferred checks are never re-enabled since that's an importer-specific quirk. It turns out the later profile repair phases have been quietly failing to make the profile consistent in several cases, including the above method. In particular, the flow into a call-finally pair doesn't always make it out, and this loss of flow is then propagated to downstream blocks. This might be why we're no longer finding loop cloning profitable in so many cases.

Let me try fixing this in another PR first, since this will affect the pre-layout profile repair phase too.

@amanasifkhalid
Copy link
Contributor Author

amanasifkhalid commented Mar 28, 2025

Let me try fixing this in another PR first

#114016

@amanasifkhalid
Copy link
Contributor Author

/azp run runtime-coreclr libraries-pgo

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@amanasifkhalid
Copy link
Contributor Author

libraries-pgo dead-lettered on linux-x64. Otherwise, everything is clean.

@amanasifkhalid
Copy link
Contributor Author

Diffs are still big, though the TP impact is lessened somewhat by synthesis being able to reuse the flowgraph annotations from previous phases.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants