Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JIT: Add 3-opt implementation for improving upon RPO-based layout #103450

Merged
merged 48 commits into from
Nov 4, 2024

Conversation

amanasifkhalid
Copy link
Member

@amanasifkhalid amanasifkhalid commented Jun 13, 2024

Part of #107749. After establishing a baseline layout, where hot blocks are in relative reverse post-order and cold blocks are out-of-line, run a 3-opt pass over each try region, as well as the main method body. Profitability of each partition is currently measured with edge weights only, without considering the type ([un]conditional, forward/backward, etc) of the branch. To minimize the search space of partition points, we maintain a priority queue of flow edges between non-contiguous blocks, such that we can attempt to create fallthrough for the next most-profitable edge.

To avoid disturbing EH semantics too much, we currently only reorder blocks within EH regions. This could be relaxed somewhat to at least move region entries up to their hottest predecessors, effectively moving the entire region once we re-establish contiguity -- I plan to try this in a follow-up PR. Note that we don't consider reordering handler regions beyond the initial RPO-based layout, under the assumption that the overhead of exception handling usually minimizes the potential gains from better layout.

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jun 13, 2024
Copy link
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

improvedLayout = false;
BasicBlock* const exitBlock = blockVector[blockCount - 1];

for (unsigned i = 1; i < (blockCount - 1); i++)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the root of the TP cost is here -- we want to avoid having to search for possible cut points.

One approach is to just pick randomly, but I think we can do better for now. Roughly speaking in the pass above we should find all blocks that are either not just before their optimal successor and/or not just after their optimal successor.

We can rank these by the difference in the current vs optimal score. Then greedily pick the worst, that gives the first cut point. For the second cut point you can pick the best pred for the first cut point's current next block, or the best succ for the current pred of the first cut point's ideal successor. That is, if we have

S ~~~~ 1|2 ~~~ 3|4 ~~~ 5|6 ~~~ E

1's ideal succ is 4

reordering is

S ~~~~ 1|4 ~~~ 5|2 ~~~ 3|6 ~~~ E

So we either try and find a 5 which is the ideal pred of 2, or a 6 which is the ideal succ of 3.

Failing that we might pick some other block that is not currently followed by its ideal succ.

So one idea is to keep 3 values for each block: its min score, current score, and best score (lower is better). Order the blocks by current-min. Pick of the best as the first split, and then see if any of the next few provide a good second split.

Likely though this ends up needing a priority queue or similar as once we accept an arrangement we need to update some of the costings...

Copy link
Contributor

Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.

@github-actions github-actions bot locked and limited conversation to collaborators Aug 14, 2024
@amanasifkhalid
Copy link
Member Author

/azp run runtime-coreclr outerloop, Fuzzlyn, Antigen

@amanasifkhalid amanasifkhalid marked this pull request as ready for review October 30, 2024 01:56
@amanasifkhalid
Copy link
Member Author

cc @dotnet/jit-contrib, @AndyAyersMS PTAL. Diffs are a bit broken on Linux at the moment, though other platforms show less dramatic churn than I expected -- it's possible the current cost model isn't sophisticated enough to incentivize larger bets. PerfScore diffs look like a net improvement, and TP cost is pretty cheap. Antigen and Fuzzlyn failures don't look related to layout. Thanks!

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some preliminary notes.

I find it a little confusing to use "cost" if the goal of the optimization is to maximize. Maybe consider "benefit"? Or rework things to be minimizing cost?

I also think it would be useful to build the overall score for a layout; you should be able to show at each step this value is increasing (/decreasing if you minimize).

When we reorder it seems like there are six impacted blocks (3 have new preds, 3 have new succ), so I'm surprised we are just reconsidering 3 sets of edges. Maybe this is related to not wanting to mess with the boundaries? But even so I'd expect to see 4 sets...

I would recommend adding a doc with some simple examples worked out in pseudocode.

//
/* static */ bool Compiler::ThreeOptLayout::EdgeCmp(const FlowEdge* left, const FlowEdge* right)
{
return left->getLikelyWeight() < right->getLikelyWeight();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a good idea to give this an equality tie breaker, so that EdgeCmp(a,b) = !EdgeCmp(b,a)

You can use the bbIDs of sources and targets, for instance.

return;
}

// Don't waste time reordering within handler regions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe note that we expect if a finally is "hot" we should be cloning it.

{
assert(edge != nullptr);

BasicBlock* const srcBlk = edge->getSourceBlock();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to keep track of edges that can never be handled? Perhaps a second tracked set?

Alternatively, we could keep some state on the edge itself, there is likely some padding void there already that could hold bits.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to keep track of edges that can never be handled? Perhaps a second tracked set?

I suppose I could just move the set insertion/membership check to the beginning of ConsiderEdge? This is a no-diff change for me locally, and it seems like a convenient way to short-circuit all the checks.

Alternatively, we could keep some state on the edge itself, there is likely some padding void there already that could hold bits.

That sounds worth pursuing. I'll try this out.

@amanasifkhalid
Copy link
Member Author

@AndyAyersMS thanks for the review!

I find it a little confusing to use "cost" if the goal of the optimization is to maximize. Maybe consider "benefit"? Or rework things to be minimizing cost?

Agreed that the semantics looked weird. I've reframed the partition "cost" as its "score," such that we want to maximize the score of a layout by creating fallthrough for the heaviest edges. Personally, I find the cost-minimizing framing of this problem more intuitive, but if/when the model deviates from just using edge weights, I think it will be easier to compute the benefit rather than the cost of an edge (assuming the cost is computed by subtracting the edge's score from all other flow out of the source block, which may not be the same as the source block's weight).

I also think it would be useful to build the overall score for a layout; you should be able to show at each step this value is increasing (/decreasing if you minimize).

Do you intend for this to be a debug-only utility, i.e. to improve the JitDump info reported by 3-opt? Or is there some strategic advantage to knowing the total layout score during 3-opt?

When we reorder it seems like there are six impacted blocks (3 have new preds, 3 have new succ), so I'm surprised we are just reconsidering 3 sets of edges. Maybe this is related to not wanting to mess with the boundaries? But even so I'd expect to see 4 sets...

Yeah, this bit of logic hasn't aged well over the last few iterations. I've simplified it a bit, and diffs against the last revision seem to show we're capturing more reordering opportunities.

I would recommend adding a doc with some simple examples worked out in pseudocode.

Is it alright if I split that out into a separate PR?

@amanasifkhalid
Copy link
Member Author

/azp run runtime-coreclr outerloop, Fuzzlyn, Antigen

@AndyAyersMS
Copy link
Member

I also think it would be useful to build the overall score for a layout; you should be able to show at each step this value is increasing (/decreasing if you minimize).

Do you intend for this to be a debug-only utility, i.e. to improve the JitDump info reported by 3-opt? Or is there some strategic advantage to knowing the total layout score during 3-opt?

Mostly debug, I guess. When we did this way back when we built the global scorer first and tried to ensure that a good score looked like a good layout, and that a bad score looked like a bad layout, and that there were sufficient methods with bad layouts (and a set of "bad" candidates to look at).

But if you know how close the score is to a good score, you can also use this for early stopping.

@AndyAyersMS
Copy link
Member

I would recommend adding a doc with some simple examples worked out in pseudocode.

Is it alright if I split that out into a separate PR?

Yes, sure.

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

unsigned position = 0;

// Initialize current block order
for (BasicBlock* const block : compiler->Blocks(compiler->fgFirstBB, finalBlock))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to fuse this with the loop just above?

Copy link
Member Author

@amanasifkhalid amanasifkhalid Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Splitting the loops up allows us to calculate the exact size of blockOrder/tempOrder before they're needed. Instead of allocating the exact size needed, I can use fgBBcount for the array sizes (and maybe trim this upper bound as we search for the last hot block). It'll waste some memory in some cases, but then we only need one loop.

}
#endif // DEBUG

ThreeOptLayout layoutRunner(this);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be able to save a tiny bit of TP by not bothering to search when there is just one or two blocks, since there is nothing to reorder.

(likewise in the above if there is just a try with one or two blocks)

@amanasifkhalid
Copy link
Member Author

TP diffs show about 0.02% improvement compared to the last run.

@amanasifkhalid
Copy link
Member Author

@AndyAyersMS thanks for the reviews! Are you ok with this revision?

Copy link
Member

@AndyAyersMS AndyAyersMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@amanasifkhalid
Copy link
Member Author

/ba-g macOS pipelines are blocked by deprecated images

@amanasifkhalid amanasifkhalid merged commit 1c10cee into dotnet:main Nov 4, 2024
86 of 104 checks passed
@amanasifkhalid amanasifkhalid deleted the k-opt-layout branch November 4, 2024 18:18
@LoopedBard3
Copy link
Member

Issue tracking regressions from this PR (since GitHub doesn't seem to want to do autolinks at the moment): #109613

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants