JIT: temporarily enable RLCSEGreedy to see how it fares in CI #98776
Conversation
Ah, there's a release mode issue to sort out... one second.
Diff results for #98776

Throughput diffs

Throughput diffs for linux/arm64 ran on linux/x64
- FullOpts (+0.01%)

Throughput diffs for linux/x64 ran on linux/x64
- Overall (+0.00% to +0.01%)
- FullOpts (+0.00% to +0.01%)

Details here

Throughput diffs for linux/arm64 ran on windows/x64
- Overall (+0.00% to +0.03%)
- FullOpts (+0.01% to +0.03%)

Throughput diffs for linux/x64 ran on windows/x64
- Overall (+0.00% to +0.03%)
- FullOpts (+0.01% to +0.03%)

Throughput diffs for osx/arm64 ran on windows/x64
- Overall (+0.00% to +0.03%)
- FullOpts (+0.01% to +0.03%)

Throughput diffs for windows/arm64 ran on windows/x64
- Overall (+0.00% to +0.02%)
- FullOpts (+0.01% to +0.02%)

Throughput diffs for windows/x64 ran on windows/x64
- Overall (+0.00% to +0.03%)
- FullOpts (+0.01% to +0.03%)

Details here

Throughput diffs for linux/arm ran on windows/x86
- Overall (+0.00% to +0.02%)
- FullOpts (+0.01% to +0.02%)

Throughput diffs for windows/x86 ran on windows/x86
- Overall (+0.00% to +0.02%)
- FullOpts (+0.01% to +0.02%)

Details here
Diff results for #98776

Assembly diffs

Assembly diffs for linux/arm64 ran on windows/x64
Diffs are based on 2,549,519 contexts (1,019,526 MinOpts, 1,529,993 FullOpts). MISSED contexts: base: 172 (0.01%), diff: 5,238 (0.21%)
- Overall (-1,000,028 bytes)
- FullOpts (-1,000,028 bytes)

Assembly diffs for linux/x64 ran on windows/x64
Diffs are based on 2,542,303 contexts (988,245 MinOpts, 1,554,058 FullOpts). MISSED contexts: base: 177 (0.01%), diff: 1,098 (0.04%)
- Overall (+2,009,297 bytes)
- FullOpts (+2,009,297 bytes)

Assembly diffs for osx/arm64 ran on windows/x64
Diffs are based on 2,312,793 contexts (945,402 MinOpts, 1,367,391 FullOpts). MISSED contexts: base: 170 (0.01%), diff: 4,920 (0.21%)
- Overall (-984,144 bytes)
- FullOpts (-984,144 bytes)

Assembly diffs for windows/arm64 ran on windows/x64
Diffs are based on 2,397,782 contexts (955,693 MinOpts, 1,442,089 FullOpts). MISSED contexts: base: 174 (0.01%), diff: 5,300 (0.22%)
- Overall (-1,050,688 bytes)
- FullOpts (-1,050,688 bytes)

Assembly diffs for windows/x64 ran on windows/x64
Diffs are based on 2,429,920 contexts (941,815 MinOpts, 1,488,105 FullOpts). MISSED contexts: base: 176 (0.01%), diff: 891 (0.04%)
- Overall (+2,536,694 bytes)
- FullOpts (+2,536,694 bytes)

Details here

Assembly diffs for windows/x86 ran on linux/x86
Diffs are based on 2,339,430 contexts (847,225 MinOpts, 1,492,205 FullOpts). MISSED contexts: base: 1 (0.00%), diff: 8,833 (0.38%)
- Overall (+1,228,297 bytes)
- FullOpts (+1,228,297 bytes)

Details here
Diff results for #98776

Assembly diffs

Assembly diffs for linux/arm ran on windows/x86
Diffs are based on 2,262,678 contexts (832,863 MinOpts, 1,429,815 FullOpts). MISSED contexts: base: 75,600 (3.23%), diff: 76,390 (3.26%)
- Overall (-3,300,590 bytes)
- FullOpts (-3,300,590 bytes)

Details here

Throughput diffs

Throughput diffs for linux/arm64 ran on windows/x64
- Overall (+0.15% to +1.37%)
- FullOpts (+0.15% to +2.35%)

Throughput diffs for linux/x64 ran on windows/x64
- Overall (+0.48% to +1.58%)
- FullOpts (+0.53% to +2.69%)

Throughput diffs for osx/arm64 ran on windows/x64
- Overall (+0.15% to +1.38%)
- MinOpts (-0.00% to +0.01%)
- FullOpts (+0.15% to +2.40%)

Throughput diffs for windows/arm64 ran on windows/x64
- Overall (+0.18% to +1.37%)
- MinOpts (-0.01% to +0.00%)
- FullOpts (+0.18% to +2.35%)

Throughput diffs for windows/x64 ran on windows/x64
- Overall (+0.56% to +1.77%)
- FullOpts (+0.57% to +2.99%)

Details here
Diff results for #98776

Assembly diffs

Assembly diffs for linux/arm64 ran on windows/x64
Diffs are based on 2,549,519 contexts (1,019,526 MinOpts, 1,529,993 FullOpts). MISSED contexts: base: 172 (0.01%), diff: 5,238 (0.21%)
- Overall (-1,000,028 bytes)
- FullOpts (-1,000,028 bytes)

Assembly diffs for linux/x64 ran on windows/x64
Diffs are based on 2,542,303 contexts (988,245 MinOpts, 1,554,058 FullOpts). MISSED contexts: base: 177 (0.01%), diff: 1,098 (0.04%)
- Overall (+2,009,297 bytes)
- FullOpts (+2,009,297 bytes)

Assembly diffs for osx/arm64 ran on windows/x64
Diffs are based on 2,312,793 contexts (945,402 MinOpts, 1,367,391 FullOpts). MISSED contexts: base: 170 (0.01%), diff: 4,920 (0.21%)
- Overall (-984,144 bytes)
- FullOpts (-984,144 bytes)

Assembly diffs for windows/arm64 ran on windows/x64
Diffs are based on 2,397,782 contexts (955,693 MinOpts, 1,442,089 FullOpts). MISSED contexts: base: 174 (0.01%), diff: 5,300 (0.22%)
- Overall (-1,050,688 bytes)
- FullOpts (-1,050,688 bytes)

Assembly diffs for windows/x64 ran on windows/x64
Diffs are based on 2,429,920 contexts (941,815 MinOpts, 1,488,105 FullOpts). MISSED contexts: base: 176 (0.01%), diff: 891 (0.04%)
- Overall (+2,536,694 bytes)
- FullOpts (+2,536,694 bytes)

Details here

Assembly diffs for linux/arm ran on windows/x86
Diffs are based on 2,262,678 contexts (832,863 MinOpts, 1,429,815 FullOpts). MISSED contexts: base: 75,600 (3.23%), diff: 76,390 (3.26%)
- Overall (-3,300,590 bytes)
- FullOpts (-3,300,590 bytes)

Assembly diffs for windows/x86 ran on windows/x86
Diffs are based on 2,339,430 contexts (847,225 MinOpts, 1,492,205 FullOpts). MISSED contexts: base: 1 (0.00%), diff: 8,833 (0.38%)
- Overall (+1,228,297 bytes)
- FullOpts (+1,228,297 bytes)

Details here

Throughput diffs

Throughput diffs for linux/arm ran on windows/x86
- Overall (+0.30% to +2.08%)
- FullOpts (+0.31% to +3.40%)

Throughput diffs for windows/x86 ran on windows/x86
- Overall (+0.48% to +1.66%)
- FullOpts (+0.50% to +2.50%)

Details here

Throughput diffs for linux/arm64 ran on linux/x64
- Overall (+0.09% to +1.28%)
- FullOpts (+0.09% to +2.26%)

Throughput diffs for linux/x64 ran on linux/x64
- Overall (+0.45% to +1.48%)
- FullOpts (+0.50% to +2.62%)

Details here
Overall Impression

It appears the new policy has done a decent job of reducing perf scores, especially on Win x64, which was the only os/arch I used for training.

Perf Score

My training runs projected around a 0.4% improvement in perf scores, but this was just for methods with CSEs, so it is a bit hard to project onto aggregate diffs across entire collections or just methods with diffs, since methods with CSEs but no diffs and methods without CSEs will confound things. I will fix my local metric to at least collect the no-diff data in the future.

Despite that, looking at the detailed CI data, the perf score aggregate across all methods shows improvements in all Win x64 collections (sadly asp.net seems to be out of date again; will have to fix that).

The policy was trained on a 100-method sample from an older asp.net MCH, so few or perhaps none of the methods in the diffs above were used as part of training. So there does not seem to be evidence of overfitting; the learned policy seems to handle methods it has never seen passably well.

We don't have enough experience with perf score diffs to judge whether these results are significant. Perhaps these sorts of diffs are easily obtainable.

Code Size

Code size impact on x64 is not great. I have looked at this some, and the current algorithm is biased against the 10-byte constant class handle CSEs for some reason (that is, the CostSz feature, parameter 4 in the program, has a "downvote" weight).

Oddly, there are some good code size reductions on arm64. No idea why.

One thought is that perhaps we should simply train the policy on code size and not perf score, as size is also likely well correlated with perf, and less sensitive to profile weight shenanigans. Doing this is conceptually simple, but I need to add metric tracking for code size into the driver program.

Throughput

Also not great. This is a little surprising, as the CSE algorithm should not be that costly. Will have to dig in deeper there. At first blush it might just be the additional code size, but we have some surprising code size reductions on arm64 and throughput is no better off there.
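(For context, a minimal sketch of how a greedy linear policy of this sort scores candidates. The struct layout, feature names, and function names below are illustrative assumptions, not the actual RLCSEGreedy feature set or parameters; the point is just that a negative learned weight on a feature like CostSz acts as a "downvote" for candidates with that feature.)

```cpp
#include <cstddef>

// Illustrative only: a greedy linear policy scores each CSE candidate as the
// dot product of its feature vector with a learned parameter vector, then
// repeatedly performs the highest-scoring remaining candidate while any
// candidate still looks profitable.
struct CandidateFeatures
{
    // Hypothetical layout; the real feature set is larger and different,
    // e.g. cost, block weight, def/use counts, CostSz, ...
    double features[5];
};

double PreferenceScore(const CandidateFeatures& c, const double (&params)[5])
{
    double score = 0.0;
    for (size_t i = 0; i < 5; i++)
    {
        // A negative parameter acts as a "downvote" for its feature; for
        // example, a negative CostSz weight penalizes candidates whose
        // definitions have large encodings (like 10-byte constant handles).
        score += params[i] * c.features[i];
    }
    return score;
}
```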
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch

Issue Details

Am going to use this as a place to capture observations on the evolved heuristic. I have done a bit of analysis locally already and will add those notes here. One of the key challenges will be figuring out how to try and fix the various problems without losing the benefits.
benchmarks.run 7786

Here's one size regression analysis. I need to find ways to do these faster. @jakobbotsch you mentioned something about automated regression analysis?
So not only a code size regression, but also a perf score regression. Clearly we do a lot more CSEs now... assuming order doesn't matter, this is
So it is a superset. Using MCMC to gauge the space of opportunities here (see the exploration sketch after this comment), we see we can do better than BASE
and sorting
where

In the initial ranking, these CSEs rank pretty low, but rank above
and by the time we've gone through and CSE'd all the higher-ranked candidates things haven't changed much, other than the
What are these CSEs?
So: low cost, "containable" (though not marked as such), 2 defs and 2/4 uses, live across a call. What goes wrong? Doing these CSEs causes V16 DIFF T01 to be spilled
Some thoughts on what might be mis-modelled here:
CSE 01's def is to a local
after CSE
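(A rough sketch of the kind of random exploration meant by "using MCMC to gauge the space of opportunities" above. The function and parameter names are illustrative assumptions, not the actual driver code; `evaluate` stands in for "run the JIT with exactly this subset of CSEs and report the resulting perf score".)

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <limits>
#include <random>
#include <vector>

// Illustrative sketch: randomly explore subsets of CSE candidates and track
// the best observed perf score, to estimate how far the current heuristic is
// from what is achievable for this method (lower perf score is better).
double ExploreCseSpace(size_t numCandidates,
                       int trials,
                       uint64_t seed,
                       const std::function<double(const std::vector<bool>&)>& evaluate)
{
    std::mt19937_64 rng(seed);
    std::bernoulli_distribution coin(0.5);
    double best = std::numeric_limits<double>::max();

    for (int t = 0; t < trials; t++)
    {
        std::vector<bool> doCse(numCandidates);
        for (size_t i = 0; i < numCandidates; i++)
        {
            doCse[i] = coin(rng); // randomly include or exclude each candidate
        }
        best = std::min(best, evaluate(doCse));
    }
    return best;
}
```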
benchmarks.run 18102

One CSE: the base does it, the diff doesn't.
Here we have "downvotes" from CostSz (-2.3) LsraLA (-1.6), LA (-4.2), Const(-1.3), Const+LA (-2.058) But no pressure whatsoever, perf score is better... just the large constants that don't get CSEd. Opt Goal was "Small code" as this is a cctor -- we don't have a heuristic for this (yet). Seems like we ought to make one, similar to what we're doing for PerfScore. On that note I have started adding support for code size as an optimization objective. In the initial cut I've just added it to MCMC, to get a rough feeling for how often optimizing for score and optimizing speed coincide or are at odds with one another. Here is some sample data (200 randomly chosen methods).
Parsing this: with a score-optimal CSE policy we can improve perf scores by about 1.6% (but increase code size by 1.8%); with a size-optimal policy we can reduce code size by about 1.9% and keep perf scores about the same as they are now. This seems to suggest that if we want to improve perf scores via CSE over our current heuristic, we're going to have to accept some size increase.

Caveat: I don't yet surface the optimization objective, so comparisons vs the baseline are a bit tricky, as the baseline changes its heuristic.

Also, this is using the average cross-result: that is, to compute the size impact for the best perf score, I find all experiments with the best perf score and compute the average size (and likewise for score). With sufficiently clever training perhaps we could do better than these averages? I will add yet another metric to find the "best/best" and see if it materially differs from "best/average".

I don't have the ability to optimize a mixed objective (yet?), so I can't say what the actual tradeoff curve of size and score might look like, or whether something like best/best is achievable (given that we're still working on finding a policy that can get the first "best").

Follow-up experiments:
Using a best/best MCMC estimator (that is, for each method, find the best perf score, then among those, find the best code size) we get data like
So a policy that can magically get the best perf score (1.6%) and then the best code size has about a 1.5% size increase, and a policy that can magically get the best code size (1.9%) and then the best perf score will see about a 0.5% perf score decrease. So a bit less pessimistic than the above.

I would like to grow this into a full-fledged Pareto frontier; what we have here are the endpoints. But it is not immediately obvious to me how to do that in aggregate; each method comes with its own set of tradeoffs. Some thoughts:
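(For reference, a minimal sketch of the "best/best" computation described above, over a set of per-method (perf score, code size) observations. The types and names are illustrative assumptions, not the actual MCMC driver code.)

```cpp
#include <algorithm>
#include <limits>
#include <utility>
#include <vector>

// Illustrative sketch of the "best/best" estimator: among all experiments for
// a method, first find the best (lowest) perf score, then among experiments
// that achieve it, take the best (smallest) code size.
struct Experiment
{
    double perfScore;
    double codeSize;
};

std::pair<double, double> BestScoreThenBestSize(const std::vector<Experiment>& runs)
{
    double bestScore = std::numeric_limits<double>::max();
    for (const Experiment& e : runs)
    {
        bestScore = std::min(bestScore, e.perfScore);
    }

    double bestSizeAtBestScore = std::numeric_limits<double>::max();
    for (const Experiment& e : runs)
    {
        if (e.perfScore == bestScore)
        {
            bestSizeAtBestScore = std::min(bestSizeAtBestScore, e.codeSize);
        }
    }
    return {bestScore, bestSizeAtBestScore};
}
```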
I made some updates so that MCMC can track the Pareto frontiers for the methods it explores. Here's one such frontier (score and size normalized by the current JIT heuristic's score/size).

Note these "curves" must pass through or below (1,1); here it passes through, meaning that we can either have smaller code or faster code, but not both. The lines joining the points are fictional, as the observations are discrete, but they help visualize the nature of the tradeoff. Also note we're at the mercy of MCMC's exploration strategy; it may be we should be doing more extensive random sampling, and a more thorough exploration might change the shape of the curve in interesting ways. Will have to experiment some here.

Broadening this out to a bigger set of methods, here are 200 randomly chosen method Pareto frontiers. Roughly speaking we see 3 classes of methods here:
It would be interesting to see if there's some simple way to classify a method (based on what is known at the point where we do CSEs).
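(A minimal sketch of how a per-method Pareto frontier can be extracted from the same (perf score, code size) observations, assuming both quantities are minimized. Again the names are illustrative, not the actual tracking code added to MCMC.)

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Illustrative sketch: keep only the non-dominated (perf score, code size)
// points. A point is dominated if some other point is at least as good on
// both axes and strictly better on one.
struct Experiment
{
    double perfScore;
    double codeSize;
};

std::vector<Experiment> ParetoFrontier(std::vector<Experiment> runs)
{
    // Sort by perf score, breaking ties by code size.
    std::sort(runs.begin(), runs.end(), [](const Experiment& a, const Experiment& b) {
        return (a.perfScore != b.perfScore) ? (a.perfScore < b.perfScore)
                                            : (a.codeSize < b.codeSize);
    });

    std::vector<Experiment> frontier;
    double bestSizeSoFar = std::numeric_limits<double>::max();
    for (const Experiment& e : runs)
    {
        // After sorting, a point is on the frontier iff its code size beats
        // every point with an equal or better perf score seen so far.
        if (e.codeSize < bestSizeSoFar)
        {
            frontier.push_back(e);
            bestSizeSoFar = e.codeSize;
        }
    }
    return frontier;
}
```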
Looking at TP diff
Both the cost of the heuristic and the cost of the extra CSEs show up here. Will focus on the heuristic cost, as the actual policy will fluctuate as we tune the parameters.
Running the greedy heuristic on optimize-for-size cases is also likely a contributor. I am trying to create a size-optimized parameter set now; depending on how that goes, I might either use that or initially disable the greedy heuristic for size cases.
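(A rough sketch of the two options mentioned: switch to a size-tuned parameter set for small-code methods, or skip the greedy policy there entirely. The enum, table names, and function are hypothetical, not the actual JIT configuration surface.)

```cpp
// Illustrative only: choose CSE policy behavior from the method's opt goal.
enum class OptGoal
{
    SmallCode,
    FastCode,
    Blended
};

// Hypothetical parameter tables (defined elsewhere): one tuned for perf
// score, one tuned for code size.
extern const double g_perfScoreParams[];
extern const double g_codeSizeParams[];

const double* SelectCseParams(OptGoal goal, bool& useGreedy)
{
    if (goal == OptGoal::SmallCode)
    {
        // Option 1: use a size-optimized parameter set for small-code methods.
        useGreedy = true;
        return g_codeSizeParams;

        // Option 2 (alternative): skip the greedy policy entirely here and
        // fall back to the existing heuristic:
        //   useGreedy = false; return nullptr;
    }

    useGreedy = true;
    return g_perfScoreParams;
}
```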
Profiling showed that `GetFeatures` was a major factor in throughput. For the most part the features of CSE candidates don't change as we perform CSEs, so build in some logic to avoid recomputing the feature set unless there is some evidence the features have changed. To avoid having to remove already-performed candidates from the candidate vector, we now tag them as `m_performed`; these get ignored during subsequent processing, and discarded if we ever recompute features. This should cut the TP impact roughly in half; the remaining part seems to largely be from doing more CSEs (which we hope will show some perf benefit). Contributes to dotnet#92915.
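(A sketch of the shape of that change, assuming a candidate vector with a per-candidate cached feature array. Apart from `m_performed`, which the commit message names, the class, fields, and methods below are illustrative rather than the actual JIT code.)

```cpp
#include <algorithm>
#include <vector>

// Illustrative sketch: cache candidate features and only recompute them when
// something suggests they may have changed; performed candidates are skipped
// during processing and only discarded when a recompute happens.
struct CseCandidate
{
    bool   m_performed   = false; // CSE already done; ignore in later passes
    double m_features[8] = {};    // cached feature vector (hypothetical size)
};

class GreedyCsePolicy
{
    std::vector<CseCandidate> m_candidates;
    bool                      m_featuresValid = false;

public:
    void NoteFeaturesMayHaveChanged() { m_featuresValid = false; }

    void EnsureFeatures()
    {
        if (m_featuresValid)
        {
            return; // reuse cached features: the common case
        }
        // When we do have to recompute, discard already-performed candidates...
        m_candidates.erase(std::remove_if(m_candidates.begin(), m_candidates.end(),
                                          [](const CseCandidate& c) { return c.m_performed; }),
                           m_candidates.end());
        // ...and refresh the features of the survivors.
        for (CseCandidate& c : m_candidates)
        {
            ComputeFeatures(c);
        }
        m_featuresValid = true;
    }

private:
    void ComputeFeatures(CseCandidate& c)
    {
        // Placeholder: fill in c.m_features from the candidate's current state.
        (void)c;
    }
};
```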
TP did improve a bit...

Throughput diffs for windows/x64 ran on windows/x64
Overall
Tail counts seem to be pretty even... (recall this is the diff/base ratio, so < 1 is an improvement, > 1 is a regression):
Ideally, we'd see some leftward skew here: more big improvements than big regressions. So in some respects the data is still looking mostly like noise... but again we only have 3-4 days' worth of numbers right now, so let's look again next week when there's more.
Draft Pull Request was automatically closed for 30 days of inactivity. Please let us know if you'd like to reopen it.