Description
Overview
There is now a fairly substantial body of literature on improving compiler optimization by leveraging machine learning (ML). A comprehensive survey compiled by Zheng Wang can be found here: https://github.com/zwang4/awesome-machine-learning-in-compilers.
Here's a closely relevant paper and GitHub repo: MLGO: a Machine Learning Guided Compiler Optimizations Framework. Like our efforts below, it leverages a policy gradient algorithm (reinforcement learning).
Several years ago we attempted to leverage ML to create better inline heuristics. That experiment was largely a failure, at least as far as using ML to predict whether an inline would improve performance. But one of the key speculations from that work was that the lack of PGO was to blame. Now that we have PGO, it is a good time to revisit this area to see if PGO was indeed the key missing ingredient.
Proposed Work
During the .NET 9 product cycle, we plan to investigate applying ML techniques to heuristics used by the JIT. The items below represent the initial steps of this investigation. This list will change and grow as the work progresses.
We also want to tackle a relatively simple problem, at least initially. Thus our initial effort will be to refine and improve the heuristic the JIT uses for Common Subexpression Elimination (aka CSE):
- Study the current CSE heuristic and try to understand what it is modelling. Identify apparent shortcomings and limitations.
- See initial notes below.
- Develop tooling for varying which CSEs the JIT can perform, either in concert with the current heuristic or independently of it. Measure the impact of this variation on key JIT metrics (performance, code size, JIT time, etc.).
- See how well PerfScores can be used as a proxy for measuring actual performance. Ideally, we can leverage PerfScores to avoid needing to run benchmarks repeatedly. But if not, perhaps we can still rely on PerfScores to limit or prioritize the runs needed.
- See notes below for some initial data.
- Try a number of different ML techniques for improving CSE heuristics, both "small step" (evaluating one CSE at a time) and "big step" (evaluate an entire set of CSEs). Identify the key predictive observations. Try both interpretable and black box models.
- I have been working on a Policy Gradient approach. The implementation is split with part of it in the JIT and the rest in a driver program. Documentation is lacking, so ping me if you want to try this at home.
- JIT part was introduced in JIT: initial support for reinforcement learning of CSE heuristic #96880
- Driver part in Initial version of a tool for ML on CSEs jitutils#389
- More work on stopping: JIT: stopping preference for ML CSE heuristic #98063
- More work on the driver: Fix RLCSE for new metrics format jitutils#391
- More features JIT: More CSE heuristics adjustments #98257
- More work on the driver: RLCSE: fix MCMC and GatherFeatures, overwrite dumps jitutils#395
- Streaming SPMI mode JIT: Streaming mode for SPMI #98440
- And client updates to use that mode: RLCSE: streaming SPMI mode jitutils#397
- Enable running in release with fixed parameter set: JIT: refactor CSE to allow running greedy ML heuristic in release #98729
- actually try enabling (draft): JIT: temporarily enable RLCSEGreedy to see how it fares in CI #98776
- See how feasible it is to have a single architecture-agnostic heuristic (perhaps parameterized, like the current heuristic, with register info) or whether different heuristics are warranted for, say, arm64 and x64.
- Develop a common ML "infrastructure" that we can apply to other similar heuristics in the JIT.
- Set up a CI job to run the Policy Gradient
- JIT: test the new cse policy in jit-experimental #98777 (running the policy, not the training)
- Set up a Perf lab experiment to run the current "best" greedy policy on benchmarks. This should be running now; will check on the data in a few days.
- Enable RLCSE experiment in the perf lab #98779
- Add setup for RLCSE experiment performance#3975
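The Policy Gradient approach mentioned above can be sketched as a simple REINFORCE-style loop. The sketch below is a toy stand-in, not the actual JIT/driver implementation: the feature vectors, reward, baseline, and learning rate are all made up for illustration, and a real driver would obtain the reward by re-jitting through SPMI and comparing perf scores.

```python
import numpy as np

rng = np.random.default_rng(0)

def cse_probability(theta, features):
    """Sigmoid policy: probability of performing a candidate CSE."""
    return 1.0 / (1.0 + np.exp(-(features @ theta)))

def rollout(theta, candidates):
    """Small-step rollout: sample a perform/skip decision per candidate."""
    decisions, grads = [], []
    for f in candidates:
        p = cse_probability(theta, f)
        a = rng.random() < p                 # True = perform this CSE
        decisions.append(a)
        grads.append((float(a) - p) * f)     # grad of log-likelihood (Bernoulli)
    return decisions, grads

def update(theta, grads, reward, baseline, lr=0.02):
    """REINFORCE: push parameters toward decisions that beat the baseline."""
    for g in grads:
        theta = theta + lr * (reward - baseline) * g
    return theta

# Toy training loop. In a real setup the reward would be something like
# -(perf score ratio vs. the default heuristic), measured via SPMI.
theta = np.zeros(4)
for _ in range(200):
    candidates = rng.normal(size=(5, 4))     # fake per-candidate features
    decisions, grads = rollout(theta, candidates)
    reward = float(sum(decisions))           # stand-in reward signal
    theta = update(theta, grads, reward, baseline=2.5)
```

The trained parameter vector plays the same role as the `Params` line in the results below: a fixed set of weights that a greedy release-mode policy can evaluate without any ML machinery at JIT time.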
Also worth noting is Lee Culver's work to adapt the JIT CSE problem into a more standard gymnasium setting: JIT CSE Optimization - Add a gymnasium environment for reinforcement learning #101856
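For flavor, an environment in the spirit of that gymnasium adaptation might walk a method's CSE candidates one step at a time. The class below is only a minimal stand-in that mirrors the Gymnasium reset/step protocol (it does not depend on the library), and its features and rewards are synthetic placeholders, not the actual environment from #101856.

```python
import numpy as np

class CseEnv:
    """One episode walks a method's CSE candidates; each step decides
    whether to perform the current candidate (action 1) or skip it (0).
    Observations, rewards, and episode shape are synthetic."""

    def __init__(self, num_candidates=5, num_features=4, seed=0):
        self.rng = np.random.default_rng(seed)
        self.num_candidates = num_candidates
        self.num_features = num_features

    def reset(self):
        self.candidates = self.rng.normal(
            size=(self.num_candidates, self.num_features))
        self.index = 0
        return self.candidates[0], {}

    def step(self, action):
        # A real environment would re-jit and diff perf scores here; this
        # toy reward just favors performing candidates whose first feature
        # (imagine "weighted use count") is positive.
        reward = float(action) * float(self.candidates[self.index][0])
        self.index += 1
        terminated = self.index >= self.num_candidates
        obs = None if terminated else self.candidates[self.index]
        return obs, reward, terminated, False, {}

env = CseEnv()
obs, info = env.reset()
total, done = 0.0, False
while not done:
    action = 1 if obs[0] > 0 else 0          # trivial greedy policy
    obs, reward, done, truncated, info = env.step(action)
    total += reward
```

Framing the problem this way lets standard RL libraries drive the decision loop instead of a bespoke driver program.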
Update May 2024.
Given the work above we have been able to produce heuristics that can improve the aggregate perf score for methods via CSE, with about a 0.4% geomean improvement across all methods with CSE candidates. So far those results haven't translated into widespread improvements in our perf lab data. Exactly why is unclear; the data below would suggest one or more of the following:
- poor correlation of perf score with actual perf
- few benchmarks whose scores critically depend on CSEs, or
- those benchmarks whose perf does critically depend on CSEs are all easy/obvious cases
- difficulty extracting small signal from noise (still lots of noisy benchmark results)
The training evaluations show that the best ML heuristic obtains about half of what an optimal heuristic could achieve. Some recent results:
```
Best/base: 0.9913 [optimizing for score]
vs Base 0.9957 Better 440 Same 1476 Worse 84
vs Best 1.0044 Better 0 Same 1607 Worse 393
Params 0.6093, 0.7750,-0.5701,-0.6827, 0.5060,-0.0514,-1.3563, 0.4515,-2.0999, 0.0000,-1.0623,-1.3723, 0.0000,-0.6143,-0.8286,-0.2956,-1.1433, 0.3418, 1.3964,-0.0043,-0.4237, 0.6097,-1.9773,-0.3684, 0.7246

Collecting greedy policy data via SPMI... done (27583 ms)
Greedy/Base (score): 34965 methods, 8375 better, 24792 same, 1797 worse, 0.9960 geomean
  Best:  76679 @ 0.7731
  Worst: 82000 @ 1.4512
Greedy/Base (size): 34965 methods, 4628 better, 24694 same, 5642 worse, 1.0005 geomean
  Best:  15489 @ 0.7000
  Worst: 82000 @ 1.4205
```
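For reference, the geomean figures above aggregate per-method score ratios (new/base), so values below 1.0 mean the heuristic improved the aggregate perf score. The example ratios below are illustrative, not taken from the runs above.

```python
import math

def geomean(ratios):
    """Geometric mean of per-method ratios (new/base)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# e.g. three methods: two improved, one slightly regressed
print(round(geomean([0.90, 0.95, 1.02]), 4))  # → 0.9554
```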
We don't have any more active work planned on machine learning of heuristics for .NET 9. However, we expect to revisit this area as .NET 9 work winds down, so stay tuned for further development in the coming months.