Description
This issue addresses the behavior of our default job for running microbenchmarks in this repository. With our current settings, the warmup phase may not always reach full optimization (Tier1), which produces misleading results.
Current Configuration
The default job uses the following settings (see RecommendedConfig.cs#L36-L39):
- Warmup Count: 1
- Iteration Time: 250ms
- Min Iteration Count: 15
- Max Iteration Count: 20
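For reference, a hand-written job with these settings would look roughly like the sketch below (written against recent BenchmarkDotNet APIs; the actual definition lives in RecommendedConfig.cs and may differ in detail):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using Perfolizer.Horology;

// Approximation of the default job described above; the real code is in
// RecommendedConfig.cs and may differ in detail.
IConfig config = DefaultConfig.Instance.AddJob(
    Job.Default
        .WithWarmupCount(1)                                     // 1 warmup iteration
        .WithIterationTime(TimeInterval.FromMilliseconds(250))  // target ~250ms per iteration
        .WithMinIterationCount(15)
        .WithMaxIterationCount(20));
```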
In addition, Tiered Compilation and Dynamic PGO are enabled by default (except for our nodynamicpgo jobs).
Background on Dynamic PGO
According to the documentation on tier promotion with Dynamic PGO (link):
- The first 30 method invocations run at Tier0.
- The next 30 method invocations use Tier0Instrumented.
- Only after these 60 invocations does the method get promoted to Tier1.
Thus, full optimization (i.e., promotion to Tier1) requires at least 60 invocations. Note that with OSR enabled, the method might not actually be running at Tier0 but rather as an OSR method; the same promotion rules to Tier1 still apply either way.
Observed Outcomes
Depending on the duration of a single microbenchmark operation at Tier0, which determines how many operations BenchmarkDotNet packs into each 250ms iteration and therefore how quickly the 60-invocation threshold is crossed, we see different outcomes (modeled in the sketch after this list):
- >250ms per operation: All 20 iterations remain in Tier0/OSR.
- 80ms – 250ms: A mix of Tier0/OSR and PGO Instrumentation (leading to a bimodal distribution).
- 50ms – 80ms: A mix of Tier0/OSR, PGO Instrumentation, and Tier1 (resulting in a trimodal distribution).
- <50ms: Likely to be promoted to Tier1 during warmup.
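These buckets fall out of simple arithmetic. The following back-of-the-envelope model (my own sketch, not BenchmarkDotNet's actual pilot logic, which also runs extra calibration invocations) estimates which tier each measured iteration starts in, given the default job above and the documented 30/60 promotion thresholds:

```csharp
using System;
using System.Linq;

class TierEstimate
{
    // Documented Dynamic PGO thresholds: 30 invocations at Tier0,
    // 30 more at Tier0Instrumented, then Tier1.
    static string Tier(int invocations) =>
        invocations >= 60 ? "Tier1" :
        invocations >= 30 ? "Tier0Instrumented" :
                            "Tier0/OSR";

    static void Main()
    {
        foreach (double opMs in new[] { 300.0, 100.0, 60.0, 20.0 })
        {
            // How many operations fit into a 250ms iteration at Tier0 speed.
            int opsPerIter = Math.Max(1, (int)(250 / opMs));

            // Tier each of the 20 measured iterations starts in, counting
            // the single warmup iteration's invocations first.
            var summary = Enumerable.Range(0, 20)
                .Select(i => Tier((i + 1) * opsPerIter))
                .GroupBy(t => t)
                .Select(g => $"{g.Count()}x {g.Key}");

            Console.WriteLine($"{opMs,5:F0}ms/op: {string.Join(", ", summary)}");
        }
    }
}
```

Running this reproduces the buckets: 300ms/op stays entirely in Tier0/OSR, 100ms/op splits between Tier0/OSR and Tier0Instrumented (bimodal), 60ms/op hits all three tiers (trimodal), and 20ms/op reaches Tier1 within the first few iterations.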
For some benchmarks, we attempted to address this by setting a higher warmup count to try to ensure promotion to Tier1, but many of these no longer work with Dynamic PGO. A good example is Base64Tests, where we use [WarmupCount(30)]: 30 warmup iterations were enough when we only had Tiered Compilation (whose promotion threshold is 30 calls), but they fall short of the 60 invocations Dynamic PGO requires.
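The pattern looks like this (the class body here is illustrative, not the actual Base64Tests code):

```csharp
using System;
using BenchmarkDotNet.Attributes;

// Illustrative only; mirrors the [WarmupCount(30)] pattern in Base64Tests.
// 30 warmup iterations cover the old Tiered Compilation threshold of 30
// calls, but not the ~60 invocations Dynamic PGO needs before Tier1.
[WarmupCount(30)]
public class Base64LikeBenchmarks
{
    private readonly byte[] _data = new byte[1024];

    [Benchmark]
    public string ToBase64() => Convert.ToBase64String(_data);
}
```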
Impact on Benchmark Measurements
Below are two examples that show how this problem causes false improvements or regressions and makes distributions appear multimodal.
- LinqBenchmarks.Where00ForX:
  - This test's running time sits close to the threshold at which BenchmarkDotNet decides to run 1 or 2 operations per iteration. Whichever it happens to hit produces a different mix of tiers, making it look like a regression or improvement has occurred.
  - Below are two graphs showing this benchmark's history over the last 90 days; the improvements and regressions correlate strongly with whether 1 or 2 operations per iteration were performed.
- Benchstone.BenchF.NewtR.Test:
  - One run of this benchmark clearly shows a trimodal performance distribution across 20 iterations, reflecting the contributions of Tier0/OSR, PGO Instrumentation, and Tier1.
I ran some queries on Kusto to estimate the number of unique tests that fall into the three buckets on Windows x64 CoreCLR runs over the last 10 days:
- >250ms: 157 tests
- 80ms – 250ms: 30 tests
- 50ms – 80ms: 105 tests
This query only identified tests that may be impacted; I have not analysed how much their results are actually affected. Also, if multiple benchmarks in the same class call the same method, that method might be in Tier0/OSR for the first benchmark and Tier1 for the second, which is another likely source of noise in the results (illustrated in the sketch below).
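To make the shared-method effect concrete, here is a hypothetical shape (names are invented; this assumes both benchmarks execute in the same process):

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

// Hypothetical example: both benchmarks funnel through SharedCore. If they
// run in the same process, whichever executes first spends its invocations
// walking SharedCore through Tier0 -> Tier0Instrumented -> Tier1, so the
// second benchmark measures a different (mostly Tier1) mix.
public class SharedMethodBenchmarks
{
    [Benchmark]
    public int VariantA() => SharedCore(21);

    [Benchmark]
    public int VariantB() => SharedCore(42);

    [MethodImpl(MethodImplOptions.NoInlining)] // keep the shared method a real call
    private static int SharedCore(int x) => x * x + 1;
}
```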
Potential Solutions
Here are some potential solutions I came up with to address the issue, though I expect the folks who work on the JIT may have better ones:
- Ensure that all methods are invoked at least 60 times during warmup/JITting (see the first sketch after this list).
  - This might not be future-proof if JIT behavior changes.
  - This could significantly increase the time it takes to run all the benchmarks.
- Modify the warmup or JIT phase of BenchmarkDotNet to programmatically detect whether the code has reached Tier1.
  - I don't know enough about the JIT to know whether this is possible or practical.
- Use AggressiveOptimization on longer tests and increase the warmup count on shorter tests (see the second sketch after this list).
  - This approach adds significant maintenance overhead as we add new performance tests or as the underlying performance changes over time.
  - With AggressiveOptimization enabled, we won't be able to measure the impact of Tiered Compilation or Dynamic PGO.
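A minimal sketch of the first option, assuming the standard BenchmarkDotNet Job API: every warmup iteration performs at least one invocation, so 60 warmup iterations guarantee at least 60 invocations. Hard-coding today's threshold is exactly why this is not future-proof:

```csharp
using BenchmarkDotNet.Jobs;

// Sketch for option 1: 60 warmup iterations guarantee >= 60 invocations,
// matching the current Dynamic PGO promotion threshold. Hard-coding the
// threshold is fragile, and the extra warmup lengthens every run.
Job sixtyWarmups = Job.Default.WithWarmupCount(60);
```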
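And a sketch of the third option (the benchmark body is invented for illustration). Note that MethodImplOptions.AggressiveOptimization applies only to the attributed method itself, not to its callees, and it opts that method out of tiering entirely, which is why tiering and PGO effects can no longer be measured on it:

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class LongRunningBenchmarks
{
    // Option 3: compile this method at full optimization immediately,
    // skipping Tier0/instrumentation, at the cost of never observing
    // Tiered Compilation or Dynamic PGO behavior for it.
    [Benchmark]
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public long HypotheticalLongBenchmark()
    {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++)
            sum += i;
        return sum;
    }
}
```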