Description
This issue addresses the behavior of our default job for running microbenchmarks in this repository. With our current settings, the warmup phase may not always reach full optimization (Tier1), which produces misleading results.
Current Configuration
The default job uses the following settings (see RecommendedConfig.cs#L36-L39):
- Warmup Count: 1
- Iteration Time: 250ms
- Min Iteration Count: 15
- Max Iteration Count: 20
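For reference, a hand-written job with these settings would look roughly like the sketch below (written against recent BenchmarkDotNet APIs; the actual definition lives in RecommendedConfig.cs and may differ in detail):

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using Perfolizer.Horology;

// Approximation of the default job described above; the real code is in
// RecommendedConfig.cs and may differ in detail.
IConfig config = DefaultConfig.Instance.AddJob(
    Job.Default
        .WithWarmupCount(1)                                     // 1 warmup iteration
        .WithIterationTime(TimeInterval.FromMilliseconds(250))  // target ~250ms per iteration
        .WithMinIterationCount(15)
        .WithMaxIterationCount(20));
```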
In addition, Tiered Compilation and Dynamic PGO are enabled by default (except for our nodynamicpgo jobs).
Background on Dynamic PGO
According to the documentation on tier promotion with Dynamic PGO (link):
- The first 30 method invocations run at Tier0.
- The next 30 method invocations use Tier0Instrumented.
- Only after these 60 invocations does the method get promoted to Tier1.
Thus, full optimization (i.e., promotion to Tier1) requires at least 60 invocations. Note that with OSR enabled, the method might not actually be running at Tier0 but rather as an OSR method; the same promotion rules to Tier1 still apply either way.
Observed Outcomes
Depending on the duration of a single microbenchmark operation at Tier0, which determines how many operations BenchmarkDotNet packs into each 250ms iteration and therefore how quickly the 60-invocation threshold is crossed, we see different outcomes (modeled in the sketch after this list):
- >250ms per operation: All 20 iterations remain in Tier0/OSR.
- 80ms – 250ms: A mix of Tier0/OSR and PGO Instrumentation (leading to a bimodal distribution).
- 50ms – 80ms: A mix of Tier0/OSR, PGO Instrumentation, and Tier1 (resulting in a trimodal distribution).
- <50ms: Likely to be promoted to Tier1 during warmup.
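These buckets fall out of simple arithmetic. The following back-of-the-envelope model (my own sketch, not BenchmarkDotNet's actual pilot logic, which also runs extra calibration invocations) estimates which tier each measured iteration starts in, given the default job above and the documented 30/60 promotion thresholds:

```csharp
using System;
using System.Linq;

class TierEstimate
{
    // Documented Dynamic PGO thresholds: 30 invocations at Tier0,
    // 30 more at Tier0Instrumented, then Tier1.
    static string Tier(int invocations) =>
        invocations >= 60 ? "Tier1" :
        invocations >= 30 ? "Tier0Instrumented" :
                            "Tier0/OSR";

    static void Main()
    {
        foreach (double opMs in new[] { 300.0, 100.0, 60.0, 20.0 })
        {
            // How many operations fit into a 250ms iteration at Tier0 speed.
            int opsPerIter = Math.Max(1, (int)(250 / opMs));

            // Tier each of the 20 measured iterations starts in, counting
            // the single warmup iteration's invocations first.
            var summary = Enumerable.Range(0, 20)
                .Select(i => Tier((i + 1) * opsPerIter))
                .GroupBy(t => t)
                .Select(g => $"{g.Count()}x {g.Key}");

            Console.WriteLine($"{opMs,5:F0}ms/op: {string.Join(", ", summary)}");
        }
    }
}
```

Running this reproduces the buckets: 300ms/op stays entirely in Tier0/OSR, 100ms/op splits between Tier0/OSR and Tier0Instrumented (bimodal), 60ms/op hits all three tiers (trimodal), and 20ms/op reaches Tier1 within the first few iterations.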
For some benchmarks, we attempted to address this by setting a higher warmup count to try to ensure promotion to Tier1, but many of these no longer work with Dynamic PGO. A good example is Base64Tests, where we use [WarmupCount(30)]: 30 warmup iterations were enough when we only had Tiered Compilation (whose promotion threshold is 30 calls), but they fall short of the 60 invocations Dynamic PGO requires.
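The pattern looks like this (the class body here is illustrative, not the actual Base64Tests code):

```csharp
using System;
using BenchmarkDotNet.Attributes;

// Illustrative only; mirrors the [WarmupCount(30)] pattern in Base64Tests.
// 30 warmup iterations cover the old Tiered Compilation threshold of 30
// calls, but not the ~60 invocations Dynamic PGO needs before Tier1.
[WarmupCount(30)]
public class Base64LikeBenchmarks
{
    private readonly byte[] _data = new byte[1024];

    [Benchmark]
    public string ToBase64() => Convert.ToBase64String(_data);
}
```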
Impact on Benchmark Measurements
Below are two examples that show how this problem causes false improvements or regressions and makes distributions appear multimodal.
- LinqBenchmarks.Where00ForX:
  - This test's running time sits close to the threshold at which BenchmarkDotNet decides to run 1 or 2 operations per iteration. Whichever it happens to hit produces a different mix of tiers, making it look like a regression or improvement has occurred.
  - Below are two graphs showing this benchmark's history over the last 90 days; the improvements and regressions correlate strongly with whether 1 or 2 operations per iteration were performed.
- Benchstone.BenchF.NewtR.Test:
  - One run of this benchmark clearly shows a trimodal performance distribution across 20 iterations, reflecting the contributions of Tier0/OSR, PGO Instrumentation, and Tier1.
I ran some queries on Kusto to estimate the number of unique tests that fall into the three buckets on Windows x64 CoreCLR runs over the last 10 days:
- >250ms: 157 tests
- 80ms – 250ms: 30 tests
- 50ms – 80ms: 105 tests
This query only identified tests that may be impacted; I have not analysed how much their results are actually affected. Also, if multiple benchmarks in the same class call the same method, that method might be in Tier0/OSR for the first benchmark and Tier1 for the second, which is another likely source of noise in the results (illustrated in the sketch below).
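To make the shared-method effect concrete, here is a hypothetical shape (names are invented; this assumes both benchmarks execute in the same process):

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

// Hypothetical example: both benchmarks funnel through SharedCore. If they
// run in the same process, whichever executes first spends its invocations
// walking SharedCore through Tier0 -> Tier0Instrumented -> Tier1, so the
// second benchmark measures a different (mostly Tier1) mix.
public class SharedMethodBenchmarks
{
    [Benchmark]
    public int VariantA() => SharedCore(21);

    [Benchmark]
    public int VariantB() => SharedCore(42);

    [MethodImpl(MethodImplOptions.NoInlining)] // keep the shared method a real call
    private static int SharedCore(int x) => x * x + 1;
}
```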
Potential Solutions
Here are some potential solutions I came up with to address the issue, though I expect the folks who work on the JIT may have better ones:
- Ensure that all methods are invoked at least 60 times during warmup/JITting (see the first sketch after this list).
  - This might not be future-proof if JIT behavior changes.
  - This could significantly increase the time it takes to run all the benchmarks.
- Modify the warmup or JIT phase of BenchmarkDotNet to programmatically detect whether the code has reached Tier1.
  - I don't know enough about the JIT to know whether this is possible or practical.
- Use AggressiveOptimization on longer tests and increase the warmup count on shorter tests (see the second sketch after this list).
  - This approach adds significant maintenance overhead as we add new performance tests or as the underlying performance changes over time.
  - With AggressiveOptimization enabled, we won't be able to measure the impact of Tiered Compilation or Dynamic PGO.
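A minimal sketch of the first option, assuming the standard BenchmarkDotNet Job API: every warmup iteration performs at least one invocation, so 60 warmup iterations guarantee at least 60 invocations. Hard-coding today's threshold is exactly why this is not future-proof:

```csharp
using BenchmarkDotNet.Jobs;

// Sketch for option 1: 60 warmup iterations guarantee >= 60 invocations,
// matching the current Dynamic PGO promotion threshold. Hard-coding the
// threshold is fragile, and the extra warmup lengthens every run.
Job sixtyWarmups = Job.Default.WithWarmupCount(60);
```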
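And a sketch of the third option (the benchmark body is invented for illustration). Note that MethodImplOptions.AggressiveOptimization applies only to the attributed method itself, not to its callees, and it opts that method out of tiering entirely, which is why tiering and PGO effects can no longer be measured on it:

```csharp
using System.Runtime.CompilerServices;
using BenchmarkDotNet.Attributes;

public class LongRunningBenchmarks
{
    // Option 3: compile this method at full optimization immediately,
    // skipping Tier0/instrumentation, at the cost of never observing
    // Tiered Compilation or Dynamic PGO behavior for it.
    [Benchmark]
    [MethodImpl(MethodImplOptions.AggressiveOptimization)]
    public long HypotheticalLongBenchmark()
    {
        long sum = 0;
        for (int i = 0; i < 100_000_000; i++)
            sum += i;
        return sum;
    }
}
```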