[User Story] CI Health: Redefining CI investigations and Health

# [User Story] CI Health: Redefining CI investigations and Merge on Green

This issue documents the work streams underway throughout the runtime to improve the UX and reliability
of the CI system. The main goal is to help developers feel productive while keeping product risk low. The project aims for
a system where 80%+ of all PRs are merged with a green "Build Analysis" check, with all issues understood, and with less
time wasted on repetitive investigation of known issues.
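
To make the 80%+ target concrete, here is a minimal sketch of how the share of green merges could be computed. The `MergedPR` record and its `build_analysis_green` field are hypothetical names chosen for illustration; they are not part of Build Analysis or any existing dashboard.

```python
from dataclasses import dataclass


@dataclass
class MergedPR:
    number: int
    # Hypothetical field: True if "Build Analysis" was green at merge time.
    build_analysis_green: bool


def green_merge_rate(prs):
    """Return the fraction of merged PRs whose Build Analysis check was green."""
    if not prs:
        return 0.0
    green = sum(1 for pr in prs if pr.build_analysis_green)
    return green / len(prs)


TARGET = 0.80  # the 80%+ goal stated above

# Made-up sample data, for illustration only.
sample = [
    MergedPR(1, True),
    MergedPR(2, True),
    MergedPR(3, False),
    MergedPR(4, True),
    MergedPR(5, True),
]
rate = green_merge_rate(sample)
print(f"{rate:.0%} green, target met: {rate >= TARGET}")
```

The sample PR numbers and outcomes are invented; a real measurement would need merge data from the repo.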


The work streams are roughly:

- [x] 1. Issues can be easily searched for throughout the different components of a PR to reason about failures:

  - [x] 1.1 Build issue search within AzDO has been deployed.
  - [x] 1.2 **[Owner: DevWF]** Helix test log searching. Rolled out, and the tab identifies issues, but issue counts are not accurate yet and the failure table on the tracking issue is not updated properly.

- [x] 2. It's easy to report issues directly from the `Build Analysis` check tab:

  - [x] 2.1 Build issues are easy to report as infrastructure issues, for problems like AzDO feeds, with retry capability.
  - [x] 2.2 Test issues are easy to report from the failed build. The report includes all relevant information; the end user only has to provide identifying information so automation can find the correct issue.
  - [x] 2.3 **[Owner: DevWF]** Issues should contain an accurate accounting of occurrences, as this helps teams prioritize impactful issues. The table is still missing the source (i.e. a PR backlink) and an accurate count of hits over a sliding window.

- [ ] 3. Update docs to account for opening issues, assessing if an issue is known, and how to proceed if issues are found:

  - [ ] 3.1 **[Owner: Runtime]** https://github.com/dotnet/runtime/pull/74615 largely achieved this work, but the docs need to be updated for the issue-opening workflow that has since been enabled, as well as some of the timing expectations for the system.

- [x] 4. Tests should have failures logged in a format that Build Analysis can easily reason about and surface in the check tab:

  - [x] 4.1 **[Owner: Runtime/DevWF]** Ensure that the xUnit-based tests properly surface asserts to the generated wrappers. That is, capture StdErr from the child process as much as possible so that the check tab can show the appropriate information.
    - [x] 4.1.1 https://github.com/dotnet/runtime/issues/77918
  - [x] 4.2 **[Owner: Runtime]** Ensure the new source-generated testing framework allows for proper attribution at the test level and that all tests are surfaced in a way that lets the engineering system shortcut people's workflow. This includes an analysis of catastrophe-style issues that are now reported as work item failures. @davidwrighton was taking a cursory look at this.
    - [x] 4.2.1 https://github.com/dotnet/runtime/issues/77735
  - [ ] ~4.3 **[Moved to Future Item]** **[Owner: Runtime]** Ensure timeouts and hang dumps are properly handled in the new testing system, and that they are surfaced in a way build analysis can upload them.~

- [ ] 5. Redefine merge on red: Make build analysis the definition for merge on red
  - [ ] 5.1 **[Owner: DevWF]** Turning 'Build Analysis' into a required check requires:
      - [x] 5.1.1 Reporting an issue should rerun the check against it to move it to the known column.
      - [ ] ~5.1.2 Correlating an issue manually is possible (even if undesirable) to unblock merging.~
      - [x] 5.1.3 Re-running a check is necessary to some extent; otherwise PRs need to wait 1+ hours for DWV to rerun.
  - [x] 5.2 **[Owner: Runtime/DevWF]** Define a metric that measures how successful this new definition is at helping people quickly distinguish errors introduced by their PRs from known issues.
  - [ ] 5.3 **[Owner: Runtime]** Find a way to help people discover this definition easily: if all failures are known issues, it should be obvious to the end user that they can merge. **Specify in documentation to mark this as completed.**
  - [x] 5.4 **[Owner: Runtime/DevWF]** Define a mechanism to study what failures need hardening and what issues should be invested in. The dashboard could surface
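
Two ideas from the list above lend themselves to a short illustration: counting hits over a sliding window (item 2.3) and treating a PR as mergeable when every failure maps to a known issue (item 5.3). The sketch below is only an assumption about how such logic could look. The function names, the 7-day window, the dictionary shape of a failure, and the issue numbers are all hypothetical and do not describe how Build Analysis is implemented.

```python
from datetime import datetime, timedelta


def hits_in_window(hit_times, now, window=timedelta(days=7)):
    """Count occurrences with a timestamp in the interval (now - window, now].

    The 7-day default is an assumed window size, not one taken from the issue.
    """
    start = now - window
    return sum(1 for t in hit_times if start < t <= now)


def all_failures_known(failures, known_issue_ids):
    """Return True when every failure is matched to a known issue.

    Each failure is a dict with an "issue" key holding the matched issue id,
    or None when no known issue matched (a hypothetical shape).
    """
    return all(f.get("issue") in known_issue_ids for f in failures)


# Made-up data, for illustration only.
now = datetime(2023, 1, 8, 12, 0)
hits = [
    datetime(2023, 1, 1, 11, 0),   # just outside the 7-day window
    datetime(2023, 1, 2, 9, 0),
    datetime(2023, 1, 8, 11, 30),
]
recent_hits = hits_in_window(hits, now)

known = {101, 102}  # hypothetical known-issue ids
pr_failures = [{"test": "A", "issue": 101}, {"test": "B", "issue": None}]
mergeable = all_failures_known(pr_failures, known)
print(recent_hits, mergeable)
```

Here one failure has no matching issue, so the PR would not count as mergeable until that failure is reported or fixed.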

# Future Work

- [ ] Adding crashdump and hang dump in Libraries tests
- [ ] V1 & V2 test system: Enable crash collection on macOS (e.g., Singlefile, exception handling)
- [ ] V2 test system: Hang dump collection and integrate symbolication from V1
  - 4.3 **[Moved to Future Item]** **[Owner: Runtime]** Ensure timeouts and hang dumps are properly handled in the new testing system, and that they are surfaced in a way build analysis can upload them.
- [ ] Move from ASP.NET to dotnet/arcade (repo with all the shared infrastructure) for test level retry
- [ ] No crashdump and hang dump support for mono and wasm. They don't have a good crash mechanism yet. @SamMonoRT @lewing @BrzVlad


cc: @JulieLeeMSFT @tommcdon @markwilkie 

cc: @AlitzelMendez @missymessa @ulisesh @ChadNedzlek 

[User Story] CI Health: Redefining CI investigations and Health #75243