
[User Story] CI Health: Redefining CI investigations and Health #75243

Open
@hoyosjs

Description

[User Story] CI Health: Redefining CI investigations and Merge on Green

The purpose of this issue is to document the different work streams happening throughout the runtime to improve the UX and reliability
of the CI system. The main goal is to help developers feel productive while keeping product risk low. The aim of this project is a
system where 80%+ of all PRs are merged with a green "Build Analysis" check and all issues understood, reducing the time wasted on
repetitive investigation of known issues.

The work streams are roughly:

  • 1. Issues can be easily searched for throughout the different components of a PR to reason about failures:

    • 1.1 Build issue search within AzDO has been deployed.
    • 1.2 [Owner: DevWF] Helix test log searching. Rolled out, and the tab identifies issues, but issue counts are not accurate yet and it doesn't properly update the failure table on the tracking issue.
  • 2. It's easy to report issues directly from the Build Analysis check tab:

    • 2.1 Build issues are easy to report as infrastructure issues, for cases like AzDO feed failures, with retry capability.
    • 2.2 Test issues are easy to report from the failed build. The report includes all relevant information, and all the end user has to do is provide identifying information so that automation can find the correct issue.
    • 2.3 [Owner: DevWF] Issues should contain an accurate accounting of occurrences, as this helps teams prioritize impactful issues. The table is still missing the source (i.e., a PR backlink) and an accurate count of hits over a sliding window.
  • 3. Update docs to account for opening issues, assessing if an issue is known, and how to proceed if issues are found:

  • 4. Tests should have failures logged in a format that Build Analysis can easily reason about and surface in the check tab:

  • 5. Redefine merge on red: Make Build Analysis the definition for merge on red

    • 5.1 [Owner: DevWF] Turning 'Build Analysis' into a required check requires:
      • 5.1.1 Reporting an issue should rerun the check so that the failure moves to the known column.
      • 5.1.2 Correlating an issue manually is possible (even if undesirable) to unblock merging.
      • 5.1.3 Re-running a check is necessary to some extent; otherwise PRs need to wait 1+ hours for DWV to rerun.
    • 5.2 [Owner: Runtime/DevWF] Define a metric that measures how successful this new definition is at helping people quickly distinguish errors caused by their PRs from known issues.
    • 5.3 [Owner: Runtime] Find a way to help people discover this definition easily: if all failures are known issues, it should be obvious to the end user that they can merge. Specify this in documentation to mark this item as completed.
    • 5.4 [Owner: Runtime/DevWF] Define a mechanism to study which failures need hardening and which issues should be invested in. The dashboard could surface
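
As an illustration of items 2.2, 2.3 and 5.3, here is a minimal sketch of how a failure might be matched to a known issue, how "all failures are known" could be decided, and how hits could be counted over a sliding window. This is not how Build Analysis is implemented; the names `KnownIssue`, `Failure`, `match_known_issue`, `all_failures_known` and `hits_in_window`, the substring matching rule, and the 7-day default window are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class KnownIssue:
    number: int          # hypothetical tracking-issue number
    error_message: str   # identifying text supplied by the reporter (assumed: plain substring)


@dataclass(frozen=True)
class Failure:
    pr: int              # PR the failure was seen on (the "backlink" from item 2.3)
    log: str             # failure text taken from the build or Helix log
    seen_at: datetime    # when the failure was observed


def match_known_issue(failure, known_issues):
    """Return the first known issue whose identifying text appears in the log, else None."""
    for issue in known_issues:
        if issue.error_message in failure.log:
            return issue
    return None


def all_failures_known(failures, known_issues):
    """True when every failure maps to a known issue (no failures also counts as True)."""
    return all(match_known_issue(f, known_issues) is not None for f in failures)


def hits_in_window(failures, issue, now, window=timedelta(days=7)):
    """Count failures matching `issue` seen within `window` before `now` (inclusive)."""
    return sum(
        1
        for f in failures
        if now - window <= f.seen_at <= now and issue.error_message in f.log
    )
```

In this sketch a PR whose failures all satisfy `all_failures_known` would show as safe to merge, and `hits_in_window` gives the per-issue count that a tracking-issue table could display next to the PR backlinks. A real system would likely need more robust matching than a substring test.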

Future Work

  • Adding crashdump and hang dump in Libraries tests
  • V1 & V2 test system: Enable crash collection on macOS (e.g., Singlefile, exception handling)
  • V2 test system: Hang dump collection and integrate symbolication from V1
    • 4.3 [Moved to Future Item] [Owner: Runtime] Ensure timeouts and hang dumps are properly handled in the new testing system, and that they are surfaced in a way Build Analysis can upload them.
  • Move from ASP.NET to dotnet/arcade (repo with all the shared infrastructure) for test-level retry
  • No crashdump and hang dump support for mono and wasm. They don't have a good crash mechanism yet. @SamMonoRT @lewing @BrzVlad

cc: @JulieLeeMSFT @tommcdon @markwilkie

cc: @AlitzelMendez @missymessa @ulisesh @ChadNedzlek

Labels: User Story, area-Infrastructure