Description
Is your feature request related to a problem?
This is a meta issue that tracks all work related to Root Cause Analysis. It will be updated with next steps based on feedback from the linked RFC.
What solution would you like?
Site reliability and developer operations teams troubleshoot outages that are often complex. These teams track mean time to detection (MTTD) and mean time to recovery (MTTR) as key performance indicators (KPIs). Most outages are resolved by a single development team, but some span many development teams and can take days to diagnose. Root cause analysis guides users to the subset of application metrics most likely involved in the problem, so they can review those first.
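For readers less familiar with the KPIs mentioned above, here is a minimal sketch of how MTTD and MTTR could be computed from incident timestamps. The incident records, the function names, and the choice to measure recovery from the start of the outage (rather than from detection) are all assumptions made for illustration; teams define these metrics in different ways.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (started, detected, recovered).
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 30), datetime(2024, 1, 5, 12, 0)),
]


def mean_minutes(deltas):
    """Average a list of timedeltas and return the result in minutes."""
    return mean(d.total_seconds() for d in deltas) / 60


def mttd_minutes(records):
    """Mean time to detection: outage start to detection."""
    return mean_minutes([detected - started for started, detected, _ in records])


def mttr_minutes(records):
    """Mean time to recovery, assumed here to run from outage start to recovery."""
    return mean_minutes([recovered - started for started, _, recovered in records])


print(f"MTTD: {mttd_minutes(incidents):.1f} min")
print(f"MTTR: {mttr_minutes(incidents):.1f} min")
```

Under these assumptions, shortening the diagnosis phase of multi-team outages would show up directly as a lower MTTR.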
What alternatives have you considered?
To be filled in as details are added.
Do you have any additional context?
TBD