Open
Opened on Aug 5, 2024
Description
@WordPress/openverse-maintainers need to develop a comprehensive on-call plan for service disruptions during unstaffed hours.
Requirements
- Schedule: notify "whoever is in working hours right now" first, then escalate to the individual designated on-call for the week. If no one is in working hours, notify the on-call person immediately.
- Scope: alert only on catastrophic-level outages, to avoid noise and undue disruption to on-call staff.
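The schedule requirement can be read as a two-tier notification order. The following is a minimal sketch of that reading only, not the Grafana On Call configuration. The `Maintainer` class, the handles, the fixed UTC offsets (which ignore daylight saving), and the 09:00 to 17:00 weekday working hours are all hypothetical assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone


@dataclass(frozen=True)
class Maintainer:
    handle: str
    utc_offset_hours: int  # assumption: fixed offset, ignores daylight saving
    work_start: time = time(9, 0)  # assumption: 09:00 local
    work_end: time = time(17, 0)  # assumption: 17:00 local

    def is_working(self, now_utc: datetime) -> bool:
        local = now_utc + timedelta(hours=self.utc_offset_hours)
        # Assumption: working hours fall on weekdays only (Mon=0 .. Fri=4).
        return local.weekday() < 5 and self.work_start <= local.time() < self.work_end


def notification_order(now_utc, maintainers, on_call_handle):
    """Return handles in the order they should be notified."""
    working = [m.handle for m in maintainers if m.is_working(now_utc)]
    if on_call_handle in working:
        # The on-call person is already reached in the first tier.
        return working
    # Otherwise the on-call person is the escalation target, or the only
    # target when nobody is in working hours.
    return working + [on_call_handle]


# Hypothetical team spread across three offsets.
TEAM = [
    Maintainer("maintainer_a", 0),
    Maintainer("maintainer_b", -7),
    Maintainer("maintainer_c", 10),
]

weekday_afternoon = datetime(2024, 8, 5, 15, 0, tzinfo=timezone.utc)
weekend = datetime(2024, 8, 10, 12, 0, tzinfo=timezone.utc)
```

With this hypothetical team, `notification_order(weekday_afternoon, TEAM, "maintainer_b")` notifies `maintainer_a` first and then `maintainer_b`, while the weekend timestamp goes straight to `maintainer_b`.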
On Call team
When critical alarms fire during unstaffed hours, we will rotate through pinging @zackkrida, @AetherUnbound, and @sarayourfriend.
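One way to picture a weekly rotation through the three people named above is a simple round-robin keyed on the number of whole weeks since a fixed Monday. This is an illustrative sketch, assuming a weekly cadence, the listed order, and a start date of 2024-08-05; none of these are confirmed by the issue, and the real schedule would live in the on-call tool.

```python
from datetime import date

# Assumption: rotation order follows the order the handles are listed in.
ROTATION = ["zackkrida", "AetherUnbound", "sarayourfriend"]

# Assumption: hypothetical rotation start, chosen because it is a Monday.
ROTATION_START = date(2024, 8, 5)


def on_call_for(day: date) -> str:
    """Return the handle on call for the week containing `day`."""
    if day < ROTATION_START:
        raise ValueError("day is before the start of the rotation")
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]
```

Counting weeks from a fixed date, rather than using the ISO week number, keeps the rotation even across year boundaries.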
Initial Alarms
- EC2 instance deprecation/decommission: Deprecation and decommission notifications for EC2 instances must be explicitly read and acknowledged by someone on-call, regardless of the instance type.
- Zero responses from public services: Uptime monitoring for the API and frontend to ensure immediate attention if there are no responses.
- Production Elasticsearch node numbers out of threshold: Monitoring the number of production Elasticsearch nodes to ensure they stay within the expected thresholds.
- RDS outage: Alerts for RDS outages are necessary for understanding the cause of potential public service issues.
- ElastiCache outage: Similar to RDS, monitoring ElastiCache outages for their potential impact on public services.
- Greater than X% of 5xx responses from public services: Monitoring for a high percentage of 5xx responses as an indicator of significant service issues.
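Two of the alarms above reduce to simple threshold checks. The sketch below shows the arithmetic only, assuming status counts and node counts have already been collected over some window. The function names, the input shape, and every threshold value are hypothetical; the issue leaves X and the Elasticsearch node thresholds unspecified.

```python
from typing import Mapping


def five_xx_percent(status_counts: Mapping[int, int]) -> float:
    """Percentage of responses in the window with a 5xx status code."""
    total = sum(status_counts.values())
    if total == 0:
        # No responses at all is a separate alarm (uptime), not a 5xx rate.
        return 0.0
    errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return 100.0 * errors / total


def five_xx_alarm(status_counts: Mapping[int, int], threshold_percent: float) -> bool:
    """True when the 5xx share is strictly greater than the threshold (the X above)."""
    return five_xx_percent(status_counts) > threshold_percent


def node_count_alarm(node_count: int, minimum: int, maximum: int) -> bool:
    """True when the node count falls outside the inclusive expected range."""
    return not (minimum <= node_count <= maximum)


# Hypothetical window: 100 responses, 10 of them 5xx.
window = {200: 90, 502: 6, 503: 4}
```

For this sample window the 5xx share is 10%, so a hypothetical threshold of 5% would fire and a threshold of 10% would not, because the comparison is strict.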
Documents
In lieu of a project proposal, see this project description for the rationale and scope of work.
- Possible Implementation Plans
- On Call schedule and configuration in Grafana On Call
- Alarm creation and revisions
Milestones/Issues
Prior Art
Status
⏸ On Hold