Openverse On Call Policies and Notifications #4719

Open

Description

@WordPress/openverse-maintainers need to develop a comprehensive on-call plan for service disruptions during unstaffed hours.

Requirements

  • Schedule: notify "whoever is in working hours right now" first, followed by the designated individual on call for the week. If no one is in working hours, notify the on-call person immediately.
  • Scope: alert only on catastrophic-level outages, to avoid noise and undue disruption to on-call staff.
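
The scheduling requirement above can be sketched as a small routing function. This is only an illustration under stated assumptions, not the planned Grafana OnCall configuration: the `Maintainer` class, the fixed UTC offsets, the 09:00 to 17:00 weekday working hours, and the placeholder handles are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta, timezone


@dataclass(frozen=True)
class Maintainer:
    handle: str
    utc_offset_hours: int           # assumption: fixed offset, ignores DST
    work_start: time = time(9, 0)   # assumption: 09:00 local start
    work_end: time = time(17, 0)    # assumption: 17:00 local end

    def is_working(self, now_utc: datetime) -> bool:
        """True if now_utc falls inside this person's weekday working hours."""
        local = now_utc + timedelta(hours=self.utc_offset_hours)
        return local.weekday() < 5 and self.work_start <= local.time() < self.work_end


def notification_order(now_utc: datetime, team: list[Maintainer], on_call: str) -> list[str]:
    """Handles to notify, in order: anyone currently working, then the weekly on-call."""
    order = [m.handle for m in team if m.is_working(now_utc)]
    if on_call not in order:
        # Nobody working (or on-call is off hours): on-call is the fallback.
        order.append(on_call)
    return order


# Hypothetical team spread across time zones.
team = [Maintainer("alice", -7), Maintainer("bob", 0), Maintainer("carol", 10)]
wednesday = datetime(2024, 8, 7, 15, 0, tzinfo=timezone.utc)
print(notification_order(wednesday, team, on_call="carol"))
```

When the returned list contains only the on-call handle, no one is in working hours and that person is notified immediately.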

On Call team

When critical alarms fire during unstaffed hours, we will rotate through pinging @zackkrida, @AetherUnbound, and @sarayourfriend.
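
One way to picture the rotation is a weekly cycle through the three handles. The sketch below assumes a weekly cadence (taken from the "on-call for the week" requirement), the listed order, and a hypothetical start date; none of these details are confirmed by this issue.

```python
from datetime import date

# Order taken from the sentence above; the actual rotation order is an assumption.
ROTATION = ["zackkrida", "AetherUnbound", "sarayourfriend"]

# Hypothetical anchor: a Monday on which the first person's week begins.
ROTATION_START = date(2024, 1, 1)


def on_call_for(day: date) -> str:
    """Return the handle on call for the week containing `day`."""
    weeks_elapsed = (day - ROTATION_START).days // 7
    return ROTATION[weeks_elapsed % len(ROTATION)]


print(on_call_for(date(2024, 1, 8)))  # second week of the rotation
```

Counting weeks from a fixed anchor date, rather than using the calendar week number, keeps the cycle even across year boundaries.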

Initial Alarms

  • EC2 instance deprecation/decommission: Notifications for EC2 instances must be explicitly read and acknowledged by someone on-call, regardless of the instance type.
  • Zero responses from public services: Uptime monitoring for the API and frontend to ensure immediate attention if there are no responses.
  • Production Elasticsearch node numbers out of threshold: Monitoring the number of production Elasticsearch nodes to ensure they stay within the expected thresholds.
  • RDS outage: Alerts for RDS outages are necessary for understanding the cause of potential public service issues.
  • ElastiCache outage: Similar to RDS, monitoring ElastiCache outages for their potential impact on public services.
  • Greater than X% of 5xx responses from public services: Monitoring for a high percentage of 5xx responses as an indicator of significant service issues.
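
The two threshold-based alarms in this list (5xx percentage and Elasticsearch node count) reduce to simple checks. The following is a minimal sketch with hypothetical function names and inputs; the value of X and the expected node range are not specified in this issue, so they are left as parameters.

```python
def error_rate_breached(status_counts: dict[int, int], threshold_pct: float) -> bool:
    """True if more than threshold_pct percent of responses were 5xx.

    status_counts maps HTTP status code to response count for one window.
    threshold_pct stands in for the unspecified "X%".
    """
    total = sum(status_counts.values())
    if total == 0:
        # No traffic at all is the job of the separate "zero responses" alarm.
        return False
    errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return 100 * errors / total > threshold_pct


def node_count_out_of_threshold(node_count: int, minimum: int, maximum: int) -> bool:
    """True if the number of Elasticsearch nodes is outside [minimum, maximum]."""
    return not (minimum <= node_count <= maximum)


# Illustrative values only: 10 of 100 responses are 5xx, checked against 5%.
print(error_rate_breached({200: 90, 503: 10}, threshold_pct=5.0))
print(node_count_out_of_threshold(2, minimum=3, maximum=3))
```

Checking an upper bound as well as a lower bound on node count reflects the "out of threshold" wording, which covers both too few and too many nodes.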

Documents

In lieu of a project proposal, see this project description for the rationale and scope of work.

  • Possible Implementation Plans
    • On Call schedule and configuration in Grafana On Call
    • Alarm creation and revisions

Milestones/Issues

Prior Art

Metadata

Labels

  • ♿️ aspect: a11y (Concerns related to the project's accessibility)
  • 🟧 priority: high (Stalls work on the project or its dependents)
  • 🤖 aspect: dx (Concerns developers' experience with the codebase)
  • 🧭 project: thread (An issue used to track a project and its progress)
  • 🧱 stack: infra (Related to the Terraform config and other infrastructure)

Projects

  • Status

    ⏸ On Hold
