Determine and expose cluster health #2029

Closed
@vishalbollu

Description

Add a new command and/or a new section to cortex cluster info that aggregates the health of Cortex processes.

A user might have the perception that everything is okay with the cluster when in fact a specific component is failing silently. An example would be prometheus not being deployed correctly, which prevents the autoscaler and grafana from working correctly.

Here are a few resources that can be scanned to determine overall cluster health.

  • verify that all of the critical Cortex pods are running
  • batch, task crons should be running as expected
  • operator
  • prometheus
  • grafana
  • autoscaler
  • cluster autoscaler
  • events in istio resources such as the service and loadbalancer
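
As a rough sketch of the first check, assuming pod data has already been fetched (for example from the Kubernetes API) and flattened into plain dictionaries, the aggregation could look something like this. The component names come from the list above; the `component`, `phase` and `ready` keys and the function name `component_health` are hypothetical and not part of Cortex.

```python
from collections import defaultdict

# Components taken from the list above; the exact set Cortex would check is an assumption.
CRITICAL_COMPONENTS = ["operator", "prometheus", "grafana", "autoscaler", "cluster autoscaler"]


def component_health(pods):
    """Return {component: bool}; True only if the component has at least
    one pod and every one of its pods is Running and ready."""
    by_component = defaultdict(list)
    for pod in pods:
        by_component[pod["component"]].append(pod)

    health = {}
    for name in CRITICAL_COMPONENTS:
        owned = by_component.get(name, [])
        # A component with no pods at all counts as unhealthy (the silent-failure case).
        health[name] = bool(owned) and all(
            p["phase"] == "Running" and p["ready"] for p in owned
        )
    return health
```

Treating a component with zero pods as unhealthy is a deliberate choice here: it would catch the case where prometheus was never deployed, rather than only the case where it is crashing.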

API autoscaler crons can be rolled into their respective API statuses.

One potential design can be:

cortex cluster status

# operator: live
# prometheus: live
# grafana: live
# autoscaler: live
# (...)
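
A minimal sketch of how output in that shape could be rendered, assuming health has already been reduced to one boolean per component. The `live` label is taken from the design above; the `not live` label for a failing component and the function name `render_cluster_status` are assumptions for illustration only.

```python
def render_cluster_status(health):
    """Render one "<component>: <state>" line per component, in insertion order."""
    lines = []
    for component, is_live in health.items():
        state = "live" if is_live else "not live"  # "not live" is an assumed label
        lines.append(f"{component}: {state}")
    return "\n".join(lines)


print(render_cluster_status({"operator": True, "prometheus": False}))
```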

Labels

research (Determine technical constraints), timecapped (Assigned a limited amount of time)
