Closed
Description
Add a new command and/or a new section to cortex cluster info that aggregates the health of Cortex processes.
A user might believe that everything is okay with the cluster when in fact a specific component is failing silently. An example would be prometheus not being deployed correctly, which prevents the autoscaler and grafana from working correctly.
Here are a few resources that can be scanned to determine overall cluster health:
- verify that all of the critical Cortex pods are running
- batch and task crons should be running as expected
- operator
- prometheus
- grafana
- autoscaler
- cluster autoscaler
- events in Istio resources such as the service and load balancer
API autoscaler crons can be rolled into their respective API statuses.
One potential design can be:
cortex cluster status
# operator: live
# prometheus: live
# grafana: live
# autoscaler: live
# (...)
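
To make the proposed output more concrete, here is a minimal sketch of the aggregation step in Python. It is not Cortex's implementation: the `PodState` type, the `summarize` and `format_status` functions, and the "down" and "missing" states are hypothetical names introduced for illustration, and the pod list is assumed to have been fetched from the Kubernetes API elsewhere. The rule assumed here is that a component is "live" only if it has at least one pod and every one of its pods is in the Running phase and ready.

```python
from dataclasses import dataclass
from typing import Dict, Iterable

# Component names taken from the list in this issue; the exact set is an assumption.
CRITICAL_COMPONENTS = [
    "operator",
    "prometheus",
    "grafana",
    "autoscaler",
    "cluster autoscaler",
]


@dataclass
class PodState:
    """Hypothetical, simplified view of a pod belonging to a Cortex component."""
    component: str
    phase: str   # Kubernetes pod phase: Pending, Running, Succeeded, Failed, Unknown
    ready: bool  # whether the pod reports itself as ready


def summarize(pods: Iterable[PodState]) -> Dict[str, str]:
    """Map each critical component to "live", "down", or "missing"."""
    # A component with no pods at all stays "missing".
    status = {name: "missing" for name in CRITICAL_COMPONENTS}
    for pod in pods:
        if pod.component not in status:
            continue  # ignore pods that are not part of the critical set
        if pod.phase != "Running" or not pod.ready:
            # Any unhealthy pod marks the whole component as down.
            status[pod.component] = "down"
        elif status[pod.component] == "missing":
            # First healthy pod seen; a later unhealthy pod can still override this.
            status[pod.component] = "live"
    return status


def format_status(status: Dict[str, str]) -> str:
    """Render one "component: state" line per component, in the listed order."""
    return "\n".join(f"{name}: {status[name]}" for name in CRITICAL_COMPONENTS)


if __name__ == "__main__":
    example = [
        PodState("operator", "Running", True),
        PodState("prometheus", "Pending", False),
        PodState("grafana", "Running", True),
    ]
    print(format_status(summarize(example)))
```

Reporting a component with no pods as "missing" rather than "down" is a design choice in this sketch: it separates a deployment that never happened from one that is failing, which matches the prometheus example in the description.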