Improve Grafana dashboard's resilience! #3538

balajialg · 2022-08-09T01:33:25Z

Bug description

The Grafana dashboard returned a 503 error during today's outage across all the hubs. Since @felder uses Grafana hub metrics to measure health across all hubs, it is important that Grafana is isolated from such outages.

Note: Grafana came back once all the hubs were running again, so there is no issue with Grafana as of now.

Creating this issue to explore different design choices to avoid this scenario in the future.

Environment & setup

Grafana + Prometheus combination

How to reproduce

See the above description!

balajialg · 2022-11-22T02:27:10Z

One of the discussion items from a short meeting today was how to improve Grafana's resilience. @shaneknapp suggested creating a new monitoring node pool and moving our monitoring infrastructure, which includes Grafana and Prometheus, to it. With this approach, the monitoring infra keeps working when the hubs on the core node pool go down.
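A rough sketch of what pinning Grafana and Prometheus to a separate node pool could look like, expressed as Helm value overrides generated from Python. This is not the configuration from this repo: the pool name `monitoring`, the `support_values` helper, and the nesting of the `grafana` and `prometheus` keys are assumptions for illustration, and the node label assumes a GKE cluster, where nodes carry their pool name under `cloud.google.com/gke-nodepool`.

```python
import json

# Assumption: the cluster runs on GKE, which labels each node with its pool name.
NODE_POOL_LABEL = "cloud.google.com/gke-nodepool"


def pin_to_pool(pool_name: str) -> dict:
    """Return a values fragment that schedules pods only on `pool_name`."""
    return {"nodeSelector": {NODE_POOL_LABEL: pool_name}}


def support_values(pool_name: str) -> dict:
    """Build value overrides for a hypothetical chart bundling Grafana and Prometheus.

    The key layout (grafana.nodeSelector, prometheus.server.nodeSelector)
    is an assumption about how the subcharts are nested.
    """
    return {
        "grafana": pin_to_pool(pool_name),
        "prometheus": {"server": pin_to_pool(pool_name)},
    }


if __name__ == "__main__":
    # Print the overrides so they can be reviewed before being applied anywhere.
    print(json.dumps(support_values("monitoring"), indent=2))
```

A node selector only keeps the monitoring pods on that pool. Keeping hub pods off it would additionally need a taint on the pool and matching tolerations on the monitoring pods.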

yuvipanda · 2022-11-22T02:52:22Z

You should consider https://cloud.google.com/managed-prometheus too

balajialg · 2023-02-01T01:24:17Z

Closing this issue as it is fixed by #3995.

balajialg added the bug label Aug 9, 2022

balajialg assigned felder and balajialg and unassigned balajialg Aug 9, 2022

balajialg assigned shaneknapp Nov 22, 2022

yuvipanda mentioned this issue Dec 8, 2022

Move all support stuff except nginx to support pool #3995

Merged

balajialg closed this as completed Feb 1, 2023
