Improve Grafana dashboard's resilience! #3538

Closed
balajialg opened this issue Aug 9, 2022 · 3 comments

Comments

@balajialg
Contributor

balajialg commented Aug 9, 2022

Bug description

The Grafana dashboard returned a 503 error today during an outage that affected all the hubs. Since @felder relies on Grafana hub metrics to monitor the health of every hub, Grafana needs to be isolated from such outages.

Note: Grafana recovered once all the hubs were back up, so there is no ongoing issue with Grafana.

image (8)

Creating this issue to explore different design choices to avoid this scenario in the future.

Environment & setup

Grafana + Prometheus combination

How to reproduce

See the above description!

@balajialg
Contributor Author

balajialg commented Nov 22, 2022

One of the items we discussed in a short meeting today was how to improve Grafana's resilience. @shaneknapp suggested creating a new monitoring node pool and moving our monitoring infrastructure, which includes Grafana and Prometheus, to it. With this approach, the monitoring infrastructure would keep working even when the hubs that are part of the core node pool go down.
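
A rough sketch of what pinning the monitoring pods to a dedicated pool could look like, written as a small Python helper that builds the scheduling section of Helm-style values. It assumes the cluster runs on GKE, where nodes carry the `cloud.google.com/gke-nodepool` label. The pool name `monitoring`, the taint key `dedicated`, and the `grafana` component key are hypothetical and would have to match how the pool and charts are actually set up.

```python
from typing import Any, Dict

# Label that GKE sets on every node with the name of its node pool.
GKE_NODEPOOL_LABEL = "cloud.google.com/gke-nodepool"


def monitoring_scheduling(pool_name: str = "monitoring",
                          taint_key: str = "dedicated") -> Dict[str, Any]:
    """Build pod scheduling constraints for a dedicated node pool.

    pool_name and taint_key are hypothetical; they must match however
    the node pool is actually created and tainted.
    """
    return {
        # Only schedule onto nodes that belong to the monitoring pool.
        "nodeSelector": {GKE_NODEPOOL_LABEL: pool_name},
        # Tolerate the taint that keeps hub and user pods off the pool.
        "tolerations": [
            {
                "key": taint_key,
                "operator": "Equal",
                "value": pool_name,
                "effect": "NoSchedule",
            }
        ],
    }


def apply_to_values(values: Dict[str, Any], component: str,
                    scheduling: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of Helm-style values with scheduling merged into one component."""
    merged = dict(values)
    merged[component] = {**values.get(component, {}), **scheduling}
    return merged


if __name__ == "__main__":
    # "grafana" is a hypothetical top-level key in the values file.
    values = apply_to_values({"grafana": {"replicas": 1}}, "grafana",
                             monitoring_scheduling())
    print(values)
```

The node selector keeps Grafana and Prometheus on the monitoring pool, while the taint and toleration pair keeps hub workloads off it, so pressure on the core pool should not evict the monitoring pods.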

@yuvipanda
Contributor

You should consider https://cloud.google.com/managed-prometheus too

@balajialg
Contributor Author

balajialg commented Feb 1, 2023

Closing this issue as it is fixed by #3995.


4 participants