Health check for too many metrics #2446

Closed
corverroos opened this issue Jul 18, 2023 · 2 comments
Assignees
Labels
protocol Protocol Team tickets V1

Comments

@corverroos
Contributor

corverroos commented Jul 18, 2023

🎯 Problem to be solved

We have had a problem with Prometheus metrics using too many labels (high cardinality), which results in a memory leak in charon and overloads the Prometheus infrastructure.

We need to monitor and alert when the number of metric labels grows too large.

🛠️ Proposed solution

Two step solution:

  • Add a warning health-check that alerts if there are more than X*validators metrics in a metrics family. Suggest using X=100 as a start.
  • Add a central monitoring PagerDuty alert (similar to the high memory usage alert) for this check.
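
The first step can be sketched roughly as follows. This is a minimal illustration, not charon's implementation: `highCardinality`, `labelsPerValidator`, and the `seriesByFamily` map (metric family name to number of series) are hypothetical names, and in practice the counts would come from the metrics registry rather than a plain map.

```go
package main

import "sort"

// labelsPerValidator is the X factor from the proposal (X=100 as a start).
const labelsPerValidator = 100

// highCardinality returns the names of the metric families whose series
// count exceeds labelsPerValidator*numValidators, sorted by name.
// seriesByFamily is a hypothetical stand-in for counts gathered from a
// metrics registry.
func highCardinality(seriesByFamily map[string]int, numValidators int) []string {
	threshold := labelsPerValidator * numValidators

	var offenders []string
	for name, count := range seriesByFamily {
		if count > threshold {
			offenders = append(offenders, name)
		}
	}

	// Map iteration order is random, so sort for a stable result.
	sort.Strings(offenders)

	return offenders
}
```

A warning health check would then fire whenever the returned slice is non-empty, for example when calling `highCardinality(counts, numValidators)` on each health evaluation.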
@github-actions github-actions bot added the protocol Protocol Team tickets label Jul 18, 2023
@boulder225 boulder225 added the V1 label Mar 1, 2024
@pinebit
Contributor

pinebit commented Mar 14, 2024

Please add your planning poker estimate with Zenhub @gsora

@pinebit
Contributor

pinebit commented Mar 14, 2024

Please add your planning poker estimate with Zenhub @KaloyanTanev

@pinebit pinebit self-assigned this Mar 14, 2024
obol-bulldozer bot pushed a commit that referenced this issue Mar 16, 2024
This monitors high metric cardinality, where a given metric gets too many distinct labels, resulting in high memory consumption in charon.
To this end, this PR introduces one new metric: `app_health_metrics_high_cardinality` (gauge), which reports metric name => max label count for each metric that exceeded the threshold (100 * num_of_validators).
In addition to the above metric, which can be used in alerting, this also adds a new "health check" named `metrics_high_cardinality`, which is triggered when `app_health_metrics_high_cardinality` reports any offense.

Note: `app_health_metrics_high_cardinality` will not "reset", because the purpose of this feature is to detect and signal a possible memory leak.

category: feature
ticket: #2446
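
The "no reset" behaviour described in the commit message could look something like the sketch below. It is an assumption-laden illustration only: `cardinalityTracker` and its methods are hypothetical names, the real PR exposes the values through a Prometheus gauge rather than a map, and the factor of 100 is taken from the threshold stated above.

```go
package main

import "sync"

// cardinalityTracker keeps, per metric name, the highest label count seen
// above the threshold. Entries are never removed, mirroring the "no reset"
// note in the commit message. All names here are hypothetical.
type cardinalityTracker struct {
	mu        sync.Mutex
	threshold int
	offenses  map[string]int
}

// newCardinalityTracker uses the threshold 100 * numValidators.
func newCardinalityTracker(numValidators int) *cardinalityTracker {
	return &cardinalityTracker{
		threshold: 100 * numValidators,
		offenses:  make(map[string]int),
	}
}

// observe records count for name only if it exceeds the threshold and is
// larger than any previously recorded value for that name.
func (t *cardinalityTracker) observe(name string, count int) {
	t.mu.Lock()
	defer t.mu.Unlock()

	if count > t.threshold && count > t.offenses[name] {
		t.offenses[name] = count
	}
}

// maxLabels returns the recorded maximum for name, or 0 if it never
// exceeded the threshold.
func (t *cardinalityTracker) maxLabels(name string) int {
	t.mu.Lock()
	defer t.mu.Unlock()

	return t.offenses[name]
}

// failing reports whether any metric has ever exceeded the threshold.
func (t *cardinalityTracker) failing() bool {
	t.mu.Lock()
	defer t.mu.Unlock()

	return len(t.offenses) > 0
}
```

Because `observe` only ever raises a recorded value, a later drop in label count does not clear the offense, so `failing` stays true once the threshold has been crossed.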
@pinebit pinebit closed this as completed Mar 16, 2024