Add metrics to properly monitor validation limits #1433

Closed
@weeco

Use case

When running Cortex for multiple tenants (e.g. multiple departments within a larger company), one still wants to set reasonable validation limits to keep the Cortex cluster healthy.

The problem

Because all limits multiply with the number of distributor/ingester replicas, it is not easy to understand the effective validation limits. Overriding the limits for specific users makes this even more complex.
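To illustrate the arithmetic, here is a minimal sketch in Go. It assumes, as described above, that each replica enforces its limits independently, so the effective limit is the per-replica limit times the replica count. The `Limits` type, the `effectiveLimits` helper, and all numbers are hypothetical and not taken from the Cortex code base.

```go
package main

import "fmt"

// Limits holds per-replica validation limits.
// Hypothetical, simplified type for illustration only.
type Limits struct {
	IngestionRate float64 // samples per second, enforced by each replica
	BurstSize     int     // samples, enforced by each replica
}

// effectiveLimits returns the cluster-wide limits for one user, assuming
// every replica enforces the per-replica limit independently.
// A per-user override, if present, replaces the defaults entirely.
func effectiveLimits(defaults Limits, overrides map[string]Limits, user string, replicas int) Limits {
	l := defaults
	if o, ok := overrides[user]; ok {
		l = o
	}
	return Limits{
		IngestionRate: l.IngestionRate * float64(replicas),
		BurstSize:     l.BurstSize * replicas,
	}
}

func main() {
	// Illustrative values, not Cortex defaults.
	defaults := Limits{IngestionRate: 25000, BurstSize: 50000}
	overrides := map[string]Limits{
		"team-a": {IngestionRate: 100000, BurstSize: 200000},
	}
	for _, user := range []string{"team-a", "team-b"} {
		l := effectiveLimits(defaults, overrides, user, 3)
		fmt.Printf("%s: rate=%.0f burst=%d\n", user, l.IngestionRate, l.BurstSize)
	}
}
```

With three replicas, the overridden user ends up with an effective rate of 300000 and the default user with 75000, which is exactly the kind of number that is hard to see from the configuration alone.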

It would be very nice if I could monitor all validation limits (and their usage) per user, so that I know which users are close to their validation limits. This way I can reach out to those users and ask whether they need higher limits before they run into rejections, which would cause lost metrics.

Proposed solution

We could add Prometheus metrics for all limits, which we can easily sum in a Grafana dashboard. This way we know the exact validation limits (e.g. total ingestion rate and burst size per user). I assume this might be a problem for those who run Cortex for hundreds of users, as each user means a new metric series.
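As a rough sketch of what each replica could expose, the following Go snippet renders one gauge sample per user in the Prometheus text exposition format, using only the standard library. The metric name `cortex_limits_ingestion_rate`, the `user` label, and the `renderLimitMetrics` helper are assumptions made for illustration; they are not existing Cortex metrics or APIs. A real implementation would more likely use the Prometheus Go client library.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// renderLimitMetrics renders one gauge sample per user in the Prometheus
// text exposition format. Label values are quoted with %q, which is a
// simplification that is adequate for plain ASCII user IDs.
func renderLimitMetrics(name, help string, perUser map[string]float64) string {
	users := make([]string, 0, len(perUser))
	for u := range perUser {
		users = append(users, u)
	}
	sort.Strings(users) // map iteration order is random; sort for stable output

	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", name, help)
	fmt.Fprintf(&b, "# TYPE %s gauge\n", name)
	for _, u := range users {
		v := strconv.FormatFloat(perUser[u], 'f', -1, 64)
		fmt.Fprintf(&b, "%s{user=%q} %s\n", name, u, v)
	}
	return b.String()
}

func main() {
	// Hypothetical per-replica limits for two users.
	limits := map[string]float64{"team-a": 100000, "team-b": 25000}
	fmt.Print(renderLimitMetrics(
		"cortex_limits_ingestion_rate", // hypothetical metric name
		"Per-replica ingestion rate limit for a user.",
		limits,
	))
}
```

If every replica exported such a series, a dashboard query along the lines of `sum by (user) (cortex_limits_ingestion_rate)` would give the effective limit per user. The cardinality concern above applies directly: this adds one series per user, per limit, per replica.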

Metadata

Labels

help wanted
type/observability: To help know what is going on inside Cortex
type/production: Issues related to the production use of Cortex, inc. configuration, alerting and operating.
