feat: Emit per-tenant limit overrides as metrics #3785

Merged
merged 4 commits into from
Feb 9, 2021
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -9,6 +9,7 @@
- `-<prefix>.s3.sse.kms-key-id`
- `-<prefix>.s3.sse.kms-encryption-context`
* [FEATURE] Querier: Enable `@ <timestamp>` modifier in PromQL using the new `-querier.at-modifier-enabled` flag. #3744
* [FEATURE] Overrides Exporter: Add `overrides-exporter` module for exposing per-tenant resource limit overrides as metrics. It is not included in the `all` target and must be explicitly enabled. #3785
* [ENHANCEMENT] Ruler: Add TLS and explicit basic authentication configuration options for the HTTP client the ruler uses to communicate with the alertmanager. #3752
* `-ruler.alertmanager-client.basic-auth-username`: Configure the basic authentication username used by the client. Takes precedence over a URL configured username.
* `-ruler.alertmanager-client.basic-auth-password`: Configure the basic authentication password used by the client. Takes precedence over a URL configured password.
1 change: 1 addition & 0 deletions Makefile
@@ -222,6 +222,7 @@ doc: clean-doc
go run ./tools/doc-generator ./docs/blocks-storage/querier.template > ./docs/blocks-storage/querier.md
embedmd -w docs/operations/requests-mirroring-to-secondary-cluster.md
embedmd -w docs/configuration/single-process-config.md
embedmd -w docs/guides/overrides-exporter.md

endif

12 changes: 12 additions & 0 deletions docs/guides/overrides-exporter-runtime.yaml
@@ -0,0 +1,12 @@
# file: runtime.yaml
# In this example, we're overriding ingestion limits for a single tenant.
overrides:
  "user1":
    ingestion_burst_size: 350000
    ingestion_rate: 350000
    max_global_series_per_metric: 300000
    max_global_series_per_user: 300000
    max_series_per_metric: 0
    max_series_per_user: 0
    max_samples_per_query: 100000
    max_series_per_query: 100000
66 changes: 66 additions & 0 deletions docs/guides/overrides-exporter.md
@@ -0,0 +1,66 @@
---
title: "Overrides Exporter"
linkTitle: "Overrides Exporter"
weight: 10
slug: overrides-exporter
---

Since Cortex is a multi-tenant system, it supports applying limits to each tenant to prevent
any single one from using too many resources. In order to help operators understand how close
to their limits tenants are, the `overrides-exporter` module can expose limits as Prometheus metrics.

## Context

To update configuration without restarting, Cortex allows operators to supply a `runtime_config`
file that will be periodically reloaded. This file can be specified under the `runtime_config` section
of the main [configuration file](../configuration/arguments.md#runtime-configuration-file) or using the `-runtime-config.file`
command line flag. This file is used to apply tenant-specific limits.

## Example

The `overrides-exporter` is not enabled by default; it must be explicitly enabled. We recommend
running only a single instance of it in your cluster because of the cardinality of the metrics
emitted.

Given the following `runtime.yaml` file:

[embedmd]:# (./overrides-exporter-runtime.yaml)
```yaml
# file: runtime.yaml
# In this example, we're overriding ingestion limits for a single tenant.
overrides:
  "user1":
    ingestion_burst_size: 350000
    ingestion_rate: 350000
    max_global_series_per_metric: 300000
    max_global_series_per_user: 300000
    max_series_per_metric: 0
    max_series_per_user: 0
    max_samples_per_query: 100000
    max_series_per_query: 100000
```

The `overrides-exporter` can be started as follows:

```
cortex -target overrides-exporter -runtime-config.file runtime.yaml -server.http-listen-port=8080
```

After the `overrides-exporter` starts, you can use `curl` to inspect the tenant overrides:

```text
curl -s http://localhost:8080/metrics | grep cortex_overrides
# HELP cortex_overrides Resource limit overrides applied to tenants
# TYPE cortex_overrides gauge
cortex_overrides{limit_name="ingestion_burst_size",user="user1"} 350000
cortex_overrides{limit_name="ingestion_rate",user="user1"} 350000
cortex_overrides{limit_name="max_global_series_per_metric",user="user1"} 300000
cortex_overrides{limit_name="max_global_series_per_user",user="user1"} 300000
cortex_overrides{limit_name="max_local_series_per_metric",user="user1"} 0
cortex_overrides{limit_name="max_local_series_per_user",user="user1"} 0
cortex_overrides{limit_name="max_samples_per_query",user="user1"} 100000
cortex_overrides{limit_name="max_series_per_query",user="user1"} 100000
```

With these metrics, you can set up alerts to know when tenants are close to hitting their limits
before they exceed them.
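
In practice such alerts would normally be written as Prometheus alerting rules. The Go sketch below only illustrates the underlying comparison: it parses a line in the exact shape shown in the `curl` output above and checks whether a usage figure has reached a fraction of the limit. The names `parseOverride` and `nearLimit`, the 0.8 threshold, and the treatment of `0` as "no limit" are assumptions made for this example, not part of the change.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// override is one parsed cortex_overrides sample (hypothetical helper type).
type override struct {
	LimitName string
	User      string
	Value     float64
}

// sampleRE matches only the line shape shown in the curl output above.
// It is not a general parser for the Prometheus text exposition format.
var sampleRE = regexp.MustCompile(`^cortex_overrides\{limit_name="([^"]+)",user="([^"]+)"\} (\S+)$`)

// parseOverride extracts the labels and value from a single metrics line.
// It returns false for comments and for lines that do not match.
func parseOverride(line string) (override, bool) {
	m := sampleRE.FindStringSubmatch(line)
	if m == nil {
		return override{}, false
	}
	v, err := strconv.ParseFloat(m[3], 64)
	if err != nil {
		return override{}, false
	}
	return override{LimitName: m[1], User: m[2], Value: v}, true
}

// nearLimit reports whether usage has reached the given fraction of limit.
// Assumption of this sketch: a limit of 0 (or less) means "no limit".
func nearLimit(usage, limit, threshold float64) bool {
	if limit <= 0 {
		return false
	}
	return usage/limit >= threshold
}

func main() {
	line := `cortex_overrides{limit_name="max_global_series_per_user",user="user1"} 300000`
	o, ok := parseOverride(line)
	if !ok {
		fmt.Println("line did not match")
		return
	}
	// 270000 active series is a made-up usage figure for illustration.
	fmt.Println(o.User, o.LimitName, nearLimit(270000, o.Value, 0.8))
}
```

The usage figure would come from a separate metric; only the limit side is provided by `cortex_overrides`.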
20 changes: 20 additions & 0 deletions pkg/cortex/modules.go
@@ -52,6 +52,7 @@ const (
Ring string = "ring"
RuntimeConfig string = "runtime-config"
Overrides string = "overrides"
OverridesExporter string = "overrides-exporter"
Server string = "server"
Distributor string = "distributor"
DistributorService string = "distributor-service"
@@ -170,6 +171,23 @@ func (t *Cortex) initOverrides() (serv services.Service, err error) {
return nil, err
}

func (t *Cortex) initOverridesExporter() (services.Service, error) {
	supplier := tenantLimitsRuntimeConfigFunc(t.RuntimeConfig)
	if t.Cfg.isModuleEnabled(OverridesExporter) && supplier == nil {
		// This target isn't enabled by default ("all") and requires runtime configuration
		// to work. Fail if it can't be set up correctly since the user explicitly wanted this
		// target to run.
		return nil, errors.New("overrides-exporter has been enabled, but no runtime configuration file was configured")
	}

	exporter := validation.NewOverridesExporter(supplier)
	prometheus.MustRegister(exporter)

	// The overrides exporter has no state and reads overrides from runtime configuration each time it
	// is collected, so there is no need to return any service.
	return nil, nil
}

func (t *Cortex) initDistributorService() (serv services.Service, err error) {
t.Cfg.Distributor.DistributorRing.ListenPort = t.Cfg.Server.GRPCListenPort
t.Cfg.Distributor.ShuffleShardingLookbackPeriod = t.Cfg.Querier.ShuffleShardingIngestersLookbackPeriod
@@ -791,6 +809,7 @@ func (t *Cortex) setupModuleManager() error {
mm.RegisterModule(MemberlistKV, t.initMemberlistKV, modules.UserInvisibleModule)
mm.RegisterModule(Ring, t.initRing, modules.UserInvisibleModule)
mm.RegisterModule(Overrides, t.initOverrides, modules.UserInvisibleModule)
mm.RegisterModule(OverridesExporter, t.initOverridesExporter)
mm.RegisterModule(Distributor, t.initDistributor)
mm.RegisterModule(DistributorService, t.initDistributorService, modules.UserInvisibleModule)
mm.RegisterModule(Store, t.initChunkStore, modules.UserInvisibleModule)
@@ -824,6 +843,7 @@ func (t *Cortex) setupModuleManager() error {
RuntimeConfig: {API},
Ring: {API, RuntimeConfig, MemberlistKV},
Overrides: {RuntimeConfig},
OverridesExporter: {RuntimeConfig},
Distributor: {DistributorService, API},
DistributorService: {Ring, Overrides},
Store: {Overrides, DeleteRequestsStore},
25 changes: 21 additions & 4 deletions pkg/cortex/runtime_config.go
@@ -45,17 +45,34 @@ func loadRuntimeConfig(r io.Reader) (interface{}, error) {
 	return overrides, nil
 }

+// tenantLimitsFromRuntimeConfig returns a function that translates tenant IDs to
+// specific limits for that tenant if configured, nil otherwise
 func tenantLimitsFromRuntimeConfig(c *runtimeconfig.Manager) validation.TenantLimits {
 	if c == nil {
 		return nil
 	}
+
+	supplier := tenantLimitsRuntimeConfigFunc(c)
 	return func(userID string) *validation.Limits {
-		cfg, ok := c.GetConfig().(*runtimeConfigValues)
-		if !ok || cfg == nil {
-			return nil
+		tenantLimits := supplier()
+		return tenantLimits[userID]
+	}
+}
+
+// tenantLimitsRuntimeConfigFunc returns a function that returns a mapping of all
+// tenant specific overrides as it is updated via runtime configuration
+func tenantLimitsRuntimeConfigFunc(manager *runtimeconfig.Manager) func() map[string]*validation.Limits {
+	if manager == nil {
+		return nil
+	}
+
+	return func() map[string]*validation.Limits {
+		val := manager.GetConfig()
+		if cfg, ok := val.(*runtimeConfigValues); ok && cfg != nil {
+			return cfg.TenantLimits
 		}

-		return cfg.TenantLimits[userID]
+		return nil
 	}
 }
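
The refactoring in this hunk derives the per-tenant lookup from a function that returns all overrides at once. A minimal, dependency-free sketch of that closure pattern follows. The `Limits` type here is a one-field stand-in for `validation.Limits`, and `perTenant` is a hypothetical name chosen for this example; neither is code from the change.

```go
package main

import "fmt"

// Limits is a stand-in for validation.Limits with a single field.
type Limits struct {
	IngestionRate float64
}

// perTenant derives a per-tenant lookup from a supplier of all overrides.
// It mirrors the shape of the refactored code: the supplier is called on
// every lookup, so reloaded configuration is picked up without restarting.
func perTenant(supplier func() map[string]*Limits) func(userID string) *Limits {
	if supplier == nil {
		return nil
	}
	return func(userID string) *Limits {
		// Indexing a nil map is safe in Go and yields the zero value (nil here).
		return supplier()[userID]
	}
}

func main() {
	current := map[string]*Limits{"user1": {IngestionRate: 350000}}
	lookup := perTenant(func() map[string]*Limits { return current })

	fmt.Println(lookup("user1") != nil) // true
	fmt.Println(lookup("user2") == nil) // true

	// Simulate a runtime configuration reload: the next lookup sees the new map.
	current = map[string]*Limits{"user2": {IngestionRate: 1000}}
	fmt.Println(lookup("user1") == nil) // true
}
```

Keeping the "all tenants" supplier as the primitive lets both the per-tenant lookup and an exporter that iterates every tenant share one source of truth.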

44 changes: 44 additions & 0 deletions pkg/util/validation/exporter.go
@@ -0,0 +1,44 @@
package validation

import (
	"github.com/prometheus/client_golang/prometheus"
)

// OverridesExporter exposes per-tenant resource limit overrides as Prometheus metrics
type OverridesExporter struct {
	limitSupplier func() map[string]*Limits
	description   *prometheus.Desc
}

// NewOverridesExporter creates an OverridesExporter that reads updates to per-tenant
// limits using the provided function.
func NewOverridesExporter(limitSupplier func() map[string]*Limits) *OverridesExporter {
	return &OverridesExporter{
		limitSupplier: limitSupplier,
		description: prometheus.NewDesc(
			"cortex_overrides",
			"Resource limit overrides applied to tenants",
			[]string{"limit_name", "user"},
			nil,
		),
	}
}

func (oe *OverridesExporter) Describe(ch chan<- *prometheus.Desc) {
	ch <- oe.description
}

func (oe *OverridesExporter) Collect(ch chan<- prometheus.Metric) {
	allLimits := oe.limitSupplier()
	for tenant, limits := range allLimits {
		ch <- prometheus.MustNewConstMetric(oe.description, prometheus.GaugeValue, limits.IngestionRate, "ingestion_rate", tenant)
		ch <- prometheus.MustNewConstMetric(oe.description, prometheus.GaugeValue, float64(limits.IngestionBurstSize), "ingestion_burst_size", tenant)

		ch <- prometheus.MustNewConstMetric(oe.description, prometheus.GaugeValue, float64(limits.MaxSeriesPerQuery), "max_series_per_query", tenant)
		ch <- prometheus.MustNewConstMetric(oe.description, prometheus.GaugeValue, float64(limits.MaxSamplesPerQuery), "max_samples_per_query", tenant)
		ch <- prometheus.MustNewConstMetric(oe.description, prometheus.GaugeValue, float64(limits.MaxLocalSeriesPerUser), "max_local_series_per_user", tenant)
		ch <- prometheus.MustNewConstMetric(oe.description, prometheus.GaugeValue, float64(limits.MaxLocalSeriesPerMetric), "max_local_series_per_metric", tenant)
		ch <- prometheus.MustNewConstMetric(oe.description, prometheus.GaugeValue, float64(limits.MaxGlobalSeriesPerUser), "max_global_series_per_user", tenant)
		ch <- prometheus.MustNewConstMetric(oe.description, prometheus.GaugeValue, float64(limits.MaxGlobalSeriesPerMetric), "max_global_series_per_metric", tenant)
	}
}
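
`Collect` turns each tenant's limits into one sample per limit, labelled with `limit_name` and `user`. The sketch below shows that flattening step using only the standard library, so it can be run without the Prometheus client. The reduced `Limits` struct, the `sample` type, and `flatten` are hypothetical names for this illustration. The `nil` guard and the sorting are additions made in this sketch and are not present in the `Collect` method above.

```go
package main

import (
	"fmt"
	"sort"
)

// Limits is a reduced stand-in for validation.Limits.
type Limits struct {
	IngestionRate      float64
	IngestionBurstSize int
	MaxSeriesPerQuery  int
}

// sample is one labelled value, analogous to a single cortex_overrides series.
type sample struct {
	LimitName string
	User      string
	Value     float64
}

// flatten converts the per-tenant map into a sorted list of samples.
func flatten(all map[string]*Limits) []sample {
	var out []sample
	for tenant, l := range all {
		if l == nil {
			continue // guard added in this sketch only
		}
		out = append(out,
			sample{"ingestion_rate", tenant, l.IngestionRate},
			sample{"ingestion_burst_size", tenant, float64(l.IngestionBurstSize)},
			sample{"max_series_per_query", tenant, float64(l.MaxSeriesPerQuery)},
		)
	}
	// Go map iteration order is unspecified, so sort for stable output.
	sort.Slice(out, func(i, j int) bool {
		if out[i].User != out[j].User {
			return out[i].User < out[j].User
		}
		return out[i].LimitName < out[j].LimitName
	})
	return out
}

func main() {
	all := map[string]*Limits{
		"user1": {IngestionRate: 350000, IngestionBurstSize: 350000, MaxSeriesPerQuery: 100000},
	}
	for _, s := range flatten(all) {
		fmt.Printf("cortex_overrides{limit_name=%q,user=%q} %.0f\n", s.LimitName, s.User, s.Value)
	}
}
```

Because every limit becomes its own series per tenant, the number of series grows as tenants times limits, which is consistent with the guide's advice to run a single instance.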
30 changes: 30 additions & 0 deletions pkg/util/validation/exporter_test.go
@@ -0,0 +1,30 @@
package validation

import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
	"github.com/stretchr/testify/assert"
)

func TestOverridesExporter_noConfig(t *testing.T) {
	limitSupplier := func() map[string]*Limits { return nil }
	exporter := NewOverridesExporter(limitSupplier)

	// With no updated override configurations, there should be no override metrics
	count := testutil.CollectAndCount(exporter, "cortex_overrides")
	assert.Equal(t, 0, count)
}

func TestOverridesExporter_withConfig(t *testing.T) {
	limitSupplier := func() map[string]*Limits {
		cfg := make(map[string]*Limits)
		cfg["user1"] = &Limits{}
		return cfg
	}
	exporter := NewOverridesExporter(limitSupplier)

	// There should be at least a few metrics generated by receiving an override configuration map
	count := testutil.CollectAndCount(exporter, "cortex_overrides")
	assert.Greater(t, count, 0)
}