[Mixins] Option for a custom cluster label #1651

Merged on Apr 13, 2022 (10 commits)
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -49,6 +49,7 @@
- `MimirContinuousTestNotRunningOnWrites`
- `MimirContinuousTestNotRunningOnReads`
- `MimirContinuousTestFailed`
* [ENHANCEMENT] Added `per_cluster_label` support, which allows changing the label name used to differentiate between Kubernetes clusters. #1651
* [BUGFIX] Dashboards: Fix "Failed evaluation rate" panel on Tenants dashboard. #1629

### Jsonnet
@@ -13,7 +13,7 @@ The following table shows the required label names and whether they can be custo

| Label name | Configurable | Description |
| :---------- | :----------- | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `cluster` | No | The Kubernetes cluster or datacenter where the Mimir cluster is running. |
| `cluster` | Yes | The Kubernetes cluster or datacenter where the Mimir cluster is running. The cluster label can be configured with the `per_cluster_label` field in the mixin config. |
| `namespace` | No | The Kubernetes namespace where the Mimir cluster is running. |
| `job` | Partially | The Kubernetes namespace and Mimir component in the format `<namespace>/<component>`. When running in monolithic mode, the `<component>` should be `mimir`. When running in microservices mode, the `<component>` should be the name of the specific Mimir component (singular), like `distributor`, `ingester` or `store-gateway`. The label name can't be configured, while the regular expressions used to match components can be configured with the `job_names` field in the mixin config. |
| `pod` | Yes | The unique identifier of a Mimir replica (eg. Pod ID when running on Kubernetes). The label name can be configured with the `per_instance_label` field in the mixin config. |
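
As a minimal sketch of how the `per_cluster_label` field described in the table might be overridden: the snippet below assumes the mixin is consumed through an entry point importable as `mixin.libsonnet` and that the metrics carry a label named `k8s_cluster`. Both names are illustrative assumptions, not something this change prescribes.

```jsonnet
// Sketch only: the import path and the 'k8s_cluster' label name are assumptions.
(import 'mixin.libsonnet') + {
  // '+::' merges into the existing hidden _config object instead of replacing it.
  _config+:: {
    // Label that distinguishes Kubernetes clusters in the scraped metrics.
    per_cluster_label: 'k8s_cluster',
  },
}
```

Recording rules, alerts and dashboards generated from this object would then reference `k8s_cluster` wherever they previously hard-coded `cluster`.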
2 changes: 1 addition & 1 deletion operations/mimir-mixin/alerts/alerts.libsonnet
@@ -435,7 +435,7 @@
alert: $.alertName('ProvisioningTooManyWrites'),
// 80k writes / s per ingester max.
expr: |||
avg by (%(alert_aggregation_labels)s) (cluster_namespace_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m) > 80e3
avg by (%(alert_aggregation_labels)s) (%(alert_aggregation_rule_prefix)s_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m) > 80e3
||| % $._config,
'for': '15m',
labels: {
6 changes: 3 additions & 3 deletions operations/mimir-mixin/alerts/blocks.libsonnet
@@ -14,13 +14,13 @@
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (thanos_objstore_bucket_last_successful_upload_time{job=~".+/ingester.*"}) > 0)
and
# Only if the ingester has ingested samples over the last 4h.
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cluster_namespace_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[4h])) > 0)
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(%(alert_aggregation_rule_prefix)s_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[4h])) > 0)
and
# Only if the ingester was ingesting samples 4h ago. This protects from the case the ingester instance
# had ingested samples in the past, then no traffic was received for a long period and then it starts
# receiving samples again. Without this check, the alert would fire as soon as it gets back receiving
# samples, while a block shipment is expected within the next 4h.
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cluster_namespace_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[1h] offset 4h)) > 0)
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(%(alert_aggregation_rule_prefix)s_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[1h] offset 4h)) > 0)
||| % $._config,
labels: {
severity: 'critical',
@@ -37,7 +37,7 @@
expr: |||
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (thanos_objstore_bucket_last_successful_upload_time{job=~".+/ingester.*"}) == 0)
and
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(cluster_namespace_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[4h])) > 0)
(max by(%(alert_aggregation_labels)s, %(per_instance_label)s) (max_over_time(%(alert_aggregation_rule_prefix)s_%(per_instance_label)s:cortex_ingester_ingested_samples_total:rate1m[4h])) > 0)
||| % $._config,
labels: {
severity: 'critical',
7 changes: 5 additions & 2 deletions operations/mimir-mixin/config.libsonnet
@@ -35,9 +35,12 @@
overrides_exporter: 'overrides-exporter',
},

// The label used to differentiate between different Kubernetes clusters.
per_cluster_label: 'cluster',

// Grouping labels, to uniquely identify and group by {jobs, clusters}
job_labels: ['cluster', 'namespace', 'job'],
cluster_labels: ['cluster', 'namespace'],
job_labels: [$._config.per_cluster_label, 'namespace', 'job'],
cluster_labels: [$._config.per_cluster_label, 'namespace'],

cortex_p99_latency_threshold_seconds: 2.5,
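
The change to `job_labels` and `cluster_labels` relies on Jsonnet's late binding: `$._config.per_cluster_label` is resolved against the final, merged object, so an override applied later is picked up by both lists. The standalone sketch below reproduces only the fields shown in this hunk; the `base` and `custom` locals and the `k8s_cluster` value are hypothetical names used for illustration.

```jsonnet
// Reduced copy of the relevant _config fields (not the full mixin config).
local base = {
  _config:: {
    per_cluster_label: 'cluster',
    // '$' refers to the outermost object, evaluated after all overrides are merged.
    job_labels: [$._config.per_cluster_label, 'namespace', 'job'],
    cluster_labels: [$._config.per_cluster_label, 'namespace'],
  },
};

// A consumer overrides only the label name.
local custom = base { _config+:: { per_cluster_label: 'k8s_cluster' } };

{
  default: base._config.job_labels,
  custom: custom._config.job_labels,
}
```

Evaluating this should give `["cluster", "namespace", "job"]` for `default` and `["k8s_cluster", "namespace", "job"]` for `custom`, without the override having to restate either list.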

48 changes: 24 additions & 24 deletions operations/mimir-mixin/dashboards/alertmanager.libsonnet
@@ -11,11 +11,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
})
.addPanel(
$.panel('Total alerts') +
$.statPanel('sum(cluster_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)], format='short')
$.statPanel('sum(%s_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_cluster_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)], format='short')
)
.addPanel(
$.panel('Total silences') +
$.statPanel('sum(cluster_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)], format='short')
$.statPanel('sum(%s_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_cluster_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)], format='short')
)
.addPanel(
$.panel('Tenants') +
@@ -29,11 +29,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
$.queryPanel(
[
|||
sum(cluster_job:cortex_alertmanager_alerts_received_total:rate5m{%s})
sum(%s_job:cortex_alertmanager_alerts_received_total:rate5m{%s})
-
sum(cluster_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s})
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s})' % $.jobMatcher($._config.job_names.alertmanager),
sum(%s_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s})
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job:cortex_alertmanager_alerts_invalid_total:rate5m{%s})' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success', 'failed']
)
@@ -46,11 +46,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
$.queryPanel(
[
|||
sum(cluster_job_integration:cortex_alertmanager_notifications_total:rate5m{%s})
sum(%s_job_integration:cortex_alertmanager_notifications_total:rate5m{%s})
-
sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s})
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s})' % $.jobMatcher($._config.job_names.alertmanager),
sum(%s_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s})
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s})' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success', 'failed']
)
@@ -61,13 +61,13 @@ local utils = import 'mixin-utils/utils.libsonnet';
[
|||
(
sum(cluster_job_integration:cortex_alertmanager_notifications_total:rate5m{%s}) by(integration)
sum(%s_job_integration:cortex_alertmanager_notifications_total:rate5m{%s}) by(integration)
-
sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration)
sum(%s_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration)
) > 0
or on () vector(0)
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration)' % $.jobMatcher($._config.job_names.alertmanager),
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job_integration:cortex_alertmanager_notifications_failed_total:rate5m{%s}) by(integration)' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success - {{ integration }}', 'failed - {{ integration }}']
)
@@ -104,15 +104,15 @@ local utils = import 'mixin-utils/utils.libsonnet';
.addPanel(
$.panel('Per %s alerts' % $._config.per_instance_label) +
$.queryPanel(
'sum by(%s) (cluster_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_instance_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum by(%s) (%s_job_%s:cortex_alertmanager_alerts:sum{%s})' % [$._config.per_instance_label, $._config.per_cluster_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)],
'{{%s}}' % $._config.per_instance_label
) +
$.stack
)
.addPanel(
$.panel('Per %s silences' % $._config.per_instance_label) +
$.queryPanel(
'sum by(%s) (cluster_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_instance_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum by(%s) (%s_job_%s:cortex_alertmanager_silences:sum{%s})' % [$._config.per_instance_label, $._config.per_cluster_label, $._config.per_instance_label, $.jobMatcher($._config.job_names.alertmanager)],
'{{%s}}' % $._config.per_instance_label
) +
$.stack
@@ -205,11 +205,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
$.queryPanel(
[
|||
sum(cluster_job:cortex_alertmanager_state_replication_total:rate5m{%s})
sum(%s_job:cortex_alertmanager_state_replication_total:rate5m{%s})
-
sum(cluster_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s})
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s})' % $.jobMatcher($._config.job_names.alertmanager),
sum(%s_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s})
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job:cortex_alertmanager_state_replication_failed_total:rate5m{%s})' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success', 'failed']
)
@@ -219,11 +219,11 @@ local utils = import 'mixin-utils/utils.libsonnet';
$.queryPanel(
[
|||
sum(cluster_job:cortex_alertmanager_partial_state_merges_total:rate5m{%s})
sum(%s_job:cortex_alertmanager_partial_state_merges_total:rate5m{%s})
-
sum(cluster_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s})
||| % [$.jobMatcher($._config.job_names.alertmanager), $.jobMatcher($._config.job_names.alertmanager)],
'sum(cluster_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s})' % $.jobMatcher($._config.job_names.alertmanager),
sum(%s_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s})
||| % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager), $._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
'sum(%s_job:cortex_alertmanager_partial_state_merges_failed_total:rate5m{%s})' % [$._config.per_cluster_label, $.jobMatcher($._config.job_names.alertmanager)],
],
['success', 'failed']
)
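
To make the effect on the Alertmanager panels concrete, here is a hedged sketch of how one of the format strings from this file expands. The `cfg` object and the literal `matcher` string stand in for `$._config` and `$.jobMatcher(...)`; they are hypothetical stand-ins chosen so the snippet runs on its own.

```jsonnet
// Stand-ins for $._config and $.jobMatcher(...); values are illustrative.
local cfg = { per_cluster_label: 'k8s_cluster', per_instance_label: 'pod' };
local matcher = 'k8s_cluster=~"$cluster", job=~"($namespace)/(alertmanager)"';

{
  // Same format string as the 'Total alerts' panel: the first %s becomes the
  // recording-rule name prefix, the second the instance label, the third the selector.
  total_alerts: 'sum(%s_job_%s:cortex_alertmanager_alerts:sum{%s})'
                % [cfg.per_cluster_label, cfg.per_instance_label, matcher],
}
```

With these values the query should reference a recording rule named `k8s_cluster_job_pod:cortex_alertmanager_alerts:sum`, so the recording rules need to be generated with the same `per_cluster_label` for the panel to return data.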
18 changes: 9 additions & 9 deletions operations/mimir-mixin/dashboards/dashboard-utils.libsonnet
@@ -54,17 +54,17 @@ local utils = import 'mixin-utils/utils.libsonnet';
if $._config.singleBinary
then d.addMultiTemplate('job', 'cortex_build_info', 'job')
else d
.addMultiTemplate('cluster', 'cortex_build_info', 'cluster')
.addMultiTemplate('namespace', 'cortex_build_info{cluster=~"$cluster"}', 'namespace')
.addMultiTemplate('cluster', 'cortex_build_info', '%s' % $._config.per_cluster_label)
.addMultiTemplate('namespace', 'cortex_build_info{%s=~"$cluster"}' % $._config.per_cluster_label, 'namespace')
else
if $._config.singleBinary
then d.addTemplate('job', 'cortex_build_info', 'job')
else d
.addTemplate('cluster', 'cortex_build_info', 'cluster')
.addTemplate('namespace', 'cortex_build_info{cluster=~"$cluster"}', 'namespace'),
.addTemplate('cluster', 'cortex_build_info', '%s' % $._config.per_cluster_label)
.addTemplate('namespace', 'cortex_build_info{%s=~"$cluster"}' % $._config.per_cluster_label, 'namespace'),

addActiveUserSelectorTemplates()::
self.addTemplate('user', 'cortex_ingester_active_series{cluster=~"$cluster", namespace=~"$namespace"}', 'user'),
self.addTemplate('user', 'cortex_ingester_active_series{%s=~"$cluster", namespace=~"$namespace"}' % $._config.per_cluster_label, 'user'),

addCustomTemplate(name, values, defaultIndex=0):: self {
templating+: {
@@ -99,17 +99,17 @@ local utils = import 'mixin-utils/utils.libsonnet';
jobMatcher(job)::
if $._config.singleBinary
then 'job=~"$job"'
else 'cluster=~"$cluster", job=~"($namespace)/(%s)"' % job,
else '%s=~"$cluster", job=~"($namespace)/(%s)"' % [$._config.per_cluster_label, job],

namespaceMatcher()::
if $._config.singleBinary
then 'job=~"$job"'
else 'cluster=~"$cluster", namespace=~"$namespace"',
else '%s=~"$cluster", namespace=~"$namespace"' % $._config.per_cluster_label,

jobSelector(job)::
if $._config.singleBinary
then [utils.selector.noop('cluster'), utils.selector.re('job', '$job')]
else [utils.selector.re('cluster', '$cluster'), utils.selector.re('job', '($namespace)/(%s)' % job)],
then [utils.selector.noop('%s' % $._config.per_cluster_label), utils.selector.re('job', '$job')]
else [utils.selector.re('%s' % $._config.per_cluster_label, '$cluster'), utils.selector.re('job', '($namespace)/(%s)' % job)],

queryPanel(queries, legends, legendLink=null)::
super.queryPanel(queries, legends, legendLink) + {
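
One detail of the `jobMatcher` change worth spelling out: only the Prometheus label name becomes configurable, while the Grafana template variable stays `$cluster`. The sketch below re-implements the function as a standalone local that takes the config as an argument; that signature and the `k8s_cluster` value are assumptions made so the snippet is self-contained.

```jsonnet
// Standalone re-implementation for illustration; the real function reads $._config.
local jobMatcher(cfg, job) =
  if cfg.singleBinary
  then 'job=~"$job"'
  else '%s=~"$cluster", job=~"($namespace)/(%s)"' % [cfg.per_cluster_label, job];

{
  microservices: jobMatcher({ singleBinary: false, per_cluster_label: 'k8s_cluster' }, 'ingester'),
  single_binary: jobMatcher({ singleBinary: true, per_cluster_label: 'k8s_cluster' }, 'ingester'),
}
```

In microservices mode the selector should come out as `k8s_cluster=~"$cluster", job=~"($namespace)/(ingester)"`; in single-binary mode the cluster label is not used at all.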
4 changes: 2 additions & 2 deletions operations/mimir-mixin/dashboards/overrides.libsonnet
@@ -13,7 +13,7 @@ local utils = import 'mixin-utils/utils.libsonnet';
datasource: '${datasource}',
targets: [
{
expr: 'max by(limit_name) (cortex_limits_defaults{cluster=~"$cluster",namespace=~"$namespace"})',
expr: 'max by(limit_name) (cortex_limits_defaults{%s=~"$cluster",namespace=~"$namespace"})' % $._config.per_cluster_label,
instant: true,
legendFormat: '',
refId: 'A',
@@ -69,7 +69,7 @@ local utils = import 'mixin-utils/utils.libsonnet';
datasource: '${datasource}',
targets: [
{
expr: 'max by(user, limit_name) (cortex_limits_overrides{cluster=~"$cluster",namespace=~"$namespace",user=~"${tenant_id}"})',
expr: 'max by(user, limit_name) (cortex_limits_overrides{%s=~"$cluster",namespace=~"$namespace",user=~"${tenant_id}"})' % $._config.per_cluster_label,
instant: true,
legendFormat: '',
refId: 'A',