[DocDB] New prometheus-metrics API to fully utilize metric server aggregation #19943

@yusong-yan

Jira Link: DB-8907

Description

The current /prometheus-metrics endpoint API takes two parameters:
metrics - a CSV whitelist that filters all metrics.
priority_regex - a regex that filters table-level metrics, that is, table metrics and tablet metrics aggregated to the table level.
The first issue is that the CSV list can grow too long and hit the URL length limit.
Second, even though we do filter per-table metrics, we still get 50+ metrics for each table from each DB node. A universe with 1000 tables and 9 nodes therefore produces 450K+ per-table metrics (1000 x 9 x 50). With multiple such universes, Prometheus scrapes 1M+ metrics, which results in very high Prometheus memory usage. However, for many of the metrics matched by the regex filter, YBA doesn't use per-table granularity. Rather than exporting them at the table level, we can aggregate them to the server level to reduce memory usage.
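
The 450K figure is plain multiplication, and the same arithmetic shows why server-level aggregation helps. A minimal sketch, assuming exactly 50 metrics per table and that a server-level metric yields one series per node (the function names are illustrative, not part of DocDB):

```python
def per_table_series(tables: int, nodes: int, metrics_per_table: int) -> int:
    """Series exported when every metric is kept at table granularity."""
    return tables * nodes * metrics_per_table


def server_level_series(nodes: int, metrics: int) -> int:
    """Series exported when the same metrics are aggregated per server (assumed: one per node)."""
    return nodes * metrics


# Numbers from the issue: 1000 tables, 9 nodes, 50 metrics per table.
table_level = per_table_series(tables=1000, nodes=9, metrics_per_table=50)
server_level = server_level_series(nodes=9, metrics=50)

print(table_level)   # 450000
print(server_level)  # 450
```

In practice only the metrics that YBA does not need per table would move to the server level, so the real saving lies somewhere between these two numbers.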

DB should switch to a brand new API that can enable server-level aggregation (implemented in #18078; the current API can't turn on server-level aggregation for all metrics due to backward compatibility concerns).

With the new API, DB should always aggregate tablet -> table -> server by default, so the same metric can be exposed at different levels, and the caller can specify a regex (allowlist, blocklist) at each level. A cleaner API would look like:
table_allowlist, table_blocklist - regexes that filter table-level metrics
server_allowlist, server_blocklist - regexes that filter server-level metrics
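
One way to read the proposed parameters is sketched below. This is not the DocDB implementation: the matching semantics (full-match regex, an empty blocklist blocks nothing, a metric is exported at a level when it matches that level's allowlist and not its blocklist) are assumptions, and the metric names are illustrative only. Only the four parameter names and the endpoint path come from the proposal.

```python
import re
from urllib.parse import urlencode


def level_selected(metric: str, allowlist: str, blocklist: str) -> bool:
    """Assumed semantics: keep a metric if it matches the allowlist and not the blocklist."""
    if not re.fullmatch(allowlist, metric):
        return False
    # An empty blocklist is treated as "block nothing" (assumption).
    return not (blocklist and re.fullmatch(blocklist, metric))


def export_levels(metric: str, params: dict[str, str]) -> list[str]:
    """Return the levels ("table", "server") at which a metric would be exported."""
    levels = []
    for level in ("table", "server"):
        allow = params.get(f"{level}_allowlist", "")
        block = params.get(f"{level}_blocklist", "")
        if level_selected(metric, allow, block):
            levels.append(level)
    return levels


# Hypothetical request: keep rocksdb_* per table, everything else per server only.
params = {
    "table_allowlist": "rocksdb_.*",
    "table_blocklist": "",
    "server_allowlist": ".*",
    "server_blocklist": "rocksdb_.*",
}

# The query string a scraper would append to /prometheus-metrics.
query = "/prometheus-metrics?" + urlencode(params)

for name in ("rocksdb_sst_files_size", "rows_inserted"):  # illustrative metric names
    print(name, export_levels(name, params))
```

Because each level has its own pair of regexes, a caller can keep a handful of metrics at table granularity while collapsing the rest to one series per server, without sending a long CSV list in the URL.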

Issue Type

kind/new-feature

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
