Description
Jira Link: DB-8907
Here is the current /prometheus-metrics endpoint API:
- metrics: a CSV whitelist that filters all the metrics.
- priority_regex: filters table-level metrics, that is, table metrics and tablet metrics aggregated to the table level.
First, the CSV list can grow too long and hit the URL length limit.
Second, even though we filter per-table metrics, we still get 50+ metrics for each table from each DB node. A universe with 1000 tables and 9 nodes therefore produces 450K+ per-table metrics (1000 x 9 x 50). With multiple such universes, Prometheus scrapes 1M+ metrics, which results in very high Prometheus memory usage. However, for many metrics in the regex filter, YBA doesn't use per-table granularity. Rather than exporting them at the table level, we can aggregate them to the server level to reduce memory usage.
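To make the numbers above concrete, here is a rough back-of-the-envelope sketch. The function names are hypothetical, and the server-level figure assumes every one of the 50 metrics could be aggregated to one series per node, which is an upper bound on the saving rather than a measured result.

```python
def per_table_series(tables, nodes, metrics_per_table):
    """Series exported when each metric is kept per table on every node."""
    return tables * nodes * metrics_per_table


def server_level_series(nodes, metrics):
    """Series exported when each metric is aggregated to one value per node."""
    return nodes * metrics


# Figures taken from the description: 1000 tables, 9 nodes, 50 metrics per table.
table_level = per_table_series(tables=1000, nodes=9, metrics_per_table=50)
# Assumption: all 50 metrics are aggregated to the server level.
server_level = server_level_series(nodes=9, metrics=50)

print(table_level)   # 450000
print(server_level)  # 450
```

In practice only the metrics for which YBA doesn't use per-table granularity would move to the server level, so the real count would fall somewhere between these two figures.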
DB should switch to a brand new API that can enable server-level aggregation. Aggregation was implemented in #18078, but the current API can't turn on server-level aggregation for all metrics due to backward compatibility concerns.
With the new API, DB should always aggregate tablet -> table -> server by default, and the same metric can be exposed at different levels. The caller can specify regexes (an allowlist and a blocklist) at each level. Thus, a cleaner API should look like:
- table_allowlist, table_blocklist: filter table-level metrics based on the regex.
- server_allowlist, server_blocklist: filter server-level metrics based on the regex.
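The following is a minimal sketch of how the four proposed parameters might behave. It is not the DB implementation. Only the parameter names come from this issue. The matching semantics (full-match regex, blocklist applied after allowlist, an omitted list meaning "no filter"), summing as the table -> server aggregation, the metric names, and the host and port in the URL are all assumptions for illustration.

```python
import re
from collections import defaultdict
from urllib.parse import urlencode


def passes(name, allowlist, blocklist):
    """Assumed semantics: must fully match the allowlist, must not match the blocklist."""
    if allowlist is not None and not re.fullmatch(allowlist, name):
        return False
    if blocklist is not None and re.fullmatch(blocklist, name):
        return False
    return True


def export(table_samples, table_allowlist=None, table_blocklist=None,
           server_allowlist=None, server_blocklist=None):
    """table_samples maps (table, metric) -> value, already aggregated tablet -> table."""
    table_level = {
        key: value for key, value in table_samples.items()
        if passes(key[1], table_allowlist, table_blocklist)
    }
    # Aggregate table -> server by summing (an assumption; some metrics may
    # need a different aggregation function).
    totals = defaultdict(int)
    for (_table, metric), value in table_samples.items():
        totals[metric] += value
    server_level = {
        metric: value for metric, value in totals.items()
        if passes(metric, server_allowlist, server_blocklist)
    }
    return table_level, server_level


# Illustrative samples; the metric names are placeholders.
samples = {
    ("t1", "rows_inserted"): 10,
    ("t2", "rows_inserted"): 5,
    ("t1", "sst_files_size"): 100,
    ("t2", "sst_files_size"): 300,
}
params = {
    "table_allowlist": "sst_.*",    # keep only these at table granularity
    "server_allowlist": ".*",       # everything is eligible at server level...
    "server_blocklist": "sst_.*",   # ...except what is already exported per table
}
table_level, server_level = export(samples, **params)

# Hypothetical host and port; the regexes keep the query string short
# compared with a long CSV list of metric names.
url = "http://localhost:9000/prometheus-metrics?" + urlencode(params)
```

With these inputs, sst_files_size stays at table granularity (one series per table), while rows_inserted is exported once per server as the sum over both tables.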
Issue Type
kind/new-feature
Warning: Please confirm that this issue does not contain any sensitive information
- I confirm this issue does not contain any sensitive information.