Description
Background
In large-scale deployments, the tidb_tikvclient_request_seconds histogram metric is a major contributor to metrics cardinality, accounting for approximately 25-30% of total time series. This creates significant operational overhead in metrics collection, storage, and querying.
Current Implementation
The metric is currently defined with four label dimensions:
```go
TiKVSendReqHistogram = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace:   namespace,
		Subsystem:   subsystem,
		Name:        "request_seconds",
		Help:        "Bucketed histogram of sending request duration.",
		Buckets:     prometheus.ExponentialBuckets(0.0005, 2, 24), // 0.5ms ~ 1.2h
		ConstLabels: constLabels,
	}, []string{LblType, LblStore, LblStaleRead, LblScope})
```
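For context, every request send records a single observation into this vector with all four label values. A minimal sketch of such a call site follows; the helper name `observeSendReq` and its parameters are illustrative, not the actual client-go code:

```go
package metrics // illustrative package; not the actual client-go layout

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeSendReq records one request duration into the current four-label
// vector. Label values must follow the declared order:
// type, store, stale_read, scope.
func observeSendReq(hist *prometheus.HistogramVec, reqType, storeID string,
	staleRead bool, scope string, elapsed time.Duration) {
	hist.WithLabelValues(reqType, storeID, strconv.FormatBool(staleRead), scope).
		Observe(elapsed.Seconds())
}
```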
Cardinality Impact
This creates time series based on:
- ~40 request types (Cop, Get, Prewrite, Commit, etc.)
- N stores (number of TiKV instances)
- 2 stale_read values (true/false)
- 2 scope values (local/global)
- 24 histogram buckets
Total: ~3,840N time series (40 × 2 × 2 × 24 = 3,840 per store; e.g., 384,000 time series for a 100-store cluster)
Root Cause
The metric combines orthogonal dimensions that are rarely queried together. In practice:
- Cluster-wide analysis focuses on request types: "What's the P99 for Prewrite requests?"
- Store-level analysis focuses on individual stores: "Is store X slow?"
- Cross-dimensional queries (type X on store Y) are rare and better served by TiKV server-side metrics
The current design optimizes for the least common use case, causing a massive cardinality explosion.
Proposed Solution
Split into two separate metrics:
1. By Request Type (cluster-wide)
```go
// tidb_tikvclient_request_seconds
prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace:   namespace,
		Subsystem:   subsystem,
		Name:        "request_seconds",
		Help:        "Bucketed histogram of sending request duration (cluster-wide).",
		Buckets:     prometheus.ExponentialBuckets(0.0005, 2, 24),
		ConstLabels: constLabels,
	}, []string{LblType, LblStaleRead, LblScope})
```
Cardinality: ~3,840 time series (constant, independent of cluster size)
2. By Store (aggregated across types)
```go
// tidb_tikvclient_request_seconds_by_store
prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace:   namespace,
		Subsystem:   subsystem,
		Name:        "request_seconds_by_store",
		Help:        "Bucketed histogram of sending request duration by store (aggregated across types).",
		Buckets:     prometheus.ExponentialBuckets(0.0005, 2, 24),
		ConstLabels: constLabels,
	}, []string{LblStore, LblStaleRead, LblScope})
```
Cardinality: ~96N time series (2 stale_read × 2 scope × 24 buckets per store)
Impact
Before: ~3,840N time series
After: ~3,840 + 96N time series
For a 100-store cluster:
- Before: 384,000 time series
- After: 13,440 time series
- Reduction: ~96.5% (370,560 fewer time series)
Observability Impact
This change preserves observability for all common query patterns:
| Use Case | Metric | Status |
|---|---|---|
| P99 by request type (cluster-wide) | `tidb_tikvclient_request_seconds{type="Prewrite"}` | ✅ Unchanged |
| P99 by store (all types) | `tidb_tikvclient_request_seconds_by_store{store="tikv-1"}` | ✅ New metric |
| P99 for type X on store Y | `tikv_grpc_msg_duration_seconds` (server-side) | ✅ Use TiKV metric |
| Request rate by type | `rate(tidb_tikvclient_request_seconds_count[5m])` | ✅ Unchanged |
| Store-level request rate | `rate(tidb_tikvclient_request_seconds_by_store_count[5m])` | ✅ New metric |
Implementation Approach
- Add new metric `tidb_tikvclient_request_seconds_by_store` with `{store, stale_read, scope}` labels
- Modify the existing metric to remove `LblStore`, keeping `{type, stale_read, scope}` labels
- Update all call sites that populate `TiKVSendReqHistogram` to record to both metrics appropriately (see the sketch after this list)
- Update documentation and example queries
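A minimal sketch of the call-site change in the third step, reusing the illustrative `observeSendReq` helper from above; the parameter names and wiring are assumptions, not the actual client-go implementation:

```go
package metrics // illustrative package; not the actual client-go layout

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeSendReq replaces the single four-label Observe call: one observation
// goes to the cluster-wide vector (type, stale_read, scope) and one to the
// per-store vector (store, stale_read, scope) defined in the proposal above.
func observeSendReq(byType, byStore *prometheus.HistogramVec,
	reqType, storeID string, staleRead bool, scope string, elapsed time.Duration) {
	stale := strconv.FormatBool(staleRead)
	seconds := elapsed.Seconds()

	byType.WithLabelValues(reqType, stale, scope).Observe(seconds)
	byStore.WithLabelValues(storeID, stale, scope).Observe(seconds)
}
```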
Why This Matters
This particularly benefits large TiDB clusters (50+ TiKV stores) where metrics cardinality becomes an operational challenge, causing:
- High OpenTelemetry collector CPU/memory usage
- Prometheus storage pressure and query performance issues
- Hitting cardinality limits in metrics backends
A single metric change can reduce total cluster time series by 25-30% in large deployments.
Alternatives Considered
- Reduce histogram buckets - Rejected: Loses percentile accuracy, minimal reduction
- Prometheus relabeling - Rejected: Pushes complexity to users, doesn't solve source overhead
- Metric sampling - Rejected: Loses accuracy for critical performance metrics
- Optional store label - Rejected: Doesn't address combinatorial explosion
Happy to contribute the implementation if this proposal is accepted.