Description
Background
In large-scale deployments, the tidb_tikvclient_request_seconds histogram metric is a major contributor to metrics cardinality, accounting for approximately 25-30% of total time series. This creates significant operational overhead in metrics collection, storage, and querying.
Current Implementation
The metric is currently defined with four label dimensions:
```go
TiKVSendReqHistogram = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace:   namespace,
		Subsystem:   subsystem,
		Name:        "request_seconds",
		Help:        "Bucketed histogram of sending request duration.",
		Buckets:     prometheus.ExponentialBuckets(0.0005, 2, 24), // 0.5ms ~ 1.2h
		ConstLabels: constLabels,
	}, []string{LblType, LblStore, LblStaleRead, LblScope})
```
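For context, every request send records a single observation into this vector with all four label values. A minimal sketch of such a call site follows; the helper name `observeSendReq` and its parameters are illustrative, not the actual client-go code:

```go
package metrics // illustrative package; not the actual client-go layout

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeSendReq records one request duration into the current four-label
// vector. Label values must follow the declared order:
// type, store, stale_read, scope.
func observeSendReq(hist *prometheus.HistogramVec, reqType, storeID string,
	staleRead bool, scope string, elapsed time.Duration) {
	hist.WithLabelValues(reqType, storeID, strconv.FormatBool(staleRead), scope).
		Observe(elapsed.Seconds())
}
```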
Cardinality Impact
This creates time series based on:
- ~40 request types (Cop, Get, Prewrite, Commit, etc.)
- N stores (number of TiKV instances)
- 2 stale_read values (true/false)
- 2 scope values (local/global)
- 24 histogram buckets
Total: ~3,840N time series (40 × 2 × 2 × 24 = 3,840 per store; e.g., 384,000 time series for a 100-store cluster)
Root Cause
The metric combines orthogonal dimensions that are rarely queried together. In practice:
- Cluster-wide analysis focuses on request types: "What's the P99 for Prewrite requests?"
- Store-level analysis focuses on individual stores: "Is store X slow?"
- Cross-dimensional queries (type X on store Y) are rare and better served by TiKV server-side metrics
The current design optimizes for the least common use case, causing a massive cardinality explosion.
Proposed Solution
Split into two separate metrics:
1. By Request Type (cluster-wide)
```go
// tidb_tikvclient_request_seconds
prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace:   namespace,
		Subsystem:   subsystem,
		Name:        "request_seconds",
		Help:        "Bucketed histogram of sending request duration (cluster-wide).",
		Buckets:     prometheus.ExponentialBuckets(0.0005, 2, 24),
		ConstLabels: constLabels,
	}, []string{LblType, LblStaleRead, LblScope})
```
Cardinality: ~3,840 time series (constant, independent of cluster size)
2. By Store (aggregated across types)
```go
// tidb_tikvclient_request_seconds_by_store
prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace:   namespace,
		Subsystem:   subsystem,
		Name:        "request_seconds_by_store",
		Help:        "Bucketed histogram of sending request duration by store (aggregated across types).",
		Buckets:     prometheus.ExponentialBuckets(0.0005, 2, 24),
		ConstLabels: constLabels,
	}, []string{LblStore, LblStaleRead, LblScope})
```
Cardinality: ~96N time series (2 stale_read × 2 scope × 24 buckets per store)
Impact
Before: ~3,840N time series
After: ~3,840 + 96N time series
For a 100-store cluster:
- Before: 384,000 time series
- After: 13,440 time series
- Reduction: ~96.5% (370,560 fewer time series)
Observability Impact
This change preserves observability for all common query patterns:
| Use Case | Metric | Status |
|---|---|---|
| P99 by request type (cluster-wide) | `tidb_tikvclient_request_seconds{type="Prewrite"}` | ✅ Unchanged |
| P99 by store (all types) | `tidb_tikvclient_request_seconds_by_store{store="tikv-1"}` | ✅ New metric |
| P99 for type X on store Y | `tikv_grpc_msg_duration_seconds` (server-side) | ✅ Use TiKV metric |
| Request rate by type | `rate(tidb_tikvclient_request_seconds_count[5m])` | ✅ Unchanged |
| Store-level request rate | `rate(tidb_tikvclient_request_seconds_by_store_count[5m])` | ✅ New metric |
Implementation Approach
- Add new metric `tidb_tikvclient_request_seconds_by_store` with `{store, stale_read, scope}` labels
- Modify the existing metric to remove `LblStore`, keeping `{type, stale_read, scope}` labels
- Update all call sites that populate `TiKVSendReqHistogram` to record to both metrics appropriately (see the sketch after this list)
- Update documentation and example queries
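A minimal sketch of the call-site change in the third step, reusing the illustrative `observeSendReq` helper from above; the parameter names and wiring are assumptions, not the actual client-go implementation:

```go
package metrics // illustrative package; not the actual client-go layout

import (
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// observeSendReq replaces the single four-label Observe call: one observation
// goes to the cluster-wide vector (type, stale_read, scope) and one to the
// per-store vector (store, stale_read, scope) defined in the proposal above.
func observeSendReq(byType, byStore *prometheus.HistogramVec,
	reqType, storeID string, staleRead bool, scope string, elapsed time.Duration) {
	stale := strconv.FormatBool(staleRead)
	seconds := elapsed.Seconds()

	byType.WithLabelValues(reqType, stale, scope).Observe(seconds)
	byStore.WithLabelValues(storeID, stale, scope).Observe(seconds)
}
```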
Why This Matters
This particularly benefits large TiDB clusters (50+ TiKV stores) where metrics cardinality becomes an operational challenge, causing:
- High OpenTelemetry collector CPU/memory usage
- Prometheus storage pressure and query performance issues
- Hitting cardinality limits in metrics backends
A single metric change can reduce total cluster time series by 25-30% in large deployments.
Alternatives Considered
- Reduce histogram buckets - Rejected: Loses percentile accuracy, minimal reduction
- Prometheus relabeling - Rejected: Pushes complexity to users, doesn't solve source overhead
- Metric sampling - Rejected: Loses accuracy for critical performance metrics
- Optional store label - Rejected: Doesn't address combinatorial explosion
Happy to contribute the implementation if this proposal is accepted.