High Metrics Cardinality Issue: Split tidb_tikvclient_request_seconds Metric to Reduce Time Series Explosion #1832

Background

In large-scale deployments, the tidb_tikvclient_request_seconds histogram metric is a major contributor to metrics cardinality, accounting for approximately 25-30% of total time series. This creates significant operational overhead in metrics collection, storage, and querying.

Current Implementation

The metric is currently defined with four label dimensions:

TiKVSendReqHistogram = prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace:   namespace,
        Subsystem:   subsystem,
        Name:        "request_seconds",
        Help:        "Bucketed histogram of sending request duration.",
        Buckets:     prometheus.ExponentialBuckets(0.0005, 2, 24), // 0.5ms ~ 1.2h
        ConstLabels: constLabels,
    }, []string{LblType, LblStore, LblStaleRead, LblScope})
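
For context, every request observation supplies all four label values. A hedged illustration of such a call site follows; the function name, argument plumbing, and import path are assumptions for this sketch rather than a verbatim copy of client-go code, and only the exported TiKVSendReqHistogram and its label order come from the definition above:

package example

import (
    "time"

    "github.com/tikv/client-go/v2/metrics"
)

// recordSendReq is a hypothetical helper showing where each label value
// enters the histogram. The label order matches the definition above:
// type, store, stale_read, scope.
func recordSendReq(reqType, storeID, staleRead, scope string, start time.Time) {
    metrics.TiKVSendReqHistogram.
        WithLabelValues(reqType, storeID, staleRead, scope).
        Observe(time.Since(start).Seconds())
}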

Cardinality Impact

This creates time series based on:

  • ~40 request types (Cop, Get, Prewrite, Commit, etc.)
  • N stores (number of TiKV instances)
  • 2 stale_read values (true/false)
  • 2 scope values (local/global)
  • 24 histogram buckets

Total: 40 × N × 2 × 2 × 24 ≈ 3,840N time series (e.g., 384,000 time series for a 100-store cluster)

Root Cause

The metric combines orthogonal dimensions that are rarely queried together. In practice:

  1. Cluster-wide analysis focuses on request types: "What's the P99 for Prewrite requests?"
  2. Store-level analysis focuses on individual stores: "Is store X slow?"
  3. Cross-dimensional queries (type X on store Y) are rare and better served by TiKV server-side metrics

The current design optimizes for the least common use case, causing massive cardinality explosion.

Proposed Solution

Split into two separate metrics:

1. By Request Type (cluster-wide)

// tidb_tikvclient_request_seconds
prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace:   namespace,
        Subsystem:   subsystem,
        Name:        "request_seconds",
        Help:        "Bucketed histogram of sending request duration (cluster-wide).",
        Buckets:     prometheus.ExponentialBuckets(0.0005, 2, 24),
        ConstLabels: constLabels,
    }, []string{LblType, LblStaleRead, LblScope})

Cardinality: 40 × 2 × 2 × 24 ≈ 3,840 time series (constant, independent of cluster size)

2. By Store (aggregated across types)

// tidb_tikvclient_request_seconds_by_store
prometheus.NewHistogramVec(
    prometheus.HistogramOpts{
        Namespace:   namespace,
        Subsystem:   subsystem,
        Name:        "request_seconds_by_store",
        Help:        "Bucketed histogram of sending request duration by store (aggregated across request types).",
        Buckets:     prometheus.ExponentialBuckets(0.0005, 2, 24),
        ConstLabels: constLabels,
    }, []string{LblStore, LblStaleRead, LblScope})

Cardinality: N × 2 × 2 × 24 = 96N time series

Impact

Before: ~3,840N time series
After: ~3,840 + 96N time series

For a 100-store cluster:

  • Before: 384,000 time series
  • After: 13,440 time series
  • Reduction: ~96.5% (370,560 fewer time series; arithmetic sketched below)
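
A back-of-the-envelope check of these figures. This is a sketch only: like the figures above, it counts 24 bucket series per label combination and ignores the extra +Inf bucket and the _sum/_count series.

package example

// Series-count arithmetic behind the numbers above. The constants mirror
// the assumptions stated earlier (~40 request types, 2 stale_read values,
// 2 scope values, 24 buckets).
const (
    requestTypes = 40
    staleReads   = 2
    scopes       = 2
    buckets      = 24
)

// seriesBefore: a single histogram labeled with type, store, stale_read, scope.
func seriesBefore(stores int) int {
    return requestTypes * stores * staleReads * scopes * buckets // 3,840N
}

// seriesAfter: one cluster-wide histogram (no store label) plus one
// per-store histogram (no type label).
func seriesAfter(stores int) int {
    return requestTypes*staleReads*scopes*buckets + // 3,840
        stores*staleReads*scopes*buckets // 96N
}

// seriesBefore(100) == 384,000; seriesAfter(100) == 13,440.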

Observability Impact

This change preserves observability for all common query patterns:

| Use Case | Metric | Status |
| --- | --- | --- |
| P99 by request type (cluster-wide) | tidb_tikvclient_request_seconds{type="Prewrite"} | ✅ Unchanged |
| P99 by store (all types) | tidb_tikvclient_request_seconds_by_store{store="tikv-1"} | ✅ New metric |
| P99 for type X on store Y | tikv_grpc_msg_duration_seconds (server-side) | ✅ Use TiKV metric |
| Request rate by type | rate(tidb_tikvclient_request_seconds_count[5m]) | ✅ Unchanged |
| Store-level request rate | rate(tidb_tikvclient_request_seconds_by_store_count[5m]) | ✅ New metric |

Implementation Approach

  1. Add new metric tidb_tikvclient_request_seconds_by_store with {store, stale_read, scope} labels
  2. Modify existing metric to remove LblStore, keeping {type, stale_read, scope} labels
  3. Update all call sites that populate TiKVSendReqHistogram to record to both metrics appropriately (a sketch follows this list)
  4. Update documentation and example queries
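
A minimal sketch of steps 1-3 under the assumptions above. The Go identifier TiKVSendReqByStoreHistogram, the hardcoded "tidb"/"tikvclient" namespace and subsystem (the real code uses namespace/subsystem variables and ConstLabels), the init-time registration, and the ObserveSendReq helper are all illustrative, not the final API:

package metrics

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    // Cluster-wide histogram labeled with type, stale_read, scope (no store label).
    TiKVSendReqHistogram = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Namespace: "tidb",
            Subsystem: "tikvclient",
            Name:      "request_seconds",
            Help:      "Bucketed histogram of sending request duration (cluster-wide).",
            Buckets:   prometheus.ExponentialBuckets(0.0005, 2, 24),
        }, []string{"type", "stale_read", "scope"})

    // Per-store histogram labeled with store, stale_read, scope (no type label).
    TiKVSendReqByStoreHistogram = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Namespace: "tidb",
            Subsystem: "tikvclient",
            Name:      "request_seconds_by_store",
            Help:      "Bucketed histogram of sending request duration by store (aggregated across request types).",
            Buckets:   prometheus.ExponentialBuckets(0.0005, 2, 24),
        }, []string{"store", "stale_read", "scope"})
)

func init() {
    prometheus.MustRegister(TiKVSendReqHistogram, TiKVSendReqByStoreHistogram)
}

// ObserveSendReq records one request duration into both histograms, so the
// type and store dimensions are observed separately and never multiply
// into a single series set.
func ObserveSendReq(reqType, store, staleRead, scope string, start time.Time) {
    d := time.Since(start).Seconds()
    TiKVSendReqHistogram.WithLabelValues(reqType, staleRead, scope).Observe(d)
    TiKVSendReqByStoreHistogram.WithLabelValues(store, staleRead, scope).Observe(d)
}

With this shape, dashboards that group by request type keep querying tidb_tikvclient_request_seconds unchanged, while store-level panels switch to the new _by_store metric.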

Why This Matters

This particularly benefits large TiDB clusters (50+ TiKV stores) where metrics cardinality becomes an operational challenge, causing:

  • High OpenTelemetry collector CPU/memory usage
  • Prometheus storage pressure and query performance issues
  • Hitting cardinality limits in metrics backends

A single metric change can reduce total cluster time series by 25-30% in large deployments.

Alternatives Considered

  1. Reduce histogram buckets - Rejected: Loses percentile accuracy, minimal reduction
  2. Prometheus relabeling - Rejected: Pushes complexity to users, doesn't solve source overhead
  3. Metric sampling - Rejected: Loses accuracy for critical performance metrics
  4. Optional store label - Rejected: Doesn't address combinatorial explosion

Happy to contribute the implementation if this proposal is accepted.
