[BUG] Segment Replication aggregate metrics are misleading at a node level #10123

Open
@mch2

Description

Metrics for Segment Replication were recently added to the index/node/cluster level APIs. They include max bytes behind, max replication lag, and total bytes behind at each of these levels.

These metrics are computed by the primary shard of each replication group within its ReplicationTracker. They were intended for applying backpressure when a primary identifies that its replicas are falling behind. Because they are computed on the primary, they are not representative of their labels when rolled up. For example, at a node level the bytes behind metrics are actually the max/total bytes that the primaries on that node are ahead of their replicas, which are distributed across the cluster. This is misleading and is not the correct metric for identifying lagging nodes.
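To illustrate the mismatch, here is a minimal sketch with made-up data. `ReplicaLag`, `maxByPrimaryNode`, and `maxByReplicaNode` are hypothetical names for this example, not OpenSearch APIs. It contrasts attributing lag to the node hosting the primary (roughly what the current rollup does) with attributing it to the node hosting the replica.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these types exist in OpenSearch.
final class NodeRollupExample {
    /** One primary-to-replica pair and how far the replica is behind. */
    static final class ReplicaLag {
        final String primaryNode;
        final String replicaNode;
        final long bytesBehind;

        ReplicaLag(String primaryNode, String replicaNode, long bytesBehind) {
            this.primaryNode = primaryNode;
            this.replicaNode = replicaNode;
            this.bytesBehind = bytesBehind;
        }
    }

    // Primary's perspective: lag is credited to the node hosting the primary.
    static Map<String, Long> maxByPrimaryNode(List<ReplicaLag> lags) {
        Map<String, Long> out = new HashMap<>();
        for (ReplicaLag lag : lags) {
            out.merge(lag.primaryNode, lag.bytesBehind, Long::max);
        }
        return out;
    }

    // Replica's perspective: lag is credited to the node hosting the replica.
    static Map<String, Long> maxByReplicaNode(List<ReplicaLag> lags) {
        Map<String, Long> out = new HashMap<>();
        for (ReplicaLag lag : lags) {
            out.merge(lag.replicaNode, lag.bytesBehind, Long::max);
        }
        return out;
    }
}
```

With a primary on `node-a` whose replica on `node-b` is 500 bytes behind, the first rollup reports 500 against `node-a`, even though `node-b` is the node that is lagging.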

I propose we rename these metric labels appropriately and add new metrics for bytes behind that are computed from the replica's perspective. We can compute them as follows:

  1. On replicas, store the checkpoints received from the primary
  2. Start a timer for each checkpoint
  3. Clear the timers once replicas complete a sync
  4. Compute replication stats per replica with these two fields - bytes behind can be computed from the metadata sent in the latest received checkpoint, while the lag is the ongoing time of the earliest received checkpoint.
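The four steps above could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual implementation: `ReplicaReplicationTracker`, `Checkpoint`, and `Stats` are hypothetical names, checkpoint metadata is modeled as a plain map of file name to size, and bytes behind is approximated as the total size of files in the latest received checkpoint that the replica does not yet have. Timestamps are passed in explicitly so the behavior is deterministic.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch; these are not OpenSearch's real classes.
final class ReplicaReplicationTracker {
    /** Stand-in for a checkpoint published by the primary (assumed shape). */
    static final class Checkpoint {
        final long version;
        final Map<String, Long> fileSizes; // file name -> size in bytes

        Checkpoint(long version, Map<String, Long> fileSizes) {
            this.version = version;
            this.fileSizes = fileSizes;
        }
    }

    /** Replication stats from the replica's perspective. */
    static final class Stats {
        final long bytesBehind;
        final long lagMillis;

        Stats(long bytesBehind, long lagMillis) {
            this.bytesBehind = bytesBehind;
            this.lagMillis = lagMillis;
        }
    }

    // Steps 1 and 2: receive time of each not-yet-synced checkpoint, by version.
    private final TreeMap<Long, Long> receivedAtMillis = new TreeMap<>();
    private Checkpoint latestReceived;
    private Set<String> localFiles = new HashSet<>();

    synchronized void onCheckpointReceived(Checkpoint cp, long nowMillis) {
        receivedAtMillis.putIfAbsent(cp.version, nowMillis);
        if (latestReceived == null || cp.version > latestReceived.version) {
            latestReceived = cp;
        }
    }

    // Step 3: a completed sync clears timers up to and including that checkpoint.
    synchronized void onSyncCompleted(Checkpoint syncedTo) {
        receivedAtMillis.headMap(syncedTo.version, true).clear();
        localFiles = new HashSet<>(syncedTo.fileSizes.keySet());
    }

    // Step 4: bytes behind from the latest checkpoint, lag from the earliest one.
    synchronized Stats stats(long nowMillis) {
        if (receivedAtMillis.isEmpty()) {
            return new Stats(0L, 0L); // nothing pending, replica is caught up
        }
        long bytesBehind = 0L;
        for (Map.Entry<String, Long> file : latestReceived.fileSizes.entrySet()) {
            if (!localFiles.contains(file.getKey())) {
                bytesBehind += file.getValue();
            }
        }
        long lagMillis = nowMillis - receivedAtMillis.firstEntry().getValue();
        return new Stats(bytesBehind, lagMillis);
    }
}
```

Keying the timers by checkpoint version keeps the earliest pending checkpoint cheap to find, so the lag keeps growing until a sync actually covers it.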

In doing this, we will have two sets of metrics. One set is computed from the replica's perspective according to its latest received checkpoint, so it does not account for the time taken to publish checkpoints. The other is computed from the primary's perspective according to its latest refreshed checkpoint, so it does account for publish time.

Metadata

Assignees

No one assigned

    Labels

    Search:Remote Search
    bug: Something isn't working
    v2.11.0: Issues and PRs related to version 2.11.0

    Type

    No type

    Projects

    • Status

      🆕 New

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
