[BUG] Segment Replication aggregate metrics are misleading at a node level #10123

Open
mch2 opened this issue Sep 19, 2023 · 0 comments
Labels: bug, Search:Remote Search, v2.11.0

mch2 commented Sep 19, 2023

Metrics for Segment Replication were recently added to the index-, node-, and cluster-level APIs. They include max bytes behind, max replication lag, and total bytes behind at each of these levels.

These metrics are computed by the primary shard for each replication group within its ReplicationTracker. They were intended to be used to apply backpressure when the primary identifies that its replicas are falling behind. Because they are computed from the primary's perspective, they are not representative of their labels when rolled up. For example, at a node level, the bytes behind metrics are actually the max/total bytes that the primaries on that node are ahead of their replicas, which are distributed across the cluster. This is not the correct metric for identifying lagging nodes, and it is misleading.
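
To illustrate the mismatch, here is a minimal Python sketch (not OpenSearch code). The node names, the byte counts, and the `rollup` helper are all hypothetical. It rolls the same per-replica numbers up twice: once keyed by the node hosting the primary, which is effectively what the current node-level metrics do, and once keyed by the node hosting the replica.

```python
from collections import defaultdict

# Hypothetical per-replica stats as a primary might compute them:
# (node hosting the primary, node hosting the replica, bytes behind).
replica_stats = [
    ("node-a", "node-b", 500),
    ("node-a", "node-c", 0),
    ("node-b", "node-c", 100),
]


def rollup(stats, key_index):
    """Return (total, max) bytes behind per node.

    key_index 0 groups by the primary's node, 1 groups by the replica's node.
    """
    totals = defaultdict(int)
    maxes = defaultdict(int)
    for row in stats:
        node = row[key_index]
        bytes_behind = row[2]
        totals[node] += bytes_behind
        maxes[node] = max(maxes[node], bytes_behind)
    return dict(totals), dict(maxes)


# Grouped by primary location: node-a is reported as 500 bytes "behind",
# although the lagging replica actually lives on node-b.
by_primary_node = rollup(replica_stats, 0)

# Grouped by replica location: the lag is attributed to node-b and node-c.
by_replica_node = rollup(replica_stats, 1)
```

In this made-up layout, the primary-side rollup attributes 500 bytes to `node-a`, while the replica-side rollup attributes them to `node-b`, the node that is actually lagging.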

I propose we rename these metric labels appropriately and add new metrics for bytes behind that are computed from the replica's perspective. We can compute them as follows:

  1. Store the checkpoints received from the primary on each replica
  2. Start a timer for each received checkpoint
  3. Clear the timers once the replica completes a sync
  4. Compute replication stats per replica with these two fields: bytes behind can be computed from the metadata sent in the latest received checkpoint, while the lag is the elapsed time since the earliest received checkpoint that has not yet been synced.
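
The steps above can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual OpenSearch implementation. `Checkpoint`, `ReplicaLagTracker`, and their fields are hypothetical names, and checkpoint metadata is modeled as a simple map of file name to size. Bytes behind is approximated as the size of the files in the latest received checkpoint that the replica does not yet hold.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Checkpoint:
    version: int            # hypothetical: increases with every primary refresh
    files: Dict[str, int]   # hypothetical metadata: segment file name -> size in bytes


class ReplicaLagTracker:
    """Tracks replication lag from the replica's point of view."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._pending: Dict[int, Tuple[Checkpoint, float]] = {}  # version -> (checkpoint, received_at)
        self._local_files: Dict[str, int] = {}                   # files the replica already has

    def on_checkpoint_received(self, cp: Checkpoint) -> None:
        # Steps 1 and 2: store the checkpoint and start its timer.
        self._pending.setdefault(cp.version, (cp, self._clock()))

    def on_sync_completed(self, cp: Checkpoint) -> None:
        # Step 3: clear timers for every checkpoint covered by this sync.
        for version in [v for v in self._pending if v <= cp.version]:
            del self._pending[version]
        self._local_files = dict(cp.files)

    def stats(self) -> Tuple[int, float]:
        # Step 4: return (bytes_behind, lag_seconds).
        if not self._pending:
            return 0, 0.0
        latest = self._pending[max(self._pending)][0]
        bytes_behind = sum(
            size for name, size in latest.files.items()
            if name not in self._local_files
        )
        earliest_received = min(received for _, received in self._pending.values())
        return bytes_behind, self._clock() - earliest_received
```

Injecting the clock keeps the tracker testable. A real implementation would also need to handle concurrency, primary term changes, and failed syncs, none of which are modeled here.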

In doing this, we will have two sets of metrics: one computed from a replica's perspective according to its latest received checkpoint, which does not account for the time taken to publish checkpoints, and another computed from the primary's perspective according to its latest refreshed checkpoint, which does account for publish time.

@mch2 mch2 added bug Something isn't working untriaged labels Sep 19, 2023
@mch2 mch2 changed the title [BUG] Segment Replication aggregate metrics are misleading at a node level. [BUG] Segment Replication aggregate metrics are misleading at a node level Sep 19, 2023
@mch2 mch2 added the v2.11.0 Issues and PRs related to version 2.11.0 label Oct 3, 2023