Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor segRep metrics to also compute on replica. #9743

Closed
wants to merge 3 commits into from

Conversation

mch2
Copy link
Member

@mch2 mch2 commented Sep 4, 2023

Description

This PR updates Segment Replication stats tcompute bytes behind and lag metrics from replicas. This is critical so that on rollup at node level the bytes behind metric is an estimate of how far the replicas are on that node are from their primaries while bytes ahead is how far primaries on that node are ahead of their replicas.

Related Issues

Resolves #10123

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

github-actions bot commented Sep 4, 2023

Compatibility status:

Checks if related components are compatible with change e92aafe

Incompatible components

Incompatible components: [https://github.com/opensearch-project/k-nn.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/neural-search.git]

@github-actions
Copy link
Contributor

github-actions bot commented Sep 4, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Sep 4, 2023

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

github-actions bot commented Sep 4, 2023

Compatibility status:

Checks if related components are compatible with change 072ae99

Incompatible components

Incompatible components: [https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/sql.git, https://github.com/opensearch-project/neural-search.git]

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git]

This PR updates Segment Replication to also compute bytes behind and lag metrics from a replica.
It also differentiates these metrics when rolled up from primary captured (bytes ahead) and replica captured (bytes behind).
This is critical so that on rollup at node level the bytes behind metric is how far the replicas are on that node are from their primaries
while bytes ahead is how far primaries on that node are ahead of their replicas.

Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
Signed-off-by: Marc Handalian <handalm@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@mch2
Copy link
Member Author

mch2 commented Sep 25, 2023

This test is failing with these changes SegmentReplicationResizeRequestIT.testCreateShrinkIndexThrowsExceptionWhenReplicasBehind because the test is immediately refreshing and then verifying the bytes behind metric. The metric is showing as 0 given replicas have not yet received a checkpoint for the latest refresh point.

I don't think the resize API should be relying on these stats APIs to determine if replicas are current as they are estimates. We should instead be validating processed sequence numbers against its max to truly determine if a shard is current to the latest written ops.

@kotwanikunal
Copy link
Member

This draft PR has been stalled for a few weeks. Please re-open if you are still working on it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Search:Remote Search
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] Segment Replication aggregate metrics are misleading at a node level
2 participants