Fix getting replication type in NodeVersionAllocationDecider #12811

Open
wants to merge 2 commits into
base: main

KunjueYu

Description

This PR fixes the incorrect way of getting the replication type from node settings in org.opensearch.cluster.routing.allocation.decider.NodeVersionAllocationDecider. The replication type should instead be read from the index metadata. I also added a test that verifies that the primary shard can be allocated to a node with a higher version when the replication type is document.
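
To illustrate why the lookup source matters, here is a minimal, self-contained sketch. It does not use the OpenSearch classes: settings are modelled as plain maps, and the class name, the `fromSettings` helper, and the key string are assumptions made for illustration. The only behavior taken from this PR is that the replication type is index-scoped and defaults to document replication.

```java
import java.util.Map;

public class ReplicationTypeLookup {
    enum ReplicationType { DOCUMENT, SEGMENT }

    // Assumed key name, used here only for illustration.
    static final String REPLICATION_TYPE_KEY = "index.replication.type";

    // Hypothetical helper: resolves the replication type from a settings map,
    // falling back to DOCUMENT when the key is absent.
    static ReplicationType fromSettings(Map<String, String> settings) {
        String raw = settings.get(REPLICATION_TYPE_KEY);
        return raw == null ? ReplicationType.DOCUMENT : ReplicationType.valueOf(raw);
    }

    public static void main(String[] args) {
        // Node settings do not carry an index-scoped key, so the lookup
        // silently falls back to the default.
        Map<String, String> nodeSettings = Map.of("node.name", "node-1");
        // Index settings are where the replication type is actually stored.
        Map<String, String> indexSettings = Map.of(REPLICATION_TYPE_KEY, "SEGMENT");

        System.out.println("from node settings:  " + fromSettings(nodeSettings));
        System.out.println("from index settings: " + fromSettings(indexSettings));
    }
}
```

In this model, reading from the node settings always yields the default, so a check that should apply to segment-replication indices would never fire.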

Related Issues

Resolves #12744

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Contributor

❌ Gradle check result for d1b71d2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Mar 21, 2024

Compatibility status:

Checks if related components are compatible with change 5c30a74

Incompatible components

Skipped components

Compatible components

Compatible components: [https://github.com/opensearch-project/custom-codecs.git, https://github.com/opensearch-project/asynchronous-search.git, https://github.com/opensearch-project/anomaly-detection.git, https://github.com/opensearch-project/flow-framework.git, https://github.com/opensearch-project/job-scheduler.git, https://github.com/opensearch-project/cross-cluster-replication.git, https://github.com/opensearch-project/reporting.git, https://github.com/opensearch-project/opensearch-oci-object-storage.git, https://github.com/opensearch-project/geospatial.git, https://github.com/opensearch-project/common-utils.git, https://github.com/opensearch-project/k-nn.git, https://github.com/opensearch-project/alerting.git, https://github.com/opensearch-project/neural-search.git, https://github.com/opensearch-project/performance-analyzer-rca.git, https://github.com/opensearch-project/security-analytics.git, https://github.com/opensearch-project/notifications.git, https://github.com/opensearch-project/ml-commons.git, https://github.com/opensearch-project/index-management.git, https://github.com/opensearch-project/security.git, https://github.com/opensearch-project/observability.git, https://github.com/opensearch-project/performance-analyzer.git, https://github.com/opensearch-project/sql.git]

Signed-off-by: ludongyu <heidongjidian@163.com>
Signed-off-by: ludongyu <heidongjidian@163.com>
@KunjueYu force-pushed the fix-NodeVersionAllocationDecider branch from 47434ca to 5c30a74 on March 21, 2024 07:00
Contributor

✅ Gradle check result for 47434ca: SUCCESS


codecov bot commented Mar 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.36%. Comparing base (b15cb0c) to head (5c30a74).
Report is 627 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #12811      +/-   ##
============================================
- Coverage     71.42%   71.36%   -0.06%     
- Complexity    59978    60202     +224     
============================================
  Files          4985     5011      +26     
  Lines        282275   283557    +1282     
  Branches      40946    41089     +143     
============================================
+ Hits         201603   202373     +770     
- Misses        63999    64407     +408     
- Partials      16673    16777     +104     


Contributor

❕ Gradle check result for 5c30a74: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.SegmentReplicationUsingRemoteStoreIT.testRestartPrimary_NoReplicas

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Collaborator

@Bukhtawar Bukhtawar left a comment

Thanks for fixing this

@kkmr
Contributor

kkmr commented Mar 26, 2024

Thanks for submitting the PR. I have a couple of questions:

  1. What is the expected behaviour? Is it that we should be able to move replicas from "Segment replication" nodes to "Document replication" nodes?
  2. Are there any forward-backward compatibility issues with deploying this fix in production?

@KunjueYu
Author

> Thanks for submitting the PR. I have a couple of questions:
>
>   1. What is the expected behaviour? Is it that we should be able to move replicas from "Segment replication" nodes to "Document replication" nodes?
>   2. Are there any forward-backward compatibility issues with deploying this fix in production?

  1. The expected behavior is that the primary shard of a segment-replication index cannot be allocated to a node with a higher version than the node hosting the replica shard, while the primary shard of a document-replication index can be allocated to a node with a higher version.
    Because the replication type is an index-scoped setting, there are no "Segment replication" nodes or "Document replication" nodes, only "Segment replication" indices and "Document replication" indices.
  2. I don't think there are any forward-backward compatibility issues with deploying this fix in production.
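
The rule described in point 1 can be sketched as follows. This is a simplified model and not the decider's actual code: node versions are represented as plain integers, and `PrimaryVersionRule` and `canAllocatePrimary` are hypothetical names. It captures only the stated rule and ignores every other check an allocation decider performs.

```java
import java.util.List;

public class PrimaryVersionRule {
    enum ReplicationType { DOCUMENT, SEGMENT }

    // Hypothetical helper modelling the rule stated above:
    // - document replication: the primary may go to a node with a higher version;
    // - segment replication: the primary may not go to a node whose version is
    //   higher than that of any node currently holding a replica.
    static boolean canAllocatePrimary(ReplicationType indexReplicationType,
                                      int targetNodeVersion,
                                      List<Integer> replicaNodeVersions) {
        if (indexReplicationType != ReplicationType.SEGMENT) {
            return true;
        }
        for (int replicaNodeVersion : replicaNodeVersions) {
            if (targetNodeVersion > replicaNodeVersion) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Integer> replicaNodes = List.of(2);
        // The same target node and replica layout, differing only in the
        // index's replication type.
        System.out.println("segment index, newer target node:  "
            + canAllocatePrimary(ReplicationType.SEGMENT, 3, replicaNodes));
        System.out.println("document index, newer target node: "
            + canAllocatePrimary(ReplicationType.DOCUMENT, 3, replicaNodes));
    }
}
```

The replication type is passed in per index rather than per node, which mirrors the point that there are only segment-replication or document-replication indices.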

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Apr 26, 2024
@KunjueYu KunjueYu requested a review from Bukhtawar April 28, 2024 08:57
@linuxpi
Collaborator

linuxpi commented May 2, 2024

[Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 11 12 13]

@KunjueYu Thanks for opening this PR. Please add a release target label and double check if this is actually related to Remote Store, else remove the label

@KunjueYu
Author

KunjueYu commented May 6, 2024

> [Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 11 12 13]
>
> @KunjueYu Thanks for opening this PR. Please add a release target label and double check if this is actually related to Remote Store, else remove the label

I don't have permission to edit the labels on this PR. This PR is not related to Remote Store, so the label should be removed. I am not familiar with choosing a release target label; maybe the v2.13.1 label can be added?

@linuxpi
Collaborator

linuxpi commented May 8, 2024

> > [Storage Triage - attendees 1 2 3 4 5 6 7 8 9 10 11 12 13]
> > @KunjueYu Thanks for opening this PR. Please add a release target label and double check if this is actually related to Remote Store, else remove the label
>
> I don't have permission to edit the labels on this PR. This PR is not related to Remote Store, so the label should be removed. I am not familiar with choosing a release target label; maybe the v2.13.1 label can be added?

Removed the storage label. If you are targeting the v2.15 release (the next release), we can add the 2.15 label.

Collaborator

@gaobinlong gaobinlong left a comment

I think we need to add some integration test cases to make sure this allocation decider works in a real cluster; we could add more test cases to ClusterAllocationExplainIT or SegmentReplicationAllocationIT.
By the way, a changelog entry is needed for this change.


    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        if (shardRouting.primary()) {
            IndexMetadata indexMd = allocation.metadata().getIndexSafe(shardRouting.index());
            final ReplicationType replicationType = IndexMetadata.INDEX_REPLICATION_TYPE_SETTING.get(indexMd.getSettings());
            if (replicationType == ReplicationType.SEGMENT) {
Collaborator

Could you help check why the existing unit tests don't cover this case? I see some test cases that might cover it, such as testRebalanceDoesNotAllocatePrimaryOnHigherVersionNodesSegrepEnabled and testRebalanceDoesNotAllocatePrimaryAndReplicasOnDifferentVersionNodes, but they passed.

@@ -671,6 +668,98 @@ public void testRebalanceDoesNotAllocatePrimaryOnHigherVersionNodesSegrepEnabled
        );
    }

    public void testCanAllocatePrimaryOnHigherVersionNodesDocRepEnabled() {
Collaborator

Since the SETTING_REPLICATION_TYPE setting defaults to document replication, could you check whether this new test case duplicates the existing test cases?

@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added stalled Issues that have stalled and removed stalled Issues that have stalled labels Jun 12, 2024
@opensearch-trigger-bot
Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added the stalled Issues that have stalled label Jul 29, 2024
@linuxpi
Collaborator

linuxpi commented Aug 1, 2024

@KunjueYu, can you check @gaobinlong's comments and help take the PR to closure?

@vikasvb90 vikasvb90 added Storage Issues and PRs relating to data and metadata storage Storage:Remote and removed Indexing:Replication Issues and PRs related to core replication framework eg segrep labels Aug 5, 2024
@opensearch-trigger-bot opensearch-trigger-bot bot removed the stalled Issues that have stalled label Aug 11, 2024
Labels
bug (Something isn't working), Cluster Manager, Storage:Remote, Storage (Issues and PRs relating to data and metadata storage)
Projects
Status: No status
Status: 🏗 In progress
Development

Successfully merging this pull request may close these issues.

[BUG] NodeVersionAllocationDecider should get Replication Type from index meta instead of node settings
6 participants