Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to 3.0.0, run backwards compatibility with OpenSearch 2.4.0 #242

Merged
merged 3 commits into from
Sep 22, 2022

Conversation

dblock
Copy link
Member

@dblock dblock commented Sep 14, 2022

Signed-off-by: dblock dblock@dblock.org

Description

OpenSearch 3.0 cannot be directly upgraded from OpenSearch 1.x. This changes the BCW to use OpenSearch 2.4.0. The latest version of the job-scheduler plugin is also downloaded from CI.

We'll need to find a way to automate getting "the latest 2.x version" if we don't want to keep updating this number here every time a 2.x releases.

Issues Resolved

opensearch-project/OpenSearch#3615

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: dblock <dblock@dblock.org>
@dblock dblock requested a review from a team September 14, 2022 16:49
@dblock dblock changed the title Fix: run backwards compatibility with OpenSearch 2.1.0. Fix: run backwards compatibility with OpenSearch 2.x.0 (compat with 3.0) Sep 14, 2022
joshpalis
joshpalis previously approved these changes Sep 14, 2022
minalsha
minalsha previously approved these changes Sep 14, 2022
@dblock dblock marked this pull request as draft September 14, 2022 21:21
nknize
nknize previously requested changes Sep 15, 2022
Copy link

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned this on the other issue. There's two kinds of compatibility, wire and index. For index compatibility the next major versions are compatible back to the previous major release (e.g., 3.0.0 is index compatible back to 2.0.0). For wire compatibility the next major version is only compatible back to the previous minor (e.g., 2.4.0). This is how API deprecation / removals are handled across minor releases to support rolling upgrades.

Opening a PR to make BWC testing a 3.0.0 plugin w/ any minor release breaks this deprecation / removal paradigm. It's not supported so you should see all sorts of BWC test failures. We'd have to redesign core and introduce an internal API versioning scheme to change this design which would require us to support API deprecation / removals across the entire minor suite of a major line. This has implications deeper into deprecation / removals w/in lucene compatibility.

@nknize
Copy link

nknize commented Sep 15, 2022

Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: incoming term 1 not greater than current term 2

This exception during the master election process is due to wire incompatibility when trying to send the cluster state. Either the exception for the version compatibility check is getting swallowed somewhere or it's incorrectly passing. Either way, this is due to 3.0.0 not being transport compatible w/ 2.1.0 and the error message is just not very useful at communicating that here.

@dblock
Copy link
Member Author

dblock commented Sep 15, 2022

@nknize I understand. You're saying that this PR should test against 2.4.0 (whatever is in 2.x). Now I just need to make it work. Do you know of a dynamic way to get the "2.x -> 2.4.0" information? Because I need to fish the job-scheduler plugin out of 2.4.0. Or otherwise maybe try to build the job-scheduler from 2.x as well instead of trying to download it...

@nknize
Copy link

nknize commented Sep 15, 2022

Or otherwise maybe try to build the job-scheduler from 2.x as well instead of trying to download it...

Correct. For BWC testing the plugin repository's branching process and logic needs to be consistent w/ the core branching logic. That is the 2.x branch should always be the next release (in this case 2.4).

@dblock dblock dismissed stale reviews from minalsha and joshpalis via a5c551a September 15, 2022 19:35
Signed-off-by: dblock <dblock@dblock.org>
@dblock
Copy link
Member Author

dblock commented Sep 15, 2022

Putting everything on 2.4.0. I see a failure starting the cluster. So far not able to fix it :(

./gradlew :opensearch-job-scheduler-sample-extension:jobSchedulerBwcCluster#oldVersionClusterTask1
[2022-09-15T19:50:38,131][INFO ][o.o.c.c.JoinHelper       ] [jobSchedulerBwcCluster1-2] failed to join {jobSchedulerBwcCluster1-0}{V4WupgDqQ3iqweU4eOctrg}{ZViVa0fIQMmAcowurwWASg}{127.0.0.1}{127.0.0.1:41101}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} with JoinRequest{sourceNode={jobSchedulerBwcCluster1-2}{5MdodgRWRkCzQW_YWCGxhg}{usWKIy_RTLukWftG5wyc7A}{127.0.0.1}{127.0.0.1:45533}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, minimumTerm=0, optionalJoin=Optional[Join{term=1, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={jobSchedulerBwcCluster1-2}{5MdodgRWRkCzQW_YWCGxhg}{usWKIy_RTLukWftG5wyc7A}{127.0.0.1}{127.0.0.1:45533}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, targetNode={jobSchedulerBwcCluster1-0}{V4WupgDqQ3iqweU4eOctrg}{ZViVa0fIQMmAcowurwWASg}{127.0.0.1}{127.0.0.1:41101}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}]}
org.opensearch.transport.RemoteTransportException: [jobSchedulerBwcCluster1-0][127.0.0.1:41101][internal:cluster/coordination/join]
Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: incoming term 1 does not match current term 2
	at org.opensearch.cluster.coordination.CoordinationState.handleJoin(CoordinationState.java:256) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:1172) ~[opensearch-2.4.0.jar:2.4.0]
	at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
	at org.opensearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:640) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$7(Coordinator.java:603) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.ClusterConnectionManager.connectToNode(ClusterConnectionManager.java:138) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.TransportService.connectToNode(TransportService.java:437) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.TransportService.connectToNode(TransportService.java:421) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:588) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.cluster.coordination.JoinHelper.lambda$new$1(JoinHelper.java:184) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.4.0.jar:2.4.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2022-09-15T19:50:38,137][DEBUG][o.o.c.c.JoinHelper       ] [jobSchedulerBwcCluster1-2] failed to join {jobSchedulerBwcCluster1-0}{V4WupgDqQ3iqweU4eOctrg}{ZViVa0fIQMmAcowurwWASg}{127.0.0.1}{127.0.0.1:41101}{dimr}{testattr=test, shard_indexing_pressure_enabled=true} with JoinRequest{sourceNode={jobSchedulerBwcCluster1-2}{5MdodgRWRkCzQW_YWCGxhg}{usWKIy_RTLukWftG5wyc7A}{127.0.0.1}{127.0.0.1:45533}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, minimumTerm=0, optionalJoin=Optional[Join{term=1, lastAcceptedTerm=0, lastAcceptedVersion=0, sourceNode={jobSchedulerBwcCluster1-2}{5MdodgRWRkCzQW_YWCGxhg}{usWKIy_RTLukWftG5wyc7A}{127.0.0.1}{127.0.0.1:45533}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}, targetNode={jobSchedulerBwcCluster1-0}{V4WupgDqQ3iqweU4eOctrg}{ZViVa0fIQMmAcowurwWASg}{127.0.0.1}{127.0.0.1:41101}{dimr}{testattr=test, shard_indexing_pressure_enabled=true}}]}
org.opensearch.transport.RemoteTransportException: [jobSchedulerBwcCluster1-0][127.0.0.1:41101][internal:cluster/coordination/join]
Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: incoming term 1 does not match current term 2
	at org.opensearch.cluster.coordination.CoordinationState.handleJoin(CoordinationState.java:256) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:1172) ~[opensearch-2.4.0.jar:2.4.0]
	at java.util.Optional.ifPresent(Optional.java:178) ~[?:?]
	at org.opensearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:640) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$7(Coordinator.java:603) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.ClusterConnectionManager.connectToNode(ClusterConnectionManager.java:138) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.TransportService.connectToNode(TransportService.java:437) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.TransportService.connectToNode(TransportService.java:421) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:588) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.cluster.coordination.JoinHelper.lambda$new$1(JoinHelper.java:184) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.4.0.jar:2.4.0]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.4.0.jar:2.4.0]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

@dblock
Copy link
Member Author

dblock commented Sep 16, 2022

cc:@reta ideas?

@reta
Copy link
Contributor

reta commented Sep 17, 2022

cc:@reta ideas?

@dblock hm ... don't know that cause yet but I will dig in

Copy link

@nknize nknize left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick checkin: I poked at this for a short bit and was able to manually form two separate clusters:

  1. 3 node mixed cluster (2 - 2.4.0, 1 - 3.0.0) w/ clean builds of 2.4.0 and 3.0.0 from the core repo
  2. 3 node cluster (3 - 2.4.0) w/ job-scheduler's bwc checkout of 2.4.0 (freshly built in .gradle/caches by the bwc scaffolding)

This tells me it's not an issue in core but possibly something up w/ the gradle bwc test configuration (which looks to be a copy of core's bwc config - e.g., RecoveryIT).

It's also odd to me that gradle executes the entire sample-extensions test suite despite passing in the single test using the --tests flag. But that's likely a separate issue. If I have some time I'll poke at this a little deeper today.

@dblock
Copy link
Member Author

dblock commented Sep 19, 2022

I tried to bisect the problem and didn't get anywhere, tried at cb238aae616d6a0fd8f82e128a1f94c8e4e8b1f7 (when we bumped version to 3.0), same problem. Trying to go back further makes a mess. I still can't understand why the test cluster doesn't come up though, which seems to be the key problem.

I also tried using out-of-the-box 2.4.0 on the configuration generated by the bcw tests and my cluster does't even get into discovery, loops over cluster-manager not discovered or elected yet, an election requires at least 2 nodes with ids ... over all nodes.

@saratvemulapalli
Copy link
Member

I'll take a stab at this.
I will start here:
Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: incoming term 1 does not match current term 2

@reta
Copy link
Contributor

reta commented Sep 20, 2022

@dblock @nknize I think I found out the cause: it seems not related to cluster manager or election, but the fact that the cluster type was set to old (property tests.rest.bwcsuite). In this case, the test verifies that old opendistro plugin should be installed.

// Create two test clusters with 3 nodes of the old version
2.times {i ->
    task "${baseName}#oldVersionClusterTask$i"(type: StandaloneRestIntegTestTask) {
        dependsOn 'prepareBwcTests'
        useCluster testClusters."${baseName}$i"
        filter {
            includeTestsMatching "org.opensearch.jobscheduler.sampleextension.bwc.*IT"
        }
        systemProperty 'tests.rest.bwcsuite', 'old_cluster' <--- problem here
        systemProperty 'tests.rest.bwcsuite_round', 'old'
        systemProperty 'tests.plugin_bwc_version', bwcPluginVersion
        nonInputProperties.systemProperty('tests.rest.cluster', "${-> testClusters."${baseName}$i".allHttpSocketURI.join(",")}")
        nonInputProperties.systemProperty('tests.clustername', "${-> testClusters."${baseName}$i".getName()}")
    }
}

Signed-off-by: dblock <dblock@dblock.org>
@dblock
Copy link
Member Author

dblock commented Sep 20, 2022

@reta THANK YOU. I am so dumb. Instead of reading the actual output of the failed test I spent days chasing a red herring on why the cluster isn't coming up. It just wasn't given enough time given the earlier assertion failure.

The old version is no longer opendistro, so I merged the OLD and MIXED assertions, getting rid of the opendistro variation that checks for an old plugin name.

@codecov-commenter
Copy link

codecov-commenter commented Sep 20, 2022

Codecov Report

Merging #242 (1b07915) into main (6cf9b30) will decrease coverage by 0.22%.
The diff coverage is n/a.

@@             Coverage Diff              @@
##               main     #242      +/-   ##
============================================
- Coverage     53.19%   52.96%   -0.23%     
+ Complexity       65       64       -1     
============================================
  Files             8        8              
  Lines           438      438              
  Branches         50       50              
============================================
- Hits            233      232       -1     
  Misses          186      186              
- Partials         19       20       +1     
Impacted Files Coverage Δ
...pensearch/jobscheduler/scheduler/JobScheduler.java 72.94% <0.00%> (-1.18%) ⬇️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@dblock dblock marked this pull request as ready for review September 20, 2022 20:10
@dblock dblock requested a review from nknize September 20, 2022 20:10
@dblock dblock changed the title Fix: run backwards compatibility with OpenSearch 2.x.0 (compat with 3.0) Update version to 3.0.0, run backwards compatibility with OpenSearch 2.4.0 Sep 20, 2022
@dblock dblock changed the title Update version to 3.0.0, run backwards compatibility with OpenSearch 2.4.0 Update to 3.0.0, run backwards compatibility with OpenSearch 2.4.0 Sep 20, 2022
@@ -12,3 +12,4 @@ build
bin/
.classpath
.vscode
sample-extension-plugin/src/test/resources/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it accidental? Seems like legit non-ignorable target

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not. The old code was downloading older versions of job-scheduler into src/test/resources/bwc/job-scheduler so I kept it. The folder doesn't exist in the tree. Maybe it should download to /tmp, but I didn't want to change too many things. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh ... I see, thanks for explaining, let's keep it like that indeed

@dblock dblock dismissed nknize’s stale review September 22, 2022 18:07

Let's get this unblocked!

@dblock
Copy link
Member Author

dblock commented Sep 22, 2022

@nknize I dismissed your review to get us unblocked, feel free to chime in if you think this needs more changes and I can followup

@saratvemulapalli +1 so we can merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants