-
Notifications
You must be signed in to change notification settings - Fork 25.4k
Keep track of desired nodes status in cluster state #87474
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Keep track of desired nodes status in cluster state #87474
Conversation
e002c53
to
319e9d6
Compare
I opened fcofdez#1 with the |
I'm checking why the upgrade test is failing, but I think the PR is ready for review. |
Pinging @elastic/es-distributed (Team:Distributed) |
Hi @fcofdez, I've created a changelog YAML for you. |
Pinging @elastic/clients-team (Team:Clients) |
…ez/elasticsearch into desired-nodes-status-cluster-state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read through most of the production code, looks like a good direction.
if (desiredNodes == null) { | ||
return clusterState; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This null check seems redundant?
return clusterState; | ||
} | ||
|
||
final Map<String, DesiredNodeWithStatus> updatedStateDesiredNodes = new HashMap<>(desiredNodes.nodes); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this map null initially, initialize it during the loop and use the non-nullness instead of the statusModified
flag? Doing so avoids some work for all the cases where there is no change.
if (in.getVersion().onOrAfter(STATUS_TRACKING_SUPPORT_VERSION)) { | ||
status = Status.fromValue(in.readShort()); | ||
} else { | ||
status = Status.defaultStatus(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if PENDING is the right status here? I would rather assume that information from previous versions were actualized, since a forever pending node is unexpected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that makes sense, and I think we should get a new desired nodes version in this scenario since the cluster is upgraded at that point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One downside of using ACTUALIZED
here is that we might end up taking a decision based on made-up information, i.e. a node in certain tier was supposed to be join to the cluster but never did, we might decide to move some shards to that inexistent tier? Maybe I'm overthinking this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that is true. I think we could perhaps fix this by also updating actualized state in JoinTaskExecutor.becomeMasterAndTrimConflictingNodes
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already cover that as when we call becomeMasterAndTrimConflictingNodes
we set nodesChanged
and we update the desired nodes status when some of the nodes have changed.
(ByteSizeValue) args[4], | ||
(Version) args[5] | ||
), | ||
args[6] == null ? Status.defaultStatus() : (Status) args[6] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd find it easier to read to put Status.PENDING
here, I am not sure the defaultStatus
method adds value? At least I had to go and look up what constant it returns.
I also wonder if this case should use actualized instead. I suppose a master reading this from xcontent would see all nodes join and thus this might be ok here, do you have input to this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reasoning here was that we would end up updating to the proper status as the cluster is upgraded and nodes re-join the cluster (we would reconcile the status at that point), so if there's a missing node we'll still see it as PENDING
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a rolling upgrade, I would think that an upgraded master keeps the list of nodes in cluster state and thus the way we update desired nodes inside the if (nodesChanged)
in JoinTaskExecutor
will (I think) not work to fix this. The new master will be joined by a number of existing nodes.
We should probably write a test case to demonstrate that this works. A rolling upgrade style test with multiple nodes should not be too bad to write perhaps - we just need to set desired nodes on old version, then rolling upgrade, then check that they are all actualized on new version? Happy to go some other route too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, above is mostly relevant for the stream-reading case. Clearly, if we read from xcontent, we would be joined by all nodes, so this case does sound good. Still would be very nice to test it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably write a test case to demonstrate that this works. A rolling upgrade style test with multiple nodes should not be too bad to write perhaps - we just need to set desired nodes on old version, then rolling upgrade, then check that they are all actualized on new version? Happy to go some other route too.
DesiredNodesUpgradeIT
tests that scenario, I've just added the assertion to ensure that all nodes are ACTUALIZED
when the cluster is upgraded.
"GET" | ||
] | ||
} | ||
] | ||
}, | ||
"params": { | ||
"include_status": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we need this flag? I think we discussed it but do not precisely recall a conclusion.
It sort of increases our BWC surface. And I am in doubt if anyone will ever want to not see the status. An orchestration system should build up their next desired nodes independently of GET _internal/desired_nodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, it doesn't add much value. I added the flag mostly to ensure that upgrades work as expected, so not a very good reason to increase our BWC surface.
|
||
@Override | ||
public void onFailure(Exception e) { | ||
if (MasterService.isPublishFailureException(e)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this essential? I am not sure I follow, i.e., if reroute fails for other reasons than publishing, it seems like a bug in our code?
It could simplify all of this a lot if we just succeeded all task-contexts and overrode clusterStatePublished
on the executor to do the reroute, what would go wrong if we did this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could simplify all of this a lot if we just succeeded all task-contexts and overrode clusterStatePublished on the executor to do the reroute, what would go wrong if we did this?
I agree, this is a bit complex but it has the nice property of notifying the update desired nodes listeners after the reroute finishes. It's true that this is not super important for orchestrators though, it mostly simplifies our testing code. I'll take your approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went through most of the code and wanted to provide my comments now. This direction looks good though I fear that there is an edge case we are not catching, see comments.
rest-api-spec/src/main/resources/rest-api-spec/api/_internal.get_desired_nodes.json
Outdated
Show resolved
Hide resolved
server/src/internalClusterTest/java/org/elasticsearch/cluster/DesiredNodesStatusIT.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/metadata/DesiredNodesTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/metadata/DesiredNodesTestCase.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/metadata/DesiredNodesTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/metadata/DesiredNodesTests.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/coordination/JoinTaskExecutor.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodeWithStatus.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodes.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few final comments otherwise this looks good to me.
server/src/test/java/org/elasticsearch/cluster/metadata/DesiredNodesTestCase.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodes.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/action/admin/cluster/desirednodes/TransportUpdateDesiredNodesAction.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/action/admin/cluster/desirednodes/TransportUpdateDesiredNodesAction.java
Outdated
Show resolved
Hide resolved
qa/rolling-upgrade/src/test/java/org/elasticsearch/upgrades/DesiredNodesUpgradeIT.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/metadata/DesiredNodeWithStatus.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/coordination/JoinTaskExecutorTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/coordination/JoinTaskExecutorTests.java
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/coordination/JoinTaskExecutorTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/cluster/metadata/DesiredNodesTests.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for all your efforts on this.
Thanks for the review Henning! |
This commit adds desired nodes status tracking to the cluster state. Previously status was tracked
in-memory by
DesiredNodesMembershipService
this approach had certain limitations, and madethe consumer code more complex. This takes a simpler approach to keep the status updated when
the desired nodes are updated or when a new node joins, storing the status in the cluster state,
this allows to consume that information easily where it is necessary.
Additionally, this commit moves test code from depending directly of
DesiredNodes
which can beseen as an internal data structure to rely more on
UpdateDesiredNodesRequest
.Relates #84165