Keep track of desired nodes status in cluster state #87474

Merged: 23 commits into elastic:master on Jun 16, 2022

Conversation

@fcofdez fcofdez commented Jun 7, 2022

This commit adds desired nodes status tracking to the cluster state. Previously the status was tracked
in memory by DesiredNodesMembershipService; that approach had certain limitations and made
the consumer code more complex. This commit takes a simpler approach: the status is kept up to date when
the desired nodes are updated or when a new node joins, and it is stored in the cluster state,
which makes the information easy to consume wherever it is needed.
Additionally, this commit moves test code away from depending directly on DesiredNodes, which can be
seen as an internal data structure, and relies more on UpdateDesiredNodesRequest.

Relates #84165
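
For orientation, a minimal sketch of the idea, using simplified, illustrative types rather than the PR's actual classes:

import java.util.Map;

// Illustrative only: each desired node carries a status that lives in the cluster state,
// so consumers read it from the published metadata instead of an in-memory service.
enum Status { PENDING, ACTUALIZED }

record DesiredNodeWithStatus(String externalId, Status status) {}

record DesiredNodesMetadata(long version, Map<String, DesiredNodeWithStatus> nodes) {}

// The master recomputes the statuses whenever the desired nodes are updated or a node joins,
// and publishes the result as part of the next cluster state.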

@fcofdez fcofdez force-pushed the desired-nodes-status-cluster-state branch from e002c53 to 319e9d6 on June 8, 2022 08:57
@fcofdez fcofdez changed the title from "Keep track of desired nodes membership in cluster state" to "Keep track of desired nodes status in cluster state" on Jun 8, 2022
fcofdez commented Jun 8, 2022

I opened fcofdez#1 with the DataTierAllocationDecider changes on top of this PR to see what the changes look like and get two separate PRs.

@fcofdez fcofdez marked this pull request as ready for review June 8, 2022 09:56
@fcofdez fcofdez requested a review from henningandersen June 8, 2022 09:57
fcofdez commented Jun 8, 2022

I'm checking why the upgrade test is failing, but I think the PR is ready for review.

@fcofdez fcofdez added the Team:Distributed (Obsolete) label on Jun 8, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine
Collaborator

Hi @fcofdez, I've created a changelog YAML for you.

@sethmlarson sethmlarson added the Team:Clients label on Jun 8, 2022
@elasticmachine
Collaborator

Pinging @elastic/clients-team (Team:Clients)

fcofdez added 2 commits June 8, 2022 14:49
…ez/elasticsearch into desired-nodes-status-cluster-state
@henningandersen henningandersen left a comment

I read through most of the production code; it looks like a good direction.

Comment on lines 249 to 251
if (desiredNodes == null) {
return clusterState;
}
Contributor

This null check seems redundant?

return clusterState;
}

final Map<String, DesiredNodeWithStatus> updatedStateDesiredNodes = new HashMap<>(desiredNodes.nodes);
Contributor

Can we make this map null initially, initialize it during the loop and use the non-nullness instead of the statusModified flag? Doing so avoids some work for all the cases where there is no change.
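
Roughly the suggested shape (a sketch with illustrative types, not the actual diff):

import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Allocate the copy lazily; its non-nullness replaces the separate statusModified flag,
// so the no-change case does no extra work.
static Map<String, Status> updatedStatuses(Map<String, Status> current, Set<String> joinedNodeIds) {
    Map<String, Status> updated = null;
    for (Map.Entry<String, Status> entry : current.entrySet()) {
        if (entry.getValue() == Status.PENDING && joinedNodeIds.contains(entry.getKey())) {
            if (updated == null) {
                updated = new HashMap<>(current); // first change: copy once
            }
            updated.put(entry.getKey(), Status.ACTUALIZED);
        }
    }
    return updated; // null means "no change"; the caller can return the original cluster state
}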

if (in.getVersion().onOrAfter(STATUS_TRACKING_SUPPORT_VERSION)) {
status = Status.fromValue(in.readShort());
} else {
status = Status.defaultStatus();
Contributor

I wonder if PENDING is the right status here? I would rather assume that information from previous versions was actualized, since a forever-pending node is unexpected.

Contributor Author

Yes, that makes sense, and I think we should get a new desired nodes version in this scenario since the cluster is upgraded at that point.

Contributor Author

One downside of using ACTUALIZED here is that we might end up making a decision based on made-up information, i.e. a node in a certain tier was supposed to join the cluster but never did, and we might decide to move some shards to that nonexistent tier? Maybe I'm overthinking this...

Contributor

Yes, that is true. I think we could perhaps fix this by also updating actualized state in JoinTaskExecutor.becomeMasterAndTrimConflictingNodes?

Contributor Author

We already cover that: when we call becomeMasterAndTrimConflictingNodes we set nodesChanged, and we update the desired nodes status when any of the nodes have changed.
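
For context on the wire handling discussed in this thread, the write side would be gated on the same version constant; a sketch (the enum's wire-value accessor is an assumption):

if (out.getVersion().onOrAfter(STATUS_TRACKING_SUPPORT_VERSION)) {
    out.writeShort(status.getValue()); // hypothetical accessor; older nodes never receive the status
}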

(ByteSizeValue) args[4],
(Version) args[5]
),
args[6] == null ? Status.defaultStatus() : (Status) args[6]
Contributor

I'd find it easier to read to put Status.PENDING here; I am not sure the defaultStatus method adds value. At least I had to go and look up which constant it returns.

I also wonder if this case should use actualized instead. I suppose a master reading this from xcontent would see all nodes join, so this might be ok here; do you have input on this?

Contributor Author

My reasoning here was that we would end up updating to the proper status as the cluster is upgraded and nodes re-join the cluster (we would reconcile the status at that point), so if there's a missing node we'll still see it as PENDING.

Contributor

On a rolling upgrade, I would think that an upgraded master keeps the list of nodes in cluster state and thus the way we update desired nodes inside the if (nodesChanged) in JoinTaskExecutor will (I think) not work to fix this. The new master will be joined by a number of existing nodes.

We should probably write a test case to demonstrate that this works. A rolling-upgrade-style test with multiple nodes should not be too bad to write; we just need to set desired nodes on the old version, then do a rolling upgrade, then check that they are all actualized on the new version. Happy to go some other route too.

Contributor

Sorry, the above is mostly relevant for the stream-reading case. Clearly, if we read from xcontent, we would be joined by all nodes, so this case does sound good. Still, it would be very nice to test it.

Contributor Author

> We should probably write a test case to demonstrate that this works. A rolling-upgrade-style test with multiple nodes should not be too bad to write; we just need to set desired nodes on the old version, then do a rolling upgrade, then check that they are all actualized on the new version. Happy to go some other route too.

DesiredNodesUpgradeIT tests that scenario; I've just added an assertion to ensure that all nodes are ACTUALIZED once the cluster is upgraded.
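
Going back to the parser snippet at the top of this thread, the optional status argument can be declared along these lines (a sketch; the field name and the chosen default are assumptions):

// Declared as an optional constructor arg so documents written by older versions still parse;
// when the field is absent, args[6] is null and the default status is used.
PARSER.declareField(
    ConstructingObjectParser.optionalConstructorArg(),
    p -> Status.fromValue(p.shortValue()),
    new ParseField("status"),
    ObjectParser.ValueType.INT
);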

"GET"
]
}
]
},
"params": {
"include_status": {
Contributor

I wonder if we need this flag? I think we discussed it but I do not precisely recall the conclusion.

It sort of increases our BWC surface, and I doubt anyone will ever want to not see the status. An orchestration system should build up its next desired nodes independently of GET _internal/desired_nodes.

Contributor Author

I agree, it doesn't add much value. I added the flag mostly to ensure that upgrades work as expected, so not a very good reason to increase our BWC surface.


@Override
public void onFailure(Exception e) {
if (MasterService.isPublishFailureException(e)) {
Contributor

Is this essential? I am not sure I follow, i.e., if reroute fails for reasons other than publishing, it seems like a bug in our code?

It could simplify all of this a lot if we just succeeded all task-contexts and overrode clusterStatePublished on the executor to do the reroute, what would go wrong if we did this?

Contributor Author

> It could simplify all of this a lot if we just succeeded all task-contexts and overrode clusterStatePublished on the executor to do the reroute, what would go wrong if we did this?

I agree, this is a bit complex, but it has the nice property of notifying the update-desired-nodes listeners after the reroute finishes. It's true that this is not super important for orchestrators though; it mostly simplifies our testing code. I'll take your approach.
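
The resulting shape, roughly (a sketch; the exact executor hook and reroute-service signatures vary across versions, so treat the parameters here as assumptions):

// Every task context is marked as succeeded in execute(); the reroute runs once the new
// state has actually been published, with no per-task failure handling needed.
@Override
public void clusterStatePublished(ClusterState newClusterState) { // assumed signature
    rerouteService.reroute("desired nodes updated", Priority.NORMAL, ActionListener.noop());
}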

@fcofdez fcofdez requested a review from henningandersen June 13, 2022 11:38
@henningandersen henningandersen left a comment

I went through most of the code and wanted to provide my comments now. This direction looks good, though I fear there is an edge case we are not catching; see comments.

@fcofdez fcofdez requested a review from henningandersen June 15, 2022 07:36
@henningandersen henningandersen left a comment

Left a few final comments; otherwise this looks good to me.

@fcofdez fcofdez requested a review from henningandersen June 15, 2022 18:39
@henningandersen henningandersen left a comment

LGTM, thanks for all your efforts on this.

@fcofdez fcofdez merged commit eb8c4ba into elastic:master Jun 16, 2022
fcofdez commented Jun 16, 2022

Thanks for the review Henning!

Labels
:Distributed Coordination/Autoscaling, >enhancement, Team:Clients, Team:Distributed (Obsolete), v8.4.0