Skip to content

Add a cluster listener to fix missing cluster features after upgrade #110710

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Jul 16, 2024

Conversation

thecoop
Copy link
Member

@thecoop thecoop commented Jul 10, 2024

The cluster listener spots the instances when the master has been upgraded to the most recent code version, along with non-master nodes, but some nodes are missing cluster features

Fixes #109254

@thecoop
Copy link
Member Author

thecoop commented Jul 10, 2024

This doesn't have an integration test attached, as our existing upgrade tests don't have non-master nodes in them.

.nodes()
.stream()
.map(DiscoveryNode::getId)
.filter(n -> nodeVersions.get(n).transportVersion().onOrAfter(TransportVersions.NODE_FEATURES_QUERY_ACTION))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is not actually reliable for upgrading from versions before 8.8.0, due to the transport version being inferred to start with. This would require this listener to run strictly after TransportVersionsFixupListener has completed, but I don't think it's possible to add a dependency in this way.

Copy link
Member Author

@thecoop thecoop Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So our options here are:

  1. Combine this and TransportVersionFixupListener into a complicated multi-step async listener thing that does the upgrade operations that are required in strict order (whilst dealing with failures, retries, etc)
  2. Use Version instead of TransportVersion. This is actually ok, because this case will never be hit on Serverless (nodes will always have features), so the action that may or may not be there will never get called. This call will only happen on on-prem installs, where the Version is applicable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Version (option 2.) seems ok to me as well in this case.
Alternatively we could listen for the fix implemented by NodeFeaturesFixListener listener in addition to make sure the fix has been applied.

@thecoop thecoop added auto-backport Automatically create backport pull requests when merged v8.15.0 and removed WIP labels Jul 11, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@elasticsearchmachine elasticsearchmachine added the Team:Core/Infra Meta label for core/infra team label Jul 11, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @thecoop, I've created a changelog YAML for you.

Copy link
Contributor

@prdoyle prdoyle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid my review is fairly superficial.

This PR seems like a great one to bookmark for the next person who wants to add a transport action.

Copy link
Contributor

@ldematte ldematte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, I have a couple of questions

import java.util.List;

@UpdateForV9
// @UpdateForV10 // this can be removed in v10. It may be called by v8 nodes to v9 nodes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to this PR, but should we already create @UpdateForV10?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably - or just create it as part of the v9 migration

for (var c : context.taskContexts()) {
for (var e : c.getTask().results().entrySet()) {
// double check there are still no features for the node
if (existingFeatures.getOrDefault(e.getKey(), Set.of()).isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?
Should not we merge the maps in any case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more being extra-safe - in the case the node has left and re-joined in the meantime it will have the features in there already, which may be different to those that were queried before the node was shut down.

@thecoop thecoop requested a review from a team July 15, 2024 09:22
Copy link
Contributor

@ldematte ldematte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@thecoop thecoop merged commit e64aab1 into elastic:main Jul 16, 2024
15 checks passed
@thecoop thecoop deleted the cluster-features-fixup-listener branch July 16, 2024 13:33
thecoop added a commit to thecoop/elasticsearch that referenced this pull request Jul 16, 2024
…lastic#110710)

Non-master-eligible nodes that are already part of a cluster when the master is upgraded don't re-join the cluster, so their cluster features never get updated. This adds a cluster listener that spots this occurring, and manually gets the node's features with a new transport action and updates the cluster state after the fact.
@elasticsearchmachine
Copy link
Collaborator

💚 Backport successful

Status Branch Result
8.15

elasticsearchmachine pushed a commit that referenced this pull request Jul 16, 2024
…110710) (#110924)

Non-master-eligible nodes that are already part of a cluster when the master is upgraded don't re-join the cluster, so their cluster features never get updated. This adds a cluster listener that spots this occurring, and manually gets the node's features with a new transport action and updates the cluster state after the fact.
elasticsearchmachine pushed a commit that referenced this pull request Nov 11, 2024
…er upgrade (#115771)

This PR modifies `TransportVersionsFixupListener` to include all of
compatibility versions (not only TransportVersion) in the fixup.

`TransportVersionsFixupListener` spots the instances when the master has
been upgraded to the most recent code version, along with non-master
nodes, but some nodes are missing a "proper" (non-inferred) Transport
version. This PR adds another check to also ensure that we have real
(non-empty) system index mapping versions.

To do so, it modifies NodeInfo so it carries all of
CompatibilityVersions (TransportVersion +
SystemIndexDescriptor.MappingVersions).

This was initially done via a separate fixup listener + ad-hoc transport
action, but the 2 listeners "raced" to update ClusterState on the same
CompatibilityVersions structure; it just made sense to do it at the same
time.

The fixup is very similar to
#110710, which does the
same for cluster features; plus, it adds a CI test to cover the bug
raised in #112694

Closes #112694
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :Core/Infra/Core Core issues without another label Team:Core/Infra Meta label for core/infra team v8.15.0 v8.16.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Features not matching version after an upgrade to 8.13+
5 participants