-
Notifications
You must be signed in to change notification settings - Fork 25.4k
Add a cluster listener to fix missing cluster features after upgrade #110710
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a cluster listener to fix missing cluster features after upgrade #110710
Conversation
This doesn't have an integration test attached, as our existing upgrade tests don't have non-master nodes in them. |
.nodes() | ||
.stream() | ||
.map(DiscoveryNode::getId) | ||
.filter(n -> nodeVersions.get(n).transportVersion().onOrAfter(TransportVersions.NODE_FEATURES_QUERY_ACTION)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is not actually reliable for upgrading from versions before 8.8.0, due to the transport version being inferred to start with. This would require this listener to run strictly after TransportVersionsFixupListener
has completed, but I don't think it's possible to add a dependency in this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So our options here are:
- Combine this and
TransportVersionFixupListener
into a complicated multi-step async listener thing that does the upgrade operations that are required in strict order (whilst dealing with failures, retries, etc) - Use
Version
instead ofTransportVersion
. This is actually ok, because this case will never be hit on Serverless (nodes will always have features), so the action that may or may not be there will never get called. This call will only happen on on-prem installs, where the Version is applicable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using Version
(option 2.) seems ok to me as well in this case.
Alternatively we could listen for the fix implemented by NodeFeaturesFixListener
listener in addition to make sure the fix has been applied.
Pinging @elastic/es-core-infra (Team:Core/Infra) |
Hi @thecoop, I've created a changelog YAML for you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid my review is fairly superficial.
This PR seems like a great one to bookmark for the next person who wants to add a transport action.
server/src/main/java/org/elasticsearch/cluster/features/NodeFeaturesFixupListener.java
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/cluster/features/NodeFeaturesFixupListener.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, I have a couple of questions
import java.util.List; | ||
|
||
@UpdateForV9 | ||
// @UpdateForV10 // this can be removed in v10. It may be called by v8 nodes to v9 nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unrelated to this PR, but should we already create @UpdateForV10?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably - or just create it as part of the v9 migration
for (var c : context.taskContexts()) { | ||
for (var e : c.getTask().results().entrySet()) { | ||
// double check there are still no features for the node | ||
if (existingFeatures.getOrDefault(e.getKey(), Set.of()).isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this needed?
Should not we merge the maps in any case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more being extra-safe - in the case the node has left and re-joined in the meantime it will have the features in there already, which may be different to those that were queried before the node was shut down.
server/src/main/java/org/elasticsearch/cluster/features/NodeFeaturesFixupListener.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…lastic#110710) Non-master-eligible nodes that are already part of a cluster when the master is upgraded don't re-join the cluster, so their cluster features never get updated. This adds a cluster listener that spots this occurring, and manually gets the node's features with a new transport action and updates the cluster state after the fact.
💚 Backport successful
|
…110710) (#110924) Non-master-eligible nodes that are already part of a cluster when the master is upgraded don't re-join the cluster, so their cluster features never get updated. This adds a cluster listener that spots this occurring, and manually gets the node's features with a new transport action and updates the cluster state after the fact.
…er upgrade (#115771) This PR modifies `TransportVersionsFixupListener` to include all of compatibility versions (not only TransportVersion) in the fixup. `TransportVersionsFixupListener` spots the instances when the master has been upgraded to the most recent code version, along with non-master nodes, but some nodes are missing a "proper" (non-inferred) Transport version. This PR adds another check to also ensure that we have real (non-empty) system index mapping versions. To do so, it modifies NodeInfo so it carries all of CompatibilityVersions (TransportVersion + SystemIndexDescriptor.MappingVersions). This was initially done via a separate fixup listener + ad-hoc transport action, but the 2 listeners "raced" to update ClusterState on the same CompatibilityVersions structure; it just made sense to do it at the same time. The fixup is very similar to #110710, which does the same for cluster features; plus, it adds a CI test to cover the bug raised in #112694 Closes #112694
The cluster listener spots the instances when the master has been upgraded to the most recent code version, along with non-master nodes, but some nodes are missing cluster features
Fixes #109254