
CORE-5766 Validate target node id when collecting health report #22811

Merged

Conversation

@mmaslankaprv mmaslankaprv (Member) commented Aug 9, 2024

The health report is used to determine if a cluster node is online and
available. When a node id changes but the RPC endpoint does not change,
the requester may incorrectly assume that the node with the previous
node_id but the same endpoint is still operational. Added validation of
the node that the request was sent to before collecting the health
report. This way the sender has correct information about node
availability, as only a request targeted at the node with the correct
node id is answered with success.

Fixes: CORE-5766
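
A minimal sketch of the server-side idea, using plain C++ stand-ins instead of Redpanda's actual types (all names below except `target_node_id` are illustrative assumptions, not the PR's identifiers): the handler only answers with a health report when the request was addressed to its current node id, while requests that carry no target at all, e.g. from older peers, are still accepted.

```cpp
#include <cstdint>
#include <optional>

// Illustrative stand-ins for Redpanda's node id and cluster error codes.
using node_id_t = int32_t;
enum class errc { success, invalid_target_node_id };

// Server side: refuse to build a health report for a request that was
// addressed to a different node id than the one this node currently has.
errc validate_request_target(
  std::optional<node_id_t> target_node_id, node_id_t self_id) {
    if (target_node_id.has_value() && *target_node_id != self_id) {
        return errc::invalid_target_node_id;
    }
    return errc::success;
}
```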

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Bug Fixes

  • fixes a problem where a node could be reported as alive after its node_id has changed

Introduced an error code indicating that the node the request was sent
to is not the one that received it.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added validation that checks whether the node replying to the request is
the one the request was sent to. The validation is important because the
receiving node's id might have changed while the RPC endpoint address
stayed the same.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@bashtanov (Contributor)

Is this problem specific to health reports, or can we hit the same problem with any inter-node RPCs? I'm just thinking maybe the check should be included into RPC mechanism, and an RPC call is not even to be processed by the remote node if it is not the original addressee?

@mmaslankaprv (Member, Author)

> Is this problem specific to health reports, or can we hit the same problem with any inter-node RPCs? I'm just thinking maybe the check should be included into RPC mechanism, and an RPC call is not even to be processed by the remote node if it is not the original addressee?

You are right, this is a problem with the RPC mechanism in general. We were thinking about adding a handshake to perform the validation, and that should definitely be scheduled as future work. I didn't do it right now because the solution would be much more complex and not easy to backport. The problem with the health report is real, and this PR solves it in isolation.

src/v/cluster/health_monitor_types.h (review thread, outdated and resolved)
if (!reply) {
    return {reply.error()};
}
if (!reply.value().report.has_value()) {
    return {reply.value().error};
}
// Reject a report whose node id does not match the node the request targeted.
if (reply.value().report->id != target_node_id) {
@bashtanov (Contributor)
What's the point of rechecking it here, if we have already checked it on the server side in service::do_collect_node_health_report? I guess it only makes sense for some corner cases when nodes run different versions of Redpanda?

@mmaslankaprv (Member, Author)
exactly

src/v/cluster/node_status_backend.cc (review thread, resolved)
@vbotbuildovich (Collaborator) commented Aug 9, 2024

Added a field indicating which node the request was targeted at. If
present, the `target_node_id` will be validated when processing the
request.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
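
A rough illustration of the commit above (struct and field names other than `target_node_id` are assumptions, and the real request type goes through Redpanda's RPC serialization rather than being a plain struct): making the field optional keeps backward compatibility, because requests from older nodes simply omit it and skip the validation.

```cpp
#include <cstdint>
#include <optional>

using node_id_t = int32_t;

// Hypothetical shape of the health report request after this change.
struct get_node_health_request {
    // Absent when the request comes from an older node; in that case the
    // receiving node skips the validation, so mixed-version clusters work.
    std::optional<node_id_t> target_node_id;
};
```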
The health report is used to determine if a cluster node is online and
available. When a node id changes but the RPC endpoint does not change,
the requester may incorrectly assume that the node with the previous
node_id but the same endpoint is still operational. Added validation of
the node that the request was sent to before collecting the health
report. This way the sender has correct information about node
availability, as only a request targeted at the node with the correct
node id is answered with success.

Fixes: CORE-5766

Signed-off-by: Michał Maślanka <michal@redpanda.com>
The node folder deletion test checks if a node joins the cluster with a
new node id after its data folder was deleted. Introduced a new
validation checking that, in this case, the node with the old node_id
is reported as offline.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added validation of the node_id in the reply received from the node. The
report is not considered valid if the reply's node id doesn't match the
id of the node the request was sent to.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
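
A corresponding client-side sketch (again with illustrative names; compare the excerpt from the review discussion above): the sender treats the reply as proof that the targeted node is alive only when the returned report carries the expected node id.

```cpp
#include <cstdint>
#include <optional>

using node_id_t = int32_t;

struct node_health_report {
    node_id_t id; // id of the node that produced the report
};

// Client side: a reply only proves liveness of the node we asked about
// when the report's node id matches the id the request was targeted at.
bool report_matches_target(
  const std::optional<node_health_report>& report, node_id_t target_node_id) {
    return report.has_value() && report->id == target_node_id;
}
```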
@mmaslankaprv mmaslankaprv force-pushed the CORE-5766-health-report-node-id branch from 81313db to 08de93d on August 9, 2024 14:43
@bashtanov bashtanov (Contributor) left a comment

Since you know you actually need it to work with only some nodes upgraded, I'm approving it. But I'm still curious why, in real life, someone would upgrade only some nodes for a long time, or, if the cluster is not going to stay mixed for long, why anyone would care whether the bug is fixed when the first node is upgraded or when the last one is.

@piyushredpanda piyushredpanda merged commit 59e4e6e into redpanda-data:dev Aug 10, 2024
20 checks passed
@piyushredpanda piyushredpanda removed this from the v24.2.x-next milestone Aug 10, 2024
@vbotbuildovich (Collaborator)

/backport v24.2.x

@vbotbuildovich (Collaborator)

/backport v24.1.x

@vbotbuildovich (Collaborator)

/backport v23.3.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22811-v23.3.x-187 remotes/upstream/v23.3.x
git cherry-pick -x 221a0b76e7470e0b4b50d492e5b414f95b48906d c514c9e8d7c8e2eebdc338c36832c4f9ed5464c0 7886aec1c34c34af7664dcd26799032cce65aa4a 90eafa83727cdd2f8ab3bdd7db4c65b2c3c50cbc 6a8f39079b676d4af6ffc808419213ec51b0bcda 08de93db183bf9f596e19073b906a1afae602af0

Workflow run logs.

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22811-v24.1.x-428 remotes/upstream/v24.1.x
git cherry-pick -x 221a0b76e7470e0b4b50d492e5b414f95b48906d c514c9e8d7c8e2eebdc338c36832c4f9ed5464c0 7886aec1c34c34af7664dcd26799032cce65aa4a 90eafa83727cdd2f8ab3bdd7db4c65b2c3c50cbc 6a8f39079b676d4af6ffc808419213ec51b0bcda 08de93db183bf9f596e19073b906a1afae602af0

Workflow run logs.
