
CORE-5766 Validate target node id when collecting health report #22811

Merged

Conversation

@mmaslankaprv mmaslankaprv (Member) commented Aug 9, 2024

The health report is used to determine if a cluster node is online and
available. When a node id changes but the RPC endpoint does not change,
the requester may incorrectly assume that the node with the previous
node_id but the same endpoint is still operational. Added validation of
the node that the request was sent to before collecting the health
report. This way the sender has correct information about node
availability, as only a request targeted at the node with the correct
node id is answered with success.

Fixes: CORE-5766
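
A minimal sketch of the server-side idea, using plain C++ stand-ins instead of Redpanda's actual types (all names below except `target_node_id` are illustrative assumptions, not the PR's identifiers): the handler only answers with a health report when the request was addressed to its current node id, while requests that carry no target at all, e.g. from older peers, are still accepted.

```cpp
#include <cstdint>
#include <optional>

// Illustrative stand-ins for Redpanda's node id and cluster error codes.
using node_id_t = int32_t;
enum class errc { success, invalid_target_node_id };

// Server side: refuse to build a health report for a request that was
// addressed to a different node id than the one this node currently has.
errc validate_request_target(
  std::optional<node_id_t> target_node_id, node_id_t self_id) {
    if (target_node_id.has_value() && *target_node_id != self_id) {
        return errc::invalid_target_node_id;
    }
    return errc::success;
}
```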

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.2.x
  • v24.1.x
  • v23.3.x

Release Notes

Bug Fixes

  • fixes a problem where a node could be reported as alive after its node_id has changed

Introduced an error code indicating that the node the request was sent
to is not the one that received it.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added validation that checks whether the node replying to the request is
the one the request was sent to. The validation is important because the
receiving node's id might have changed while the RPC endpoint address
stayed the same.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
@bashtanov (Contributor)

Is this problem specific to health reports, or can we hit the same problem with any inter-node RPCs? I'm just thinking maybe the check should be included into RPC mechanism, and an RPC call is not even to be processed by the remote node if it is not the original addressee?

@mmaslankaprv (Member, Author)

> Is this problem specific to health reports, or can we hit the same problem with any inter-node RPCs? I'm just thinking maybe the check should be included into RPC mechanism, and an RPC call is not even to be processed by the remote node if it is not the original addressee?

You are right, this is a problem with the RPC mechanism in general. We were thinking about adding a handshake to perform the validation, and that should definitely be scheduled as future work. I didn't do it right now because the solution would be much more complex and not easy to backport. The problem with the health report is real, and this PR solves it in isolation.

src/v/cluster/health_monitor_types.h (review thread, outdated and resolved)
if (!reply) {
    return {reply.error()};
}
if (!reply.value().report.has_value()) {
    return {reply.value().error};
}
// Reject a report whose node id does not match the node the request targeted.
if (reply.value().report->id != target_node_id) {
@bashtanov (Contributor)
What's the point of rechecking it here, if we have already checked it on the server side in service::do_collect_node_health_report? I guess it only makes sense for some corner cases when nodes run different versions of Redpanda?

@mmaslankaprv (Member, Author)
exactly

src/v/cluster/node_status_backend.cc (review thread, resolved)
@vbotbuildovich (Collaborator) commented Aug 9, 2024

Added a field indicating which node the request was targeted at. If
present, the `target_node_id` will be validated when processing the
request.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
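
A rough illustration of the commit above (struct and field names other than `target_node_id` are assumptions, and the real request type goes through Redpanda's RPC serialization rather than being a plain struct): making the field optional keeps backward compatibility, because requests from older nodes simply omit it and skip the validation.

```cpp
#include <cstdint>
#include <optional>

using node_id_t = int32_t;

// Hypothetical shape of the health report request after this change.
struct get_node_health_request {
    // Absent when the request comes from an older node; in that case the
    // receiving node skips the validation, so mixed-version clusters work.
    std::optional<node_id_t> target_node_id;
};
```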
The health report is used to determine if a cluster node is online and
available. When a node id changes but the RPC endpoint does not change,
the requester may incorrectly assume that the node with the previous
node_id but the same endpoint is still operational. Added validation of
the node that the request was sent to before collecting the health
report. This way the sender has correct information about node
availability, as only a request targeted at the node with the correct
node id is answered with success.

Fixes: CORE-5766

Signed-off-by: Michał Maślanka <michal@redpanda.com>
The node folder deletion test checks if a node joins the cluster with a
new node id after its data folder was deleted. Introduced a new
validation checking that, in this case, the node with the old node_id
is reported as offline.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added validation of the node_id in the reply received from the node. The
report is not considered valid if the reply's node id doesn't match the
id of the node the request was sent to.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
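
A corresponding client-side sketch (again with illustrative names; compare the excerpt from the review discussion above): the sender treats the reply as proof that the targeted node is alive only when the returned report carries the expected node id.

```cpp
#include <cstdint>
#include <optional>

using node_id_t = int32_t;

struct node_health_report {
    node_id_t id; // id of the node that produced the report
};

// Client side: a reply only proves liveness of the node we asked about
// when the report's node id matches the id the request was targeted at.
bool report_matches_target(
  const std::optional<node_health_report>& report, node_id_t target_node_id) {
    return report.has_value() && report->id == target_node_id;
}
```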
@mmaslankaprv mmaslankaprv force-pushed the CORE-5766-health-report-node-id branch from 81313db to 08de93d on August 9, 2024 14:43
@bashtanov bashtanov (Contributor) left a comment

Since you know you actually need it to work with only some nodes upgraded, I'm approving it. But I'm still curious why, in real life, someone would upgrade only some nodes for a long time, or, if the cluster is not going to stay mixed for long, why anyone would care whether the bug is fixed when the first node is upgraded or when the last one is.

@piyushredpanda piyushredpanda merged commit 59e4e6e into redpanda-data:dev Aug 10, 2024
20 checks passed
@piyushredpanda piyushredpanda removed this from the v24.2.x-next milestone Aug 10, 2024
@vbotbuildovich (Collaborator)

/backport v24.2.x

@vbotbuildovich (Collaborator)

/backport v24.1.x

@vbotbuildovich (Collaborator)

/backport v23.3.x

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22811-v23.3.x-187 remotes/upstream/v23.3.x
git cherry-pick -x 221a0b76e7470e0b4b50d492e5b414f95b48906d c514c9e8d7c8e2eebdc338c36832c4f9ed5464c0 7886aec1c34c34af7664dcd26799032cce65aa4a 90eafa83727cdd2f8ab3bdd7db4c65b2c3c50cbc 6a8f39079b676d4af6ffc808419213ec51b0bcda 08de93db183bf9f596e19073b906a1afae602af0

Workflow run logs.

@vbotbuildovich (Collaborator)

Failed to create a backport PR to v24.1.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-22811-v24.1.x-428 remotes/upstream/v24.1.x
git cherry-pick -x 221a0b76e7470e0b4b50d492e5b414f95b48906d c514c9e8d7c8e2eebdc338c36832c4f9ed5464c0 7886aec1c34c34af7664dcd26799032cce65aa4a 90eafa83727cdd2f8ab3bdd7db4c65b2c3c50cbc 6a8f39079b676d4af6ffc808419213ec51b0bcda 08de93db183bf9f596e19073b906a1afae602af0

Workflow run logs.
