Fix decommissioned node not removed from Hosts on missed topology event #203
dkropachev wants to merge 2 commits into
How was this issue discovered? I see in the commit description that
CI failures are pre-existing and unrelated to this PR:
net8 and net9 jobs pass cleanly.
When a ControlConnection reconnects to a new node during a concurrent node decommission, the TOPOLOGY_CHANGE REMOVED_NODE event may have already been sent before the driver registers for events on the new connection. This causes the decommissioned node to remain permanently in the Hosts collection since no further event triggers a refresh.

Fix this by scheduling a debounced node list refresh after every successful reconnection, which re-queries system.peers ~1 second later and detects any topology changes missed during the reconnection window.

Fixes #202
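The race described in this commit message can be illustrated with a small standalone simulation. This is only a sketch in Python, not the driver's C# code: the `Server` and `Client` classes, their methods, and the node names are all hypothetical stand-ins for the cluster, the control connection, and the Hosts collection.

```python
class Server:
    """Toy cluster: tracks live peers and pushes events to registered listeners."""

    def __init__(self, peers):
        self.peers = set(peers)
        self.listeners = []

    def register(self, listener):
        self.listeners.append(listener)

    def decommission(self, node):
        self.peers.discard(node)
        # The event is pushed once, only to listeners registered at this moment.
        for listener in self.listeners:
            listener(("REMOVED_NODE", node))


class Client:
    """Toy driver-side view of the cluster (hypothetical, not the driver's API)."""

    def __init__(self):
        self.hosts = set()

    def query_peers(self, server):
        # Reconnection: take a snapshot of the peers table.
        self.hosts = set(server.peers)

    def register_for_events(self, server):
        # Reconnection: only events sent after this point are observed.
        server.register(self.on_event)

    def on_event(self, event):
        kind, node = event
        if kind == "REMOVED_NODE":
            self.hosts.discard(node)

    def refresh_hosts(self, server):
        # Post-reconnection refresh: re-read peers and drop hosts that vanished.
        current = set(server.peers)
        removed = self.hosts - current
        self.hosts = current
        return removed


def simulate(refresh_after_reconnect):
    server = Server({"node1", "node2", "node3"})
    client = Client()
    client.query_peers(server)          # snapshot still contains node1
    server.decommission("node1")        # event fires before registration
    client.register_for_events(server)  # too late: the event is already gone
    if refresh_after_reconnect:
        client.refresh_hosts(server)
    return client.hosts
```

In this model, `simulate(False)` leaves `node1` in the host set indefinitely, while `simulate(True)` drops it because the second read of the peers set no longer contains it.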
Add an integration test that creates multiple Cluster objects all connected to the contact point node, then decommissions that node. This forces all control connections to reconnect simultaneously, making it very likely that at least one will miss the TOPOLOGY_CHANGE event. Without the post-reconnection node list refresh, the affected clusters never detect the decommissioned node.
5314c94 to 54947b4
It is described in the issue; without special hooks in the driver it is impossible.
Summary
- Schedule a debounced node list refresh after `ControlConnection` reconnection to catch missed `TOPOLOGY_CHANGE` events
- Fix the decommissioned node remaining in `Metadata.Hosts` when the `TOPOLOGY_CHANGE REMOVED_NODE` event is sent before the reconnecting CC registers for events

Details
During `ControlConnection` reconnection the driver:
1. Queries `system.peers` (may still contain the decommissioning node)
2. Registers for events (`TOPOLOGY_CHANGE`, `STATUS_CHANGE`, etc.)

If the `TOPOLOGY_CHANGE REMOVED_NODE` event was already broadcast before step 2 completes, the driver never learns about the decommission. Since there is no periodic node list refresh, the stale host entry persists indefinitely.

The fix adds a `ScheduleHostsRefreshAsync()` call after successful reconnection, which re-queries `system.peers` ~1 second later (via the existing event debouncer) and removes any hosts no longer present.

Fixes #202
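As a rough illustration of what "debounced" means here, the following Python sketch coalesces several refresh requests made in quick succession into a single delayed call. It is not the driver's event debouncer: the `DebouncedRefresh` class, its `schedule` and `wait` methods, and the reset-on-every-call behaviour are assumptions made for the example.

```python
import threading


class DebouncedRefresh:
    """Runs `refresh` once, `delay_seconds` after the most recent schedule() call.

    Hypothetical helper for illustration; the real debouncer may behave differently.
    """

    def __init__(self, refresh, delay_seconds=1.0):
        self._refresh = refresh
        self._delay = delay_seconds
        self._lock = threading.Lock()
        self._timer = None

    def schedule(self):
        with self._lock:
            # A newer request replaces any pending one, so bursts collapse to one call.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._delay, self._refresh)
            self._timer.daemon = True
            self._timer.start()

    def wait(self):
        # Block until the most recently scheduled timer has fired or been cancelled.
        with self._lock:
            timer = self._timer
        if timer is not None:
            timer.join()
```

With a delay of about one second, several reconnections or topology events arriving close together would trigger one query of the peers table instead of one per event.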
Test plan
`TokenMap_Should_RebuildTokenMap_When_NodeIsDecommissioned` passes consistently