Fix decommissioned node not removed from Hosts on missed topology event #203

Draft
dkropachev wants to merge 2 commits into master from fix/missed-topology-change-on-reconnect

Conversation

@dkropachev
Collaborator

@dkropachev dkropachev commented Mar 11, 2026

Summary

  • Schedule a debounced node list refresh after every successful ControlConnection reconnection to catch missed TOPOLOGY_CHANGE events
  • Fixes a race condition where a decommissioned node permanently stays in Metadata.Hosts when the TOPOLOGY_CHANGE REMOVED_NODE event is sent before the reconnecting CC registers for events

Details

During ControlConnection reconnection the driver:

  1. Connects to a new node and queries system.peers (may still contain the decommissioning node)
  2. Registers for server events (TOPOLOGY_CHANGE, STATUS_CHANGE, etc.)

If the TOPOLOGY_CHANGE REMOVED_NODE event was already broadcast before step 2 completes, the driver never learns about the decommission. Since there is no periodic node list refresh, the stale host entry persists indefinitely.
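The window described above can be illustrated with a small model. This is only a sketch under stated assumptions: `RemovedNodeEvent`, `ReconnectRaceModel`, and the millisecond timestamps are hypothetical names invented for illustration and are not part of the driver. The model treats the host list as the system.peers snapshot minus whichever REMOVED_NODE events arrive after event registration completes.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model of the race; none of these types exist in the driver.
public sealed class RemovedNodeEvent
{
    public RemovedNodeEvent(long sentAtMs, string address)
    {
        SentAtMs = sentAtMs;
        Address = address;
    }

    public long SentAtMs { get; }
    public string Address { get; }
}

public static class ReconnectRaceModel
{
    // Returns the hosts the client ends up with after a reconnection:
    // the peers snapshot (step 1) minus removals it actually observed (after step 2).
    public static HashSet<string> HostsAfterReconnect(
        IEnumerable<string> peersSnapshot,
        long registeredAtMs,
        IEnumerable<RemovedNodeEvent> broadcasts)
    {
        var hosts = new HashSet<string>(peersSnapshot);
        foreach (var e in broadcasts)
        {
            // Events broadcast before registration completes never reach this connection.
            if (e.SentAtMs >= registeredAtMs)
            {
                hosts.Remove(e.Address);
            }
        }
        return hosts;
    }
}
```

In this model, an event sent at t=100 with registration finishing at t=150 leaves the decommissioned address in the returned set, which mirrors the stale entry in Metadata.Hosts.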

The fix adds a ScheduleHostsRefreshAsync() call after successful reconnection, which re-queries system.peers ~1 second later (via the existing event debouncer) and removes any hosts no longer present.
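The removal step of the refresh can be sketched roughly as follows. `HostReconciler` and `FindStaleHosts` are hypothetical names, not the driver's API, and the sketch assumes the fresh snapshot already combines the control connection's own node with the rows from system.peers. It shows only the set difference that identifies hosts to drop, not the debouncing or the query itself.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; the driver's real refresh logic is not shown in this PR page.
public static class HostReconciler
{
    // Returns the known hosts that are absent from the freshly queried snapshot.
    // Assumption: freshSnapshot includes the connected node plus all system.peers rows.
    public static List<string> FindStaleHosts(
        IEnumerable<string> knownHosts,
        IEnumerable<string> freshSnapshot)
    {
        var current = new HashSet<string>(freshSnapshot);
        return knownHosts.Where(h => !current.Contains(h)).ToList();
    }
}
```

Any address returned here would be a candidate for removal from the host collection once the delayed refresh runs.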

Fixes #202

Test plan

  • Verify TokenMap_Should_RebuildTokenMap_When_NodeIsDecommissioned passes consistently
  • Verify no regressions in other topology change tests
  • Verify the delayed refresh does not interfere with normal reconnection flow

@dkropachev dkropachev self-assigned this Mar 11, 2026
@sylwiaszunejko

How was this issue discovered? The commit description says that TokenMap_Should_RebuildTokenMap_When_NodeIsDecommissioned was impacted, but I don't see it failing or being flaky in CI or the driver matrix.
Can we have a test that fails without the fix and passes with it? Or could TokenMap_Should_RebuildTokenMap_When_NodeIsDecommissioned be modified to fail when the issue is observed?

Comment on lines +197 to +202
    catch (Exception ex)
    {
        Trace.Flush();
        Assert.Fail("Exception: " + ex + Environment.NewLine +
                    string.Join(Environment.NewLine, listener.Queue.ToArray()));
    }
@dkropachev
Collaborator Author

CI failures are pre-existing and unrelated to this PR:

net8 and net9 jobs pass cleanly.

When a ControlConnection reconnects to a new node during a concurrent
node decommission, the TOPOLOGY_CHANGE REMOVED_NODE event may have
already been sent before the driver registers for events on the new
connection. This causes the decommissioned node to remain permanently
in the Hosts collection since no further event triggers a refresh.

Fix this by scheduling a debounced node list refresh after every
successful reconnection, which re-queries system.peers ~1 second
later and detects any topology changes missed during the reconnection
window.

Fixes #202

Add an integration test that creates multiple Cluster objects all connected
to the contact point node, then decommissions that node. This forces all
control connections to reconnect simultaneously, making it very likely that
at least one will miss the TOPOLOGY_CHANGE event. Without the post-reconnection
node list refresh, the affected clusters never detect the decommissioned node.
@dkropachev dkropachev force-pushed the fix/missed-topology-change-on-reconnect branch from 5314c94 to 54947b4 Compare March 11, 2026 15:09
@dkropachev
Collaborator Author

dkropachev commented Mar 18, 2026

How was this issue discovered? The commit description says that TokenMap_Should_RebuildTokenMap_When_NodeIsDecommissioned was impacted, but I don't see it failing or being flaky in CI or the driver matrix. Can we have a test that fails without the fix and passes with it? Or could TokenMap_Should_RebuildTokenMap_When_NodeIsDecommissioned be modified to fail when the issue is observed?

It is described in the issue. Without special hooks in the driver, that is impossible.

Development

Successfully merging this pull request may close these issues.

Decommissioned node not removed from Hosts when TOPOLOGY_CHANGE event is missed during ControlConnection reconnection