cli: fully and properly drain target node of decommission #141411

tbg · 2025-02-13T09:04:19Z

With this patch, at the end of decommissioning, we call the drain step as we
would for ./cockroach node drain:

[...]
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	2	true	decommissioning	false	ready	0
.....
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	1	true	decommissioning	false	ready	0
......
id	is_live	replicas	is_decommissioning	membership	is_draining	readiness	blocking_ranges
1	true	0	true	decommissioning	false	ready	0
draining node n2
node is draining... remaining: 26
node is draining... remaining: 0 (complete)
node n2 drained successfully

No more data reported on target nodes. Please verify cluster health before removing the nodes.

In particular, note that the first drain invocation returns a RemainingIndicator
of 26. Before this patch, draining had been initiated at this point but had not
fully completed.
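
The "call drain until nothing remains" behavior seen in the output above can be sketched as a small polling loop. This is only an illustration in Python, not the Go code from this patch: `drain_once` and `fake_node` are hypothetical names, `drain_once` stands in for a single drain request that reports the remaining-work indicator, and the real CLI's pacing, timeouts, and error handling are not modeled.

```python
from typing import Callable, Iterable, List


def drain_until_complete(drain_once: Callable[[], int],
                         max_attempts: int = 100) -> List[int]:
    """Call drain_once repeatedly until it reports zero remaining work.

    drain_once is a hypothetical stand-in for one drain request; it returns
    the remaining-work indicator reported by the node. The function returns
    every indicator it observed, ending in 0 on success.
    """
    observed: List[int] = []
    for _ in range(max_attempts):
        remaining = drain_once()
        observed.append(remaining)
        if remaining == 0:
            # Only a zero indicator means the drain has fully completed.
            return observed
    raise RuntimeError(f"drain did not complete after {max_attempts} attempts")


def fake_node(responses: Iterable[int]) -> Callable[[], int]:
    """Build a fake drain_once that replays canned indicator values."""
    it = iter(responses)
    return lambda: next(it)


if __name__ == "__main__":
    # Mirrors the output above: the first call reports 26, the second 0.
    print(drain_until_complete(fake_node([26, 0])))
```

Stopping after the first call would correspond to the old behavior, where a nonzero indicator was left unresolved.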

I thought for a while that this could explain #140774, i.e. that #138732 was
insufficient because it did not guarantee that the node had actually drained
fully by the time it was marked as fully decommissioned and the node
decommission command had returned. But I found that fully draining did not fix
the test, and ultimately tracked the issue down to a test infra problem. Still,
this PR is a good change that brings the drain experience in decommission on
par with the standalone CLI.

See #140774.

I verified that the modified decommission/drains roachtest passes via

./pkg/cmd/roachtest/roachstress.sh -l -c 1 decommission/drains/alive

Touches #140774.
Touches #139411.
Touches #139413.
Closes #140098 (since we no longer fail decommission on drain failure)

PR #138732 already fixed most of the drain issues, but since the
decommissioning process still went ahead and shut the node out
from the cluster, SQL connections that drain was still waiting
for would likely hit errors (since the gateway node would not
be able to connect to the rest of the cluster any more due to
having been flipped to fully decommissioned). So there's a new
release note for the improvement in this PR, which avoids that.
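
The ordering described above can be sketched as follows. This is a hedged illustration in Python rather than the actual implementation: `decommission_node`, `drain`, `mark_decommissioned`, and `log` are hypothetical names, and the sketch assumes that the drain runs to completion before the node is flipped to fully decommissioned, and that a drain failure is reported but does not fail the decommission (per the note on #140098).

```python
from typing import Callable


def decommission_node(drain: Callable[[], None],
                      mark_decommissioned: Callable[[], None],
                      log: Callable[[str], None]) -> None:
    """Drain first, then finalize the decommission.

    All three callables are hypothetical stand-ins. Draining before the
    final membership flip lets SQL connections wind down while the node
    can still reach the rest of the cluster.
    """
    try:
        drain()
    except Exception as err:
        # Assumption: a failed drain is logged, not treated as fatal.
        log(f"drain failed, continuing: {err}")
    mark_decommissioned()


if __name__ == "__main__":
    events = []
    decommission_node(lambda: events.append("drained"),
                      lambda: events.append("decommissioned"),
                      events.append)
    print(events)
```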

We should consider backporting this new set of changes to 25.1
to address flakes (and the corresponding poor UX that could occur
in production) such as #141578 (comment).

Release note (bug fix): previously, a node that was drained as part
of decommissioning may have interrupted SQL connections that were
still active during drain (and for which drain would have been
expected to wait).
Epic: None

cockroach-teamcity · 2025-02-13T09:04:29Z

This change is Reviewable

arulajmani

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)

tbg · 2025-02-20T08:36:20Z

TFTR!

bors r+

craig · 2025-02-20T09:08:05Z

Build succeeded:

blathers-crl · 2025-02-20T09:08:12Z

Based on the specified backports for this PR, I applied new labels to the following linked issue(s). Please adjust the labels as needed to match the branches actually affected by the issue(s), including adding any known older branches.

Issue #140098: branch-release-25.1.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

tbg force-pushed the drain-fix-for-reals branch 2 times, most recently from cfb1b1e to 7700799 Compare February 13, 2025 11:12

tbg mentioned this pull request Feb 13, 2025

roachtest: decommission/mixed-versions failed #140774

Closed

tbg force-pushed the drain-fix-for-reals branch from 7700799 to 5319f88 Compare February 13, 2025 11:35

tbg marked this pull request as ready for review February 17, 2025 15:50

tbg mentioned this pull request Feb 17, 2025

roachtest: decommission/nodes=4/duration=1h0m0s failed #141578

Open

tbg added the backport-25.1.x label (Flags PRs that need to be backported to 25.1) Feb 17, 2025

tbg requested a review from arulajmani February 17, 2025 15:52

This was referenced Feb 17, 2025

roachtest: decommission/mixed-versions failed #141537

Open

roachtest: decommission/randomized failed #140098

Closed

arulajmani approved these changes Feb 18, 2025

craig bot merged commit 52b1df0 into cockroachdb:master Feb 20, 2025
24 checks passed

celeste-cockroachdb bot added the target-release-25.2.0 label Feb 20, 2025

blathers-crl bot mentioned this pull request Feb 20, 2025

release-25.1: cli: fully and properly drain target node of decommission #141769

Open
