
Conversation

@clobrano
Collaborator

When stopping an etcd instance, the agent should not leave the member list if it's the last active agent in the cluster. Leaving the member list in this scenario can cause WAL corruption.

This change introduces a check for the number of active resources before attempting to leave the member list. If no other active resources are found, the agent will log a message and skip the leave operation.

NOTE: the check on standalone_node might not be enough if both agents stop at roughly the same time, in which case neither of them has time to set the attribute.
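For illustration, a minimal sketch of the intended guard. The helper name and the way active instances are counted are assumptions, not the agent's actual code:

```sh
# Hypothetical sketch of the guard described above; function and
# variable names are illustrative, not the agent's actual ones.
maybe_leave_member_list()
{
    local active_count

    # Count the nodes where this resource is currently active, e.g. by
    # parsing the output of `crm_resource --locate`.
    active_count=$(crm_resource --locate --resource "$OCF_RESOURCE_INSTANCE" 2>/dev/null | wc -l)

    if [ "$active_count" -le 1 ]; then
        ocf_log info "Last active member: skipping etcd member removal to avoid WAL corruption"
        return $OCF_SUCCESS
    fi

    etcdctl member remove "$member_id"
}
```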

Fixes: OCPBUGS-60098

@knet-jenkins

knet-jenkins bot commented Nov 16, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/1/input

@clobrano clobrano marked this pull request as draft November 16, 2025 18:55
@fonta-rh
Contributor

Would it make sense to add a random 0-1 second delay to account for the case of the two agents leaving at the same time?

@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch from 9b7568f to 06065c7 Compare November 17, 2025 10:32
@knet-jenkins

knet-jenkins bot commented Nov 17, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/2/input

@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch from 06065c7 to 2e890df Compare November 17, 2025 14:55
@knet-jenkins

knet-jenkins bot commented Nov 17, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/3/input

@clobrano
Collaborator Author

Would it make sense to add a random 0-1 second delay to account for the case of the two agents leaving at the same time?

That would indeed be a different way to fix the same problem.

Option A (Delayed Member Removal): During a simultaneous graceful shutdown, the nodes would introduce a delay so that exactly one node removes itself from the etcd cluster, which bumps the revision on the remaining etcd member. On restart, Pacemaker ensures both nodes are online before starting the agents (just like Option B). Since one of the agents will have the higher revision, this option effectively forces the next etcd cycle to create a new cluster.

Option B (No Member Removal): Both nodes would gracefully stop their etcd processes without explicitly leaving the cluster membership. On restart, Pacemaker ensures both nodes are online before starting the agents, just like Option A; however, we can't predict which agent will have the higher revision, or whether the revisions will be equal.

I think both options are good; I only have a slight preference for Option B (the current one).

Option A slows down the stop procedure, requires additional logic to decide the delay, and, since it depends on the network connection, might still fail occasionally.

Option B looks simpler, but it puts more stress on the "restart normally" branch (taken when the revisions are the same), which is the hardest to test right now and therefore, I believe, the least tested.
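For context, one way to make Option A's delay deterministic instead of random is to order the nodes by name and let only the lexically second one wait. This is only a sketch; how the agent learns the peer's name is an assumption:

```sh
# Illustrative only: a deterministic tie-break so that exactly one node
# delays its member removal. How the peer name is obtained is assumed.
local_node=$(crm_node --name)
peer_node="$OCF_RESKEY_peer_node"   # hypothetical parameter

if [ "$local_node" \> "$peer_node" ]; then
    # This node sorts second: give the peer time to remove itself first.
    sleep 10
fi
```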

@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch 4 times, most recently from 5a8a9a0 to ce5eff1 Compare November 19, 2025 09:57
@knet-jenkins

knet-jenkins bot commented Nov 19, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/7/input

@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch from ce5eff1 to f049fb8 Compare November 19, 2025 16:40
@knet-jenkins

knet-jenkins bot commented Nov 19, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/8/input

@clobrano clobrano changed the title from "OCPBUGS-60098: podman-etcd: avoid leaving member list on last active agent" to "OCPBUGS-60098: podman-etcd: prevent last active member from leaving the etcd member list" Nov 19, 2025
@clobrano clobrano marked this pull request as ready for review November 19, 2025 16:41
@clobrano clobrano requested a review from oalbrigt November 21, 2025 11:16
…he etcd member list

When stopping etcd instances, simultaneous member removal from both
nodes can corrupt the etcd Write-Ahead Log (WAL). This change implements
a two-part solution:

1. Concurrent stop protection: When multiple nodes are stopping, the
   alphabetically second node delays its member removal by 10
   seconds. This prevents simultaneous member list updates that can
   corrupt the WAL.

2. Last member detection: Checks active resource count after any
   delay. If this is the last active member, skips member removal to
   avoid leaving an empty cluster.

Additionally, reorders podman_stop() to clear the member_id attribute
after leaving the member list, ensuring the attribute reflects actual
cluster state during shutdown.
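A rough sketch of how the reordered stop path could fit these pieces together. Helper names such as other_node_is_stopping and last_active_member are hypothetical; only the general flow follows the commit message above:

```sh
# Illustrative flow only; helpers and variable names are hypothetical.
podman_stop_member_handling()
{
    # 1. Concurrent stop protection: if another node is stopping too,
    #    only the alphabetically second node waits before touching the list.
    if other_node_is_stopping && is_alphabetically_second_node; then
        sleep 10
    fi

    # 2. Last member detection, re-checked after any delay.
    if last_active_member; then
        ocf_log info "Last active member: not leaving the etcd member list"
    else
        etcdctl member remove "$member_id"
    fi

    # 3. Clear the attribute only after the member list was updated, so
    #    it reflects the actual cluster state during shutdown.
    crm_attribute --lifetime reboot --name member_id --delete
}
```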
@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch from f049fb8 to 578e6d9 Compare November 21, 2025 14:24
@knet-jenkins

knet-jenkins bot commented Nov 21, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/9/input
