
Conversation

@clobrano
Collaborator

When stopping an etcd instance, the agent should not leave the member list if it's the last active agent in the cluster. Leaving the member list in this scenario can cause WAL corruption.

This change introduces a check for the number of active resources before attempting to leave the member list. If no other active resources are found, the agent will log a message and skip the leave operation.

NOTE: the check on standalone_node might not be enough if both agents stop at roughly the same time, in which case neither of them has time to set the attribute.
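For illustration, a minimal sketch of the intended guard. The helper name and the way active instances are counted are assumptions, not the agent's actual code:

```sh
# Hypothetical sketch of the guard described above; function and
# variable names are illustrative, not the agent's actual ones.
maybe_leave_member_list()
{
    local active_count

    # Count the nodes where this resource is currently active, e.g. by
    # parsing the output of `crm_resource --locate`.
    active_count=$(crm_resource --locate --resource "$OCF_RESOURCE_INSTANCE" 2>/dev/null | wc -l)

    if [ "$active_count" -le 1 ]; then
        ocf_log info "Last active member: skipping etcd member removal to avoid WAL corruption"
        return $OCF_SUCCESS
    fi

    etcdctl member remove "$member_id"
}
```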

Fixes: OCPBUGS-60098

@knet-jenkins

knet-jenkins bot commented Nov 16, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/1/input

@clobrano clobrano marked this pull request as draft November 16, 2025 18:55
@fonta-rh
Contributor

Would it make sense to add a random 0-1 second delay to account for the case of the two agents leaving at the same time?

@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch from 9b7568f to 06065c7 Compare November 17, 2025 10:32
@knet-jenkins

knet-jenkins bot commented Nov 17, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/2/input

@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch from 06065c7 to 2e890df Compare November 17, 2025 14:55
@knet-jenkins

knet-jenkins bot commented Nov 17, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/3/input

@clobrano
Collaborator Author

Would it make sense to add a random 0-1 second delay to account for the case of the two agents leaving at the same time?

That would indeed be a different way to fix the same problem.

Option A (Delayed Member Removal): During a simultaneous graceful shutdown, the nodes would introduce a delay so that exactly one node removes itself from the etcd cluster, which bumps the revision on the remaining etcd member. On restart, Pacemaker ensures both nodes are online before starting the agents (just like Option B). Since one of the agents will have the higher revision, this option effectively forces the next etcd cycle to create a new cluster.

Option B (No Member Removal): Both nodes would gracefully stop their etcd processes without explicitly leaving the cluster membership. On restart, Pacemaker ensures both nodes are online before starting the agents, just like Option A; however, we can't predict which agent will have the higher revision, or whether the revisions will be equal.

I think both options are good; I only have a slight preference for Option B (the current one).

Option A slows down the stop procedure, requires additional logic to decide the delay, and, since it depends on the network connection, might still fail occasionally.

Option B looks simpler, but it puts more stress on the "restart normally" branch (taken when the revisions are the same), which is the hardest to test right now and therefore, I believe, the least tested.
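For context, one way to make Option A's delay deterministic instead of random is to order the nodes by name and let only the lexically second one wait. This is only a sketch; how the agent learns the peer's name is an assumption:

```sh
# Illustrative only: a deterministic tie-break so that exactly one node
# delays its member removal. How the peer name is obtained is assumed.
local_node=$(crm_node --name)
peer_node="$OCF_RESKEY_peer_node"   # hypothetical parameter

if [ "$local_node" \> "$peer_node" ]; then
    # This node sorts second: give the peer time to remove itself first.
    sleep 10
fi
```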

@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch 4 times, most recently from 5a8a9a0 to ce5eff1 Compare November 19, 2025 09:57
@knet-jenkins

knet-jenkins bot commented Nov 19, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/7/input

@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch from ce5eff1 to f049fb8 Compare November 19, 2025 16:40
@knet-jenkins

knet-jenkins bot commented Nov 19, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/8/input

@clobrano clobrano changed the title from "OCPBUGS-60098: podman-etcd: avoid leaving member list on last active agent" to "OCPBUGS-60098: podman-etcd: prevent last active member from leaving the etcd member list" Nov 19, 2025
@clobrano clobrano marked this pull request as ready for review November 19, 2025 16:41
@clobrano clobrano requested a review from oalbrigt November 21, 2025 11:16
…he etcd member list

When stopping etcd instances, simultaneous member removal from both
nodes can corrupt the etcd Write-Ahead Log (WAL). This change implements
a two-part solution:

1. Concurrent stop protection: When multiple nodes are stopping, the
   alphabetically second node delays its member removal by 10
   seconds. This prevents simultaneous member list updates that can
   corrupt the WAL.

2. Last member detection: Checks active resource count after any
   delay. If this is the last active member, skips member removal to
   avoid leaving an empty cluster.

Additionally, reorders podman_stop() to clear the member_id attribute
after leaving the member list, ensuring the attribute reflects actual
cluster state during shutdown.
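A rough sketch of how the reordered stop path could fit these pieces together. Helper names such as other_node_is_stopping and last_active_member are hypothetical; only the general flow follows the commit message above:

```sh
# Illustrative flow only; helpers and variable names are hypothetical.
podman_stop_member_handling()
{
    # 1. Concurrent stop protection: if another node is stopping too,
    #    only the alphabetically second node waits before touching the list.
    if other_node_is_stopping && is_alphabetically_second_node; then
        sleep 10
    fi

    # 2. Last member detection, re-checked after any delay.
    if last_active_member; then
        ocf_log info "Last active member: not leaving the etcd member list"
    else
        etcdctl member remove "$member_id"
    fi

    # 3. Clear the attribute only after the member list was updated, so
    #    it reflects the actual cluster state during shutdown.
    crm_attribute --lifetime reboot --name member_id --delete
}
```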
@clobrano clobrano force-pushed the fix/avoid-wal-corruption-at-podman-stop branch from f049fb8 to 578e6d9 Compare November 21, 2025 14:24
@knet-jenkins

knet-jenkins bot commented Nov 21, 2025

Can one of the project admins check and authorise this run please: https://ci.kronosnet.org/job/resource-agents/job/resource-agents-pipeline/job/PR-2100/9/input
