33 changes: 19 additions & 14 deletions heartbeat/podman-etcd
@@ -880,7 +880,7 @@ add_member_as_learner()
  local endpoint_url=$(ip_url $(attribute_node_ip get))
  local peer_url=$(ip_url $member_ip)

- ocf_log info "add $member_name ($member_ip, $endpoint_url) to the member list as learner"
+ ocf_log info "add $member_name ($member_ip) to the member list as learner"


Should we be checking if it's already a learner before we add it as one? I see this error sometimes with the debug-start command: one of the nodes is already a learner, so add_member_as_learner fails when it tries to add it again.
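A guard along these lines would make the add idempotent. This is only a sketch: the helper name `member_already_listed` and the inline sample JSON are illustrative, not the agent's actual code; the JSON shape loosely mirrors `etcdctl member list -w json` output.

```shell
#!/bin/sh
# Hypothetical guard: skip "member add" when the peer already appears in
# the member list, so a repeated start (e.g. debug-start) does not fail.
member_already_listed() {
    # $1: member list JSON, $2: member name
    echo "$1" | grep -q "\"name\":\"$2\""
}

# Illustrative sample data standing in for `etcdctl member list -w json`
member_list_json='{"members":[{"name":"node1","isLearner":true},{"name":"node2"}]}'

if member_already_listed "$member_list_json" "node1"; then
    echo "node1 already in member list, skipping add"
else
    echo "adding node1 as learner"
fi
```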

Collaborator Author


Interesting. It shouldn't even try to add a new member if it finds it in the member list.

Collaborator Author

@clobrano clobrano Nov 11, 2025


It calls add_member_as_learner in two cases:

  • in podman_etcd, right after forcing a new cluster, so there can't be any learner yet
  • in monitor, when a peer is missing from the member list.

I never tried with debug-start and force-new-cluster


The issue in debug-start would be the first case, which means it just forced a new cluster. Could be a race condition, since I only see it in CI.

out=$(podman exec "${CONTAINER}" etcdctl --endpoints="$endpoint_url:2379" member add "$member_name" --peer-urls="$peer_url:2380" --learner)
rc=$?
if [ $rc -ne 0 ]; then
@@ -1032,7 +1032,7 @@ promote_learner_member()
  if ! ocf_run podman exec "${CONTAINER}" etcdctl member promote "$learner_member_id_hex" 2>&1; then
      # promotion is expected to fail if the peer is not yet up-to-date
      ocf_log info "could not promote member $learner_member_id_hex, error code: $?"
-     return $OCF_SUCCESS
+     return $OCF_ERR_GENERIC


Why do we treat this as an error now? I know we need to retry this later, but, as the comment says, it is OK for this to fail if we're just not ready yet.

Contributor


You should update the code to react to the rc if we return an error here.

Collaborator Author


> Why do we treat this as an error now?

I made this change to clean up standalone_node and learner_node attributes immediately after promotion. See https://github.com/clobrano/resource-agents/blob/99d36fa651ce8f3aadde818de166955f48a680d5/heartbeat/podman-etcd#L1065-L1079

The problem I wanted to address was that the attributes were not cleaned up immediately, but in the next monitor loop, meaning the member remained in a learner state for an additional 30 seconds (https://github.com/clusterlabs/resource-agents/blob/677e3add17957a59b3b96137ff3d39ca7b99b280/heartbeat/podman-etcd#L1065-L1079).

> You should update the code to react to the rc if we return an error here.

While reconcile_member_state needs to check the return code, so it can skip attribute cleanup on failure, manage_peer_membership (above in the call stack) should ignore it, because failing to promote a member shouldn't stop the agent.
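The intended rc flow can be sketched like this. The function names come from the patch, but the bodies here are stand-ins (the promote stub simulates a learner that is not yet up-to-date); only the return-code plumbing reflects the change under discussion.

```shell
#!/bin/sh
# Sketch of the rc flow: a failed promotion aborts attribute cleanup in
# reconcile_member_state, while manage_peer_membership ignores the rc so
# the agent keeps running.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

promote_learner_member() {
    # stand-in: simulate the expected failure while the peer catches up
    return "$OCF_ERR_GENERIC"
}

reconcile_member_state() {
    if ! promote_learner_member; then
        # skip attribute cleanup; retried on the next monitor cycle
        return "$OCF_ERR_GENERIC"
    fi
    echo "clearing standalone_node and learner_node"
    return "$OCF_SUCCESS"
}

manage_peer_membership() {
    # a failed promotion must not stop the agent
    reconcile_member_state || true
    return "$OCF_SUCCESS"
}

manage_peer_membership
echo "agent still running, rc=$?"
```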


Gotcha - this distinguishes between cases where we need to clean up the attribute and not. I wonder if that was the case for my debug-start (since it would run multiple times in a row). If it had already forced a new cluster, the learner attribute may have been stale from the previous run.

fi
ocf_log info "successfully promoted member '$learner_member_id_hex'"
return $OCF_SUCCESS
@@ -1063,19 +1063,19 @@ reconcile_member_state()
fi

  if [ -n "$learner_member_id" ]; then
-     promote_learner_member "$learner_member_id"
-     return $?
- fi
-
- if [ -z "$learner_member_id" ]; then
-     if ! clear_standalone_node; then
-         ocf_log error "could not clear standalone_node attribute, error code: $?"
-         return $OCF_ERR_GENERIC
-     fi
-     if ! attribute_learner_node clear; then
-         ocf_log error "could not clear learner_node attribute, error code: $?"
-         return $OCF_ERR_GENERIC
-     fi
- fi
+     if ! promote_learner_member "$learner_member_id"; then
+         return $OCF_ERR_GENERIC
+     fi
+     # promotion succeeded: continue to clear standalone_node and learner_node
+ fi
+
+ if ! clear_standalone_node; then
+     ocf_log error "could not clear standalone_node attribute, error code: $?"
+     return $OCF_ERR_GENERIC
+ fi
+ if ! attribute_learner_node clear; then
+     ocf_log error "could not clear learner_node attribute, error code: $?"
+     return $OCF_ERR_GENERIC
+ fi

return $OCF_SUCCESS
@@ -1258,6 +1258,7 @@ manage_peer_membership()
set_standalone_node
else
ocf_log debug "$name is in the members list by IP: $ip"
+ # Errors from reconcile_member_state are logged internally. Ignoring them here prevents stopping a healthy voter agent; critical local failures are caught by detect_cluster_leadership_loss.
reconcile_member_state "$member_list_json"
fi
done
@@ -1369,7 +1370,7 @@ container_health_check()
# Could not execute monitor check command and state file exists - the container failed, check recovery status in this lifecycle
local time_since_heartbeat
time_since_heartbeat=$(get_time_since_last_heartbeat)
- ocf_log err "Container ${CONTAINER} failed (last healthy: ${time_since_heartbeat}s ago)"
+ ocf_log err "Container ${CONTAINER} failed (last healthy: ${time_since_heartbeat}s ago, error code: $rc)"

# Check if peer has set force_new_cluster for recovery
local fnc_holders
@@ -1796,6 +1797,9 @@ podman_start()
fi
;;
0)
+ # No active resources: clear any stale learner_node attribute from previous failed session
+ ocf_log debug "clearing stale learner_node attribute (safe when active_resources_count=0)"
+ attribute_learner_node clear
# count how many agents are starting now
local start_resources_count
start_resources_count=$(echo "$OCF_RESKEY_CRM_meta_notify_start_resource" | wc -w)
@@ -2033,6 +2037,7 @@ podman_stop()
ocf_log err "could not delete container health check state file"
fi

+ attribute_learner_node clear
attribute_node_revision update
attribute_node_cluster_id update
