OCPEDGE-2213: podman-etcd: fix to prevent learner from starting before cluster is ready #2098
```diff
@@ -880,7 +880,7 @@ add_member_as_learner()
     local endpoint_url=$(ip_url $(attribute_node_ip get))
     local peer_url=$(ip_url $member_ip)

-    ocf_log info "add $member_name ($member_ip, $endpoint_url) to the member list as learner"
+    ocf_log info "add $member_name ($member_ip) to the member list as learner"
     out=$(podman exec "${CONTAINER}" etcdctl --endpoints="$endpoint_url:2379" member add "$member_name" --peer-urls="$peer_url:2380" --learner)
     rc=$?
     if [ $rc -ne 0 ]; then
```
|
```diff
@@ -1032,7 +1032,7 @@ promote_learner_member()
     if ! ocf_run podman exec "${CONTAINER}" etcdctl member promote "$learner_member_id_hex" 2>&1; then
         # promotion is expected to fail if the peer is not yet up-to-date
         ocf_log info "could not promote member $learner_member_id_hex, error code: $?"
-        return $OCF_SUCCESS
+        return $OCF_ERR_GENERIC
```
**Reviewer:** Why do we treat this as an error now? I know we need to retry this later, but, as the comment says, it is OK for this to fail if we're just not ready yet.

**Contributor:** You should update the code to react to the rc if we return an error here.

**Author:** I made this change to clean up …. The problem I wanted to address was that the attributes were not cleaned up immediately, but only in the next monitor loop, meaning the member remained in a learner state for an additional 30 seconds (https://github.com/clusterlabs/resource-agents/blob/677e3add17957a59b3b96137ff3d39ca7b99b280/heartbeat/podman-etcd#L1065-L1079). While we need …

**Reviewer:** Gotcha - this distinguishes between cases where we need to clean up the attribute and not. I wonder if that was the case for my debug-start (since it would run multiple times in a row). If it had already forced a new cluster, the learner attribute may have been stale from the previous run.
```diff
     fi
     ocf_log info "successfully promoted member '$learner_member_id_hex'"
     return $OCF_SUCCESS
```
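The retry pattern discussed above — a failed promotion now propagates an error so the attributes stay set and the next monitor call retries, while a successful promotion clears them immediately — can be sketched with stubs. The helper names follow the diff; the stub bodies, the `CAUGHT_UP` flag, and the member id are hypothetical:

```shell
#!/bin/sh
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# Stubs standing in for the agent's helpers; bodies are hypothetical.
ocf_log() { shift; echo "$@"; }
clear_standalone_node() { echo "standalone_node cleared"; }
attribute_learner_node() { echo "learner_node $1"; }

# Stub promotion: fails until the learner has caught up (simulated via CAUGHT_UP).
promote_learner_member() {
    if [ "${CAUGHT_UP:-0}" -ne 1 ]; then
        ocf_log info "could not promote member $1"
        return "$OCF_ERR_GENERIC"   # new behavior: report the failure upward
    fi
    ocf_log info "successfully promoted member '$1'"
    return "$OCF_SUCCESS"
}

reconcile_member_state() {
    if [ -n "$learner_member_id" ]; then
        promote_learner_member "$learner_member_id" || return "$OCF_ERR_GENERIC"
    fi
    # Reached only after a successful promotion (or with no learner pending):
    # attributes are cleared immediately instead of in the next monitor loop.
    clear_standalone_node
    attribute_learner_node clear
    return "$OCF_SUCCESS"
}

learner_member_id="8e9e05c52164694d"   # hypothetical member id
CAUGHT_UP=0
reconcile_member_state || echo "monitor will retry promotion next interval"
CAUGHT_UP=1
reconcile_member_state
```

With the old `return $OCF_SUCCESS`, the caller could not tell the two outcomes apart and always fell through to the attribute-clearing step.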
|
|
```diff
@@ -1063,19 +1063,19 @@ reconcile_member_state()
     fi

     if [ -n "$learner_member_id" ]; then
-        promote_learner_member "$learner_member_id"
-        return $?
+        if ! promote_learner_member "$learner_member_id"; then
+            return $OCF_ERR_GENERIC
+        fi
+        # promotion succeeded: continue to clear standalone_node and learner_node
     fi

-    if [ -z "$learner_member_id" ]; then
-        if ! clear_standalone_node; then
-            ocf_log error "could not clear standalone_node attribute, error code: $?"
-            return $OCF_ERR_GENERIC
-        fi
-        if ! attribute_learner_node clear; then
-            ocf_log error "could not clear learner_node attribute, error code: $?"
-            return $OCF_ERR_GENERIC
-        fi
+    if ! clear_standalone_node; then
+        ocf_log error "could not clear standalone_node attribute, error code: $?"
+        return $OCF_ERR_GENERIC
+    fi
+    if ! attribute_learner_node clear; then
+        ocf_log error "could not clear learner_node attribute, error code: $?"
+        return $OCF_ERR_GENERIC
     fi

     return $OCF_SUCCESS
```
|
|
```diff
@@ -1258,6 +1258,7 @@ manage_peer_membership()
             set_standalone_node
         else
             ocf_log debug "$name is in the members list by IP: $ip"
+            # Errors from reconcile_member_state are logged internally. Ignoring them here prevents stopping a healthy voter agent; critical local failures are caught by detect_cluster_leadership_loss.
             reconcile_member_state "$member_list_json"
         fi
     done
```
|
|
```diff
@@ -1369,7 +1370,7 @@ container_health_check()
         # Could not execute monitor check command and state file exists - the container failed, check recovery status in this lifecycle
         local time_since_heartbeat
         time_since_heartbeat=$(get_time_since_last_heartbeat)
-        ocf_log err "Container ${CONTAINER} failed (last healthy: ${time_since_heartbeat}s ago)"
+        ocf_log err "Container ${CONTAINER} failed (last healthy: ${time_since_heartbeat}s ago, error code: $rc)"

         # Check if peer has set force_new_cluster for recovery
         local fnc_holders
```
|
```diff
@@ -1796,6 +1797,9 @@ podman_start()
             fi
             ;;
         0)
+            # No active resources: clear any stale learner_node attribute from previous failed session
+            ocf_log debug "clearing stale learner_node attribute (safe when active_resources_count=0)"
+            attribute_learner_node clear
             # count how many agents are starting now
             local start_resources_count
             start_resources_count=$(echo "$OCF_RESKEY_CRM_meta_notify_start_resource" | wc -w)
```
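The `0)` branch relies on `OCF_RESKEY_CRM_meta_notify_start_resource` holding a space-separated list of the instances Pacemaker is starting; the counting step can be demonstrated standalone (the instance names below are made up):

```shell
#!/bin/sh
# Hypothetical value: Pacemaker passes the set of instances currently
# starting as a space-separated list in this notify variable.
OCF_RESKEY_CRM_meta_notify_start_resource="etcd:0 etcd:1 etcd:2"

# Same counting technique as the diff: word-split the list and count the words.
start_resources_count=$(echo "$OCF_RESKEY_CRM_meta_notify_start_resource" | wc -w)

# Arithmetic expansion trims any whitespace padding some wc implementations emit.
echo "start_resources_count=$((start_resources_count))"
```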
|
|
```diff
@@ -2033,6 +2037,7 @@ podman_stop()
         ocf_log err "could not delete container health check state file"
     fi

+    attribute_learner_node clear
     attribute_node_revision update
     attribute_node_cluster_id update
```
**Reviewer:** Should we be checking if it's already a learner before we add it as one? I sometimes see this error with the debug-start command, where one of the nodes is a learner, and thus add_member_as_learner fails when it is added again.
**Author:** Interesting. It shouldn't even try to add a new member if it finds it in the member list.
**Author:** It calls `add_member_as_learner` in two cases:

1. `podman_etcd`, when it just forced a new cluster, hence there can't be any learner
2. `monitor`, when there is no peer in the member list.

I never tried with `debug-start` and `force-new-cluster`.

**Reviewer:** The issue in `debug-start` would be the first case, which means it just forced a new cluster. Could be a race condition, since I only see it in CI.
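A guard along the lines the reviewers discuss could consult the member list before calling `member add`. The sketch below is hypothetical — `member_in_list` is not part of the agent — and matches names against the compact JSON shape that `etcdctl member list -w json` emits, sampled here as a string so no running etcd is needed:

```shell
#!/bin/sh
# Hypothetical guard: succeed (exit 0) if member name $1 already appears
# in the member-list JSON read from stdin.
member_in_list() {
    grep -q "\"name\":\"$1\""
}

# Abbreviated sample of the JSON shape etcdctl emits with -w json.
members_json='{"members":[{"ID":1,"name":"node-a","isLearner":false},{"ID":2,"name":"node-b","isLearner":true}]}'

if echo "$members_json" | member_in_list "node-b"; then
    echo "node-b already in member list: skip member add"
else
    echo "adding node-b as learner"
fi
```

In the agent, the input would come from `podman exec "${CONTAINER}" etcdctl member list -w json` instead of a literal, which would make the add idempotent across repeated `debug-start` runs.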