fix(failover): prevent double failover in case of lost connectivity #5788

Merged
fcanovai merged 2 commits into cloudnative-pg:main from leonardoce:api-split on Oct 15, 2024

@leonardoce leonardoce commented Oct 10, 2024

This patch ensures the operator does not trigger two failovers when a primary Pod loses connectivity and fails to recognize its role change from primary to replica.

Previously, the first failover occurred when the operator detected that the primary Pod was no longer ready or present. A second failover could be triggered if the old primary Pod recovered before the Kubelet timeout, with the operator potentially promoting it to primary again based on the Pod list.

With this patch, the operator will wait for the recovered Pod to acknowledge its new role before taking further action, preventing unnecessary failovers.
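
As a rough illustration of the behaviour described above, here is a minimal sketch in Go. It is not the operator's actual code: `InstanceStatus`, `Action`, and `decide` are hypothetical names introduced only for this example, and the real reconciliation logic considers far more state. The sketch assumes each instance reports the role it believes it has, and that the operator records which Pod is the current primary.

```go
package main

import "fmt"

// InstanceStatus is a hypothetical, simplified view of what an instance
// reports about itself. It is not the operator's real type.
type InstanceStatus struct {
	PodName   string
	IsPrimary bool // the role the instance believes it has
	IsReady   bool
}

// Action is the hypothetical decision taken in one reconciliation pass.
type Action string

const (
	ActionNone     Action = "none"
	ActionWait     Action = "wait-for-role-acknowledgement"
	ActionFailover Action = "failover"
)

// decide returns the action for one pass, given the primary recorded by the
// operator (currentPrimary) and the statuses reported by the instances.
func decide(currentPrimary string, statuses []InstanceStatus) Action {
	primaryHealthy := false
	for _, s := range statuses {
		// A Pod other than the recorded primary still claims to be primary:
		// it has not noticed its demotion yet, so hold off on any decision.
		if s.IsPrimary && s.PodName != currentPrimary {
			return ActionWait
		}
		if s.PodName == currentPrimary && s.IsReady {
			primaryHealthy = true
		}
	}
	if primaryHealthy {
		return ActionNone
	}
	return ActionFailover
}

func main() {
	// pg-1 lost connectivity, pg-2 was promoted, and pg-1 came back
	// still believing it is the primary.
	statuses := []InstanceStatus{
		{PodName: "pg-1", IsPrimary: true, IsReady: true},
		{PodName: "pg-2", IsPrimary: true, IsReady: true},
	}
	fmt.Println(decide("pg-2", statuses))

	// Once pg-1 acknowledges its replica role, no action is needed.
	statuses[0].IsPrimary = false
	fmt.Println(decide("pg-2", statuses))
}
```

The point of the early return is that a stale "I am primary" report is treated as a reason to wait, not as input to a new election.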

Closes: #2513

Release notes

Prevent double failover in case of lost connectivity

@leonardoce leonardoce requested a review from a team as a code owner October 10, 2024 15:32
@cnpg-bot cnpg-bot added the backport-requested ◀️, release-1.22, release-1.23, and release-1.24 labels Oct 10, 2024
@github-actions

❗ By default, the pull request is configured to backport to all release branches.

  • To stop backporting this PR, remove the label backport-requested ◀️ or add the label 'do not backport'.
  • To stop backporting this PR to a specific release branch, remove that branch's label: release-x.y

@leonardoce leonardoce force-pushed the api-split branch 3 times, most recently from 4f0e04f to cc666ff Compare October 12, 2024 11:48
@gbartolini gbartolini changed the title from "fix(failover): avoid failing over multiple times with unstable connectivity" to "fix(failover): prevent double failover in case of lost connectivity" Oct 14, 2024

leonardoce commented Oct 14, 2024

leonardoce and others added 2 commits October 15, 2024 09:17
…tivity

This patch prevents the operator from failing over two times when a Pod
loses connectivity and doesn't notice the change of its current role
from primary to replica.

The first failover would happen when the operator noticed the primary
Pod was no longer ready or present. The second failover would happen if
the old primary came back to life before the Kubelet timeout, with
the operator potentially failing over to the first Pod in the Pod list.

When this happens, the operator now waits for the Pod to recognize its
current role.

Closes: cloudnative-pg#2513

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
@fcanovai

/ok-to-merge

@cnpg-bot cnpg-bot added the ok to merge 👌 label Oct 15, 2024
@fcanovai fcanovai merged commit 3618164 into cloudnative-pg:main Oct 15, 2024
cnpg-bot pushed a commit that referenced this pull request Oct 15, 2024
…5788)

This patch ensures the operator does not trigger two failovers when a
primary Pod loses connectivity and fails to recognize its role change
from primary to replica.

Previously, the first failover occurred when the operator detected that
the primary Pod was no longer ready or present. A second failover could
be triggered if the old primary Pod recovered before the Kubelet
timeout, with the operator potentially promoting it to primary again
based on the Pod list.

With this patch, the operator will wait for the recovered Pod to
acknowledge its new role before taking further action, preventing
unnecessary failovers.

Closes: #2513

---------

Signed-off-by: Leonardo Cecchi <leonardo.cecchi@enterprisedb.com>
Signed-off-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
Co-authored-by: Gabriele Bartolini <gabriele.bartolini@enterprisedb.com>
(cherry picked from commit 3618164)
cnpg-bot pushed a commit that referenced this pull request Oct 15, 2024
cnpg-bot pushed a commit that referenced this pull request Oct 15, 2024
Labels

backport-requested ◀️, ok to merge 👌, release-1.22, release-1.23, release-1.24

Development

Successfully merging this pull request may close these issues.

Outdated primary becomes the cluster primary

4 participants