Keep synchronizing slots when others are lagging on primary. #31

Open: 1 commit to be merged into base: master

rdunklau
Contributor

Instead of blocking indefinitely until a replication slot becomes syncable, introduce a new GUC, pg_failover_slots.sync_timeout, after which we move on to the next slot.

To avoid waiting from scratch, we create the replication slots as temporary slots instead of ephemeral ones, allowing them to keep their state between runs. Once a slot is finally synced, we persist it to disk.

Since we no longer block in the waiting state, we need to clean up the inconsistent slots after promotion.

This solves the issue where an inactive slot on the primary prevents every other slot from being synced to the standby.
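
To make the intended behaviour easier to discuss, here is a rough, self-contained simulation of the control flow described above. It is not the extension's code (which is written in C) and it does not touch PostgreSQL at all: `StandbySlot`, `sync_pass`, `cleanup_after_promotion` and the `is_syncable` callback are hypothetical names invented for this illustration, and time is simulated with a counter rather than real waiting. Only the idea (give up on a slot after `sync_timeout`, keep its state for the next run, persist once synced, drop unsynced slots at promotion) comes from the description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StandbySlot:
    """Hypothetical stand-in for a slot on the standby."""
    name: str
    persisted: bool = False  # False: still temporary, not yet consistent
    attempts: int = 0        # state kept between runs instead of restarting


def sync_pass(primary_slots: List[str],
              standby: Dict[str, StandbySlot],
              is_syncable: Callable[[str, int], bool],
              sync_timeout: float,
              step: float = 1.0) -> List[str]:
    """Run one pass over all slots; return the names persisted so far."""
    for name in primary_slots:
        # Reuse the temporary slot from a previous run if there is one.
        slot = standby.setdefault(name, StandbySlot(name))
        if slot.persisted:
            continue
        waited = 0.0
        while waited < sync_timeout:
            slot.attempts += 1
            if is_syncable(name, slot.attempts):
                slot.persisted = True  # synced: persist it
                break
            waited += step  # simulated wait, no real sleeping
        # On timeout we simply move on to the next slot.
    return [s.name for s in standby.values() if s.persisted]


def cleanup_after_promotion(standby: Dict[str, StandbySlot]) -> List[str]:
    """Drop slots that never became consistent; return their names."""
    dropped = [n for n, s in standby.items() if not s.persisted]
    for n in dropped:
        del standby[n]
    return dropped
```

In this sketch, a slot that never becomes syncable only costs `sync_timeout` per pass, so the slots listed after it are still synced, and it is removed by `cleanup_after_promotion` if the standby is promoted before it catches up.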

@kfcss

kfcss commented Oct 5, 2023

We have noticed the same problems. This results in the secondary consuming more disk space than the primary.

Collaborator

@PJMODOS PJMODOS left a comment

This looks interesting. I am starting to wonder whether we need some kind of status function that reports the state of all the slots in a user-consumable way.

@rdunklau
Contributor Author

> This looks interesting. I am starting to wonder whether we need some kind of status function that reports the state of all the slots in a user-consumable way.

For monitoring purposes that's a pretty good idea. As of now, one can rely on the persistence of the slot to infer its status, but that's not ideal.

@rdunklau
Contributor Author

Do you have any opinion on that design?

@rdunklau
Contributor Author

I just rebased it on the current master branch.
