
Lease Checkpoints fix #13508

Merged
merged 4 commits on Dec 2, 2021
Conversation

serathius
Member

This is a continuation of the work on #13491. The main difference is that this PR proposes persisting remainingTTL on checkpoint, making checkpoints independent of snapshot frequency.

I also checked that the renew logic is not impacted by this change; Renew already includes logic that resets remainingTTL when needed.

The current checkpointing mechanism is buggy: new checkpoints for any lease
are scheduled only until the first leader change. This adds a fix for that
and a test that checks it.
Extend the lease checkpointing mechanism to cases where the whole etcd
cluster is restarted.
@serathius
Member Author

cc @ptabor

@serathius
Member Author

cc @jpbetz

@ptabor
Contributor

ptabor commented Nov 29, 2021

Can this cause inconsistency of checksums of the DBs computed on different instances?
I.e., checkpointing is enabled in 3.6 (or a 3.5.3 backport), so version 3.5.2 would have a different post-raft state compared to a version that didn't get upgraded to 3.5.2.
We should have tests / know the implications of such a 'mismatch'.

serathius force-pushed the checkpoints-fix branch 2 times, most recently from b4cf39e to 164fa1a on December 1, 2021 at 15:10
@serathius
Member Author

I have added proper feature guarding to ensure that we avoid checksum inconsistency.
First, we require the cluster version to be at least v3.6; this handles upgrades, since the cluster version is bumped only after all members of the cluster are upgraded. Second, I added a flag to provide compatibility for v3.5 -> v3.6 upgrades. As we are planning to backport this fix to v3.5, the flag will be used in v3.5 to enable checkpoint persistence. In v3.6, the flag will be used to ensure consistency during the upgrade, as it informs v3.6 members that the cluster already has this feature enabled.
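The gating logic described above could look roughly like the sketch below. This is illustrative only, under stated assumptions: the names (version, shouldPersistCheckpoint, persistFlag) are hypothetical, not etcd's actual identifiers; only the flag name --experimental-enable-lease-checkpoint-persist comes from the PR. Persisting remainingTTL is enabled when the whole cluster reports at least v3.6 (which implies every member has upgraded), or when the operator has explicitly set the backport flag.

```go
package main

import "fmt"

// version is a minimal semantic-version stand-in.
type version struct{ major, minor int }

// atLeast reports whether v is >= o.
func (v version) atLeast(o version) bool {
	return v.major > o.major || (v.major == o.major && v.minor >= o.minor)
}

// shouldPersistCheckpoint models the feature gate: persist remainingTTL
// on checkpoint only when the cluster version is at least v3.6 (the
// cluster version is bumped only after all members upgrade), or when
// the operator enabled the v3.5 backport flag
// --experimental-enable-lease-checkpoint-persist on every member.
func shouldPersistCheckpoint(clusterVersion version, persistFlag bool) bool {
	return persistFlag || clusterVersion.atLeast(version{3, 6})
}

func main() {
	fmt.Println(shouldPersistCheckpoint(version{3, 5}, false)) // mixed v3.5 cluster: false
	fmt.Println(shouldPersistCheckpoint(version{3, 5}, true))  // backport flag enabled: true
	fmt.Println(shouldPersistCheckpoint(version{3, 6}, false)) // all members on v3.6: true
}
```

The point of the two conditions is that neither alone covers both upgrade paths: the version check handles v3.6+ clusters automatically, while the flag lets a v3.5 cluster opt in, and tells freshly upgraded v3.6 members that the cluster already persists checkpoints.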

Contributor

@ptabor ptabor left a comment

LGTM. Thank you.

Resolved review threads (outdated):
server/etcdmain/config.go
server/lease/lessor.go (2 threads)
server/lease/lessor_test.go (3 threads)
Contributor

@jpbetz jpbetz left a comment

The use of persistTo and scheduling checkpoints on promotion look right to me. Thanks for fixing. One comment about flag docs.

Resolved review thread (outdated): server/embed/config.go
…lease-checkpoint-persist to persist lease remainingTTL

To avoid inconsistent behavior during cluster upgrade, we are feature
gating persistence behind the cluster version. This should ensure that
all cluster members are upgraded to v3.6 before the behavior changes.

To allow backporting this fix to v3.5, we are also introducing the flag
--experimental-enable-lease-checkpoint-persist, which will allow a
smooth upgrade in v3.5 clusters with this feature enabled.
@serathius
Member Author

Addressed all the comments, @ptabor please take another look.

@ahrtr
Member

ahrtr commented Jul 20, 2022

@serathius Do you have bandwidth to backport this PR to 3.4? :)

@serathius
Member Author

Done
