Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop writing v2store and validate confChange against v3store instead of v2store #17019

Draft
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

ahrtr
Copy link
Member

@ahrtr ahrtr commented Nov 25, 2023

Part of #12913

  • We already generate v2 snapshot using v3store. It makes no sense to write any data into v2store, because it isn't used at all.
  • We should validate conf change against v3store, which is the source of truth in 3.6.

@ahrtr ahrtr marked this pull request as draft November 25, 2023 15:27
@ahrtr ahrtr force-pushed the remove_v2store_write_20231125 branch 2 times, most recently from 568a2c7 to 4f1f945 Compare November 25, 2023 16:04
@ahrtr
Copy link
Member Author

ahrtr commented Nov 25, 2023

#16084 is superseded by this PR.

@ahrtr ahrtr force-pushed the remove_v2store_write_20231125 branch from 4f1f945 to 7fba6fb Compare November 25, 2023 16:11
@ahrtr ahrtr marked this pull request as ready for review November 25, 2023 16:18
@ahrtr ahrtr marked this pull request as draft November 25, 2023 16:45
// TODO: this must be switched to backend as well.
membersMap, removedMap := membersFromStore(c.lg, c.v2store)
func (c *RaftCluster) ValidateConfigurationChange(cc raftpb.ConfChange, shouldApplyV3 ShouldApplyV3) error {
// It makes no sense to validate a confChange if we don't apply it.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not true, we are still applying confChange as long as we bootstrap from snapshot and not consistent index.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, this is a longstanding design problem to me.

Current status summary

  • in older etcd versions ( < 3.4?) we load all data from v2store, and replay WAL entries on top of the snapshot. All work well.
  • Starting from 3.5, we load normal data from v3store and membership data still from v2store, and replay WAL entries on top of v2 snapshot. It works well for conf change. A little ugly for normal key/value data, but it still works.
  • Starting from 3.6 (main branch for now), we load all data from v3store on startup, but we still replay WAL entries on top of v2 snapshot.
    • For normal key/value data, etcd skips already applied entries; it works although ugly.
    • For conf change, the principle of "skiping already applied entries" doesn't work anymore.
      • For etcd, it already loaded membership data from v3store, so etcd can just skip already applied entries.
      • But raft doesn't know the latest membership data, we still depend on the raft loop to re-apply the conf change (with the old entry index < consistent-index), so as to notify raft all the membership data. So etcdd has to call raft's ApplyConfChange even etcd just skips it

Solutions

Short-term solutions

  1. Validate the conf change against v2store if it is already in v3store (shouldApplyV3 == false)
  2. apply conf change (read from v3store) to raft directly when restarting etcd members to bypass the raft loop? (need investigation)

Long-term solutions

  1. Bootstrap etcd from consistent_index.
    • It works for key/value data.
    • For conf change, it still needs to apply the conf change to raft directly when etcd restarting to bypass the raft loop.
  2. Generate v3 snapshot periodically (similar to what etcd did on v2 snapshot), and bootstrap etcd from v3 snapshot, and then replay WAL entries on top of it.
  • need to evaluate the performance of generate v3snapshot on huge db file (e.g. 8GB)
  • The disk space usage will be multiple times bigger, depending on how many snapshot files to keep.
  • Then why do we need the latest bbolt db file in such case?
  1. Generate v3 snapshot periodically (similar to what etcd did on v2 snapshot), but bootstrap etcd from consistent_index. The v3 snapshot files are just for corruption recovery, especially in single-node edge case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validate the conf change against v2store if it is already in v3store (shouldApplyV3 == false)

I know, #17017

Copy link
Member

@serathius serathius Nov 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apply conf change (read from v3store) to raft directly when restarting etcd members to bypass the raft loop? (need investigation)

I have draft implementation for that. #17022

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to evaluate the performance of generate v3snapshot on huge db file (e.g. 8GB)

v3 db snapshot is not a thing. That's an incompatible concept. Recommend reading the raft paper on in-memory snapshotable storage (etcd v2) vs persistent one (etcd v3).

@serathius
Copy link
Member

We should validate conf change against v3store, which is the source of truth in 3.6.

v3store will be a source of truth for membership, but it still is not for confState. confState is still loaded from snapshot and needs to be revalidate and reapplied using v2store. This validation cannot be skipped as it would lead to inconsistency with etcd v3.5. We need to migrate raft bootstrap to start from confstate in v3 store before we stop validating entries.

Let me add a test for this code too.

@serathius
Copy link
Member

Done, PTAL #17021

We already generate v2 snapshot using v3store to be compatible with
3.5. It makes no sense to write any data into v2store, which isn't
used at all.

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
@ahrtr ahrtr force-pushed the remove_v2store_write_20231125 branch 3 times, most recently from d3b20d2 to 9881db0 Compare November 26, 2023 16:08
@@ -517,52 +517,6 @@ func TestNodeToMemberBad(t *testing.T) {
}
}

func TestClusterAddMember(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please migrate such tests v3 instead of just removing them

@k8s-ci-robot
Copy link

@ahrtr: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-verify 9881db0 link true /test pull-etcd-verify
pull-etcd-unit-test-amd64 9881db0 link true /test pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64 9881db0 link true /test pull-etcd-unit-test-arm64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

3 participants