Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix: reading last snapshot in etcdutl migrate command #18010

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

fykaa
Copy link
Contributor

@fykaa fykaa commented May 15, 2024

Fixes a part of #17227

In the etcdutl/migrate_command.go file, I've rectified a minor oversight in the logic for reading the last snapshot during the WAL migration process. Previously, an empty walpb.Snapshot{} was passed as a reference, that would result in reading of the WAL from the beginning.

I will raise a follow-up PR that should also cover the addition of test cases, completing the solution for this issue

@k8s-ci-robot
Copy link

Hi @fykaa. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

To improve correctness of etcdutl migrate command

Signed-off-by: fykaa <faeka6@gmail.com>
@jmhbnz jmhbnz removed github_actions Pull requests that update GitHub Actions code area/robustness-testing area/observability labels May 19, 2024
@fykaa fykaa marked this pull request as ready for review May 20, 2024 12:43
@siyuanfoundation
Copy link
Contributor

Thanks @fykaa for the pr. Can you fix the tests?

@fykaa
Copy link
Contributor Author

fykaa commented May 24, 2024

Hi @siyuanfoundation, yes, I am on it. Fixing some test bugs that occured for me.

@karuppiah7890
Copy link
Contributor

/retest

@k8s-ci-robot
Copy link

@karuppiah7890: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@MadhavJivrajani
Copy link
Contributor

/ok-to-test

etcdutl/etcdutl/migrate_command.go Outdated Show resolved Hide resolved
etcdutl/etcdutl/migrate_command.go Outdated Show resolved Hide resolved
if err != nil {
return nil, fmt.Errorf(`failed to find last snapshot: %v`, err)
}
w, err := wal.OpenForRead(c.lg, walPath, walpb.Snapshot{Index: lastSnapshot})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e2e failure logs: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/etcd-io_etcd/18010/pull-etcd-e2e-amd64/1794986793938980864

The error seems to be coming from:

return nil, fmt.Errorf(`failed to read wal: %v`, err)

Error excerpt:

        	Error Trace:	/home/prow/go/src/github.com/etcd-io/etcd/tests/e2e/utl_migrate_test.go:162
        	Error:      	Error "[/home/prow/go/src/github.com/etcd-io/etcd/bin/etcdutl migrate --data-dir /tmp/TestEtctlutlMigrateMigrate_v3.5_to_v3.5_is_no-op2858091090/002/member-0 --target-version 3.5] match not found.  Set EXPECT_DEBUG for more info Errs: [unexpected exit code [128] after running [/home/prow/go/src/github.com/etcd-io/etcd/bin/etcdutl migrate --data-dir /tmp/TestEtctlutlMigrateMigrate_v3.5_to_v3.5_is_no-op2858091090/002/member-0 --target-version 3.5]], last lines:\n2024-05-27T07:18:46Z\tinfo\tbbolt\tbackend/backend.go:197\tOpening db file (/tmp/TestEtctlutlMigrateMigrate_v3.5_to_v3.5_is_no-op2858091090/002/member-0/member/snap/db) with mode 2d72772d2d2d2d2d2d2d and with options: {Timeout: 0s, NoGrowSync: false, NoFreelistSync: true, PreLoadFreelist: false, FreelistType: , ReadOnly: false, MmapFlags: 8000, InitialMmapSize: 10737418240, PageSize: 0, NoSync: false, OpenFile: 0x0, Mlock: false, Logger: 0xc000130238}\r\n2024-05-27T07:18:46Z\tinfo\tbbolt\tbbolt@v1.4.0-alpha.1/db.go:321\tOpening bbolt db (/tmp/TestEtctlutlMigrateMigrate_v3.5_to_v3.5_is_no-op2858091090/002/member-0/member/snap/db) successfully\r\nError: failed to read wal: wal: snapshot mismatch\r\n (expected \"storage version up-to-date\\t{\\\"storage-version\\\": \\\"3.5\\\"}\", got []). Try EXPECT_DEBUG=TRUE" does not contain "storage version up-to-date\t{\"storage-version\": \"3.5\"}"

The error shown:

wal: snapshot mismatch

is returned only when indices match but Terms don't afaict, this is the error type:

ErrSnapshotMismatch = errors.New("wal: snapshot mismatch")

and that comes from here because we end up calling ReadAll as part of ReadWALVersion:

if snap.Term != w.start.Term {
state.Reset()
return nil, state, nil, ErrSnapshotMismatch
}

w.start here is ultimately the snapshot object that we pass into OpenForRead:

start: snap,

I think we should be passing in the Term here as well as part of walpb.Snapshot

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for pointing out the issue with the snapshot mismatch in the wal package.

I've made some adjustments to implement your suggestion. The getLastSnapshot function now returns the walpb.Snapshot object, which includes full snapshot information (the Index and Term). This snapshot object is then passed directly to wal.OpenForRead. I believe this should help mitigate the ErrSnapshotMismatch issue.

@siyuanfoundation
Copy link
Contributor

/retest

@k8s-ci-robot
Copy link

@fykaa: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-e2e-amd64 3815ab0 link false /test pull-etcd-e2e-amd64

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@siyuanfoundation
Copy link
Contributor

siyuanfoundation commented Jun 14, 2024

if you print out the WAL version after

c.walVersion, err = wal.ReadWALVersion(w)
if err != nil {
  return nil, fmt.Errorf(`failed to read wal: %v`, err)
}
fmt.Printf("got walVersion = %v\n", c.walVersion.MinimalEtcdVersion().String())

for the test Downgrade v3.6 to v3.5 with force should work, the main branch walVersion=3.6.0, with this change walVersion=3.0.0. So it is not reading the correct snapshot. The test got stuck in e2e.SpawnWithExpect forever because it could not find the expected log message.

@serathius
Copy link
Member

ping any updates?

@k8s-ci-robot
Copy link

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

7 participants