Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix a node can't accept more logs after receiving snapshot #3909

Merged
merged 4 commits into from
Mar 18, 2022

Conversation

critical27
Copy link
Contributor

@critical27 critical27 commented Feb 17, 2022

What type of PR is this?

  • bug
  • feature
  • enhancement

What problem(s) does this PR solve?

Issue(s) number:

Description:

See the logs below, logs unrelated has been ignored:

  1. A node has received a snapshot of committedLogId_ 14958, committedLogTerm_ 27,
  2. It append one more log successfully (see line 5). The lastLogId_ is 14959 now
  3. Leader send a appendLogRequest with lastLogIdSent = 14965 (line 7), we will return LOG_GAP, and ask leader send logs from committedLogId_ (14958)
  4. Leader do send logs from 14958, however, the weird things happened:
    Since out first log id of wal is 14959, we don't have the log of 14958, so the condition below always is true:
    wal_->getLogTerm(req.get_last_log_id_sent()) != req.get_last_log_term_sent(), so we ask leader send logs from committedLogId_(14958) again
  5. Repeat step 4 forever
1 RaftPart.cpp:1795] [Port: 9780, Space: 1, Part: 170] Begin to receive the snapshot
2 Part.cpp:464] [Port: 9780, Space: 1, Part: 170] Clean rocksdb part data
3 RaftPart.cpp:1823] [Port: 9780, Space: 1, Part: 170] Receive all snapshot, committedLogId_ 14958, committedLogTerm_ 27, lastLodId 14958, lastLogTermId 27
4 Part.cpp:206] [Port: 9780, Space: 1, Part: 170] Find the new leader "store1":9780
5 RaftPart.cpp:1471] [Port: 9780, Space: 1, Part: 170] Received logAppend: GraphSpaceId = 1, partition = 170, leaderIp = store1, leaderPort = 9780, current_term = 27, committedLogId = 14958, lastLogIdSent = 14958, lastLogTermSent = 27, num_logs = 1, local lastLogId = 14958, local lastLogTerm = 27, local committedLogId = 14958, local current term = 27, wal lastLogId = 0
6 RaftPart.cpp:1677] [Port: 9780, Space: 1, Part: 170] The current role is Follower. Will follow the new leader "store1":9780 on term 28
7 RaftPart.cpp:1471] [Port: 9780, Space: 1, Part: 170] Received logAppend: GraphSpaceId = 1, partition = 170, leaderIp = store1, leaderPort = 9780, current_term = 28, committedLogId = 14975, lastLogIdSent = 14965, lastLogTermSent = 27, num_logs = 11, local lastLogId = 14959, local lastLogTerm = 27, local committedLogId = 14958, local current term = 28, wal lastLogId = 14959
8 RaftPart.cpp:1471] [Port: 9780, Space: 1, Part: 170] Received logAppend: GraphSpaceId = 1, partition = 170, leaderIp = store1, leaderPort = 9780, current_term = 28, committedLogId = 14975, lastLogIdSent = 14958, lastLogTermSent = 27, num_logs = 18, local lastLogId = 14959, local lastLogTerm = 27, local committedLogId = 14958, local current term = 28, wal lastLogId = 14959
9 WalFileIterator.cpp:24] [Port: 9780, Space: 1, Part: 170] The given log id 14958 is out of the range, the wal firstLogId is 14959
10 RaftPart.cpp:1471] [Port: 9780, Space: 1, Part: 170] Received logAppend: GraphSpaceId = 1, partition = 170, leaderIp = store1, leaderPort = 9780, current_term = 28, committedLogId = 14975, lastLogIdSent = 14958, lastLogTermSent = 27, num_logs = 18, local lastLogId = 14959, local lastLogTerm = 27, local committedLogId = 14958, local current term = 28, wal lastLogId = 14959
11 WalFileIterator.cpp:24] [Port: 9780, Space: 1, Part: 170] The given log id 14958 is out of the range, the wal firstLogId is 14959
12 RaftPart.cpp:1471] [Port: 9780, Space: 1, Part: 170] Received logAppend: GraphSpaceId = 1, partition = 170, leaderIp = store1, leaderPort = 9780, current_term = 28, committedLogId = 14975, lastLogIdSent = 14958, lastLogTermSent = 27, num_logs = 18, local lastLogId = 14959, local lastLogTerm = 27, local committedLogId = 14958, local current term = 28, wal lastLogId = 14959
13 WalFileIterator.cpp:24] [Port: 9780, Space: 1, Part: 170] The given log id 14958 is out of the range, the wal firstLogId is 14959
...

How do you solve it?

When we failed to get a log from wal (for example, we pull snapshot), we could use committedLogId_ and committedLogTerm_ to do raft log matching.

Special notes for your reviewer, ex. impact of this fix, design document, etc:

Checklist:

Tests:

  • standalone test environment

Affects:

  • Documentation affected (Please add the label if documentation needs to be modified.)
  • Incompatibility (If it breaks the compatibility, please describe it and add the label.)
  • If it's needed to cherry-pick (If cherry-pick to some branches is required, please label the destination version(s).)
  • Performance impacted: Consumes more CPU/Memory

Release notes:

Please confirm whether to be reflected in release notes and how to describe:
Fix a node can't accept more logs after receiving snapshot

wal_->lastLogId() < req.get_last_log_id_sent() ||
wal_->getLogTerm(req.get_last_log_id_sent()) != req.get_last_log_term_sent()) {
wal_->lastLogId() < req.get_last_log_id_sent()) {
// case 1 and case 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so case 1 is last_log_id < committedLogId_ and case 2 is wal_->lastLogId() < get_last_log_id_sent ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, see the comment above in line 1539

@critical27
Copy link
Contributor Author

Don't merge plz, test is still on-going

@critical27 critical27 removed the ready-for-testing PR: ready for the CI test label Mar 3, 2022
@critical27
Copy link
Contributor Author

Ready to go

@Sophie-Xie Sophie-Xie added the ready-for-testing PR: ready for the CI test label Mar 17, 2022
@Shinji-IkariG Shinji-IkariG merged commit 8585de6 into vesoft-inc:master Mar 18, 2022
@critical27 critical27 deleted the snapshot branch March 18, 2022 07:54
@critical27 critical27 mentioned this pull request Jul 4, 2022
11 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-testing PR: ready for the CI test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants