-
Notifications
You must be signed in to change notification settings - Fork 213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Join from an existing snapshot #1532
Conversation
…nto async_snapshot_generation
@@ -571,13 +588,14 @@ if(BUILD_TESTS) | |||
NAME reconfiguration_test | |||
PYTHON_SCRIPT ${CMAKE_SOURCE_DIR}/tests/reconfiguration.py | |||
CONSENSUS raft | |||
ADDITIONAL_ARGS --raft-election-timeout 4000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lowering the raft election timeout here (default is 100s) as the reconfiguration test now also waits for a new election to complete.
auto seqno = network.tables->current_version(); | ||
consensus->force_become_backup(seqno, sig->view); | ||
|
||
reset_data(config.joining.snapshot); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once the snapshot has been applied, we don't want to keep it around in enclave memory.
join_from_snapshot@11998 aka 20200827.8 vs master ewma over 50 builds from 11569 to 11985 |
|
||
reset_data(config.joining.snapshot); | ||
LOG_INFO_FMT( | ||
"Joiner successfully resumed from snapshot at seqno {} and " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds a slightly more final than it really is: we've started resuming from this snapshot, but until we see the evidence from the primary, we don't know if the snapshot is valid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right. That's the next step for this feature so I will address that in the next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with some minor comments.
Co-authored-by: Amaury Chamayou <amaury@xargs.fr>
Co-authored-by: Amaury Chamayou <amaury@xargs.fr>
The PR that puts all the previous snapshot PRs together.
It is now possible for a node to join a network from an existing snapshot, making the catch-up phase much quicker (basically, only replaying the entries from the last snapshot to the last index).
To do so, the operator(s) of the CCF network are responsible for copying the snapshot file(s) produced by one node (under the
snapshots/
folder) to the directory of the new joiner. The new joiner will automatically try to join from the latest (based on thesnapshot.idx
file name) snapshot, deserialising the snapshot in its store when joining (*).Raft will then be forced to become follower at a specific commit index. The snapshotter and the ledger will also have to be initialised at that index so that the ledger and snapshots on the joiner node are the same than those on other nodes in the network.
I've also added a new
snapshot
test suite to test this more thoroughly + all the necessary changes to the Python infra to get the snapshots from one node copied over to the new node.(*) The snapshot can only be deserialised after the ledger secrets have been passed in by the network so that the snapshot can be decrypted.
Next: