fix(state-snapshots): fix skipped state snapshot after header sync #12920
Conversation
We decide whether to take a snapshot based on the return value of is_sync_prev_hash(), which looks at the StateSyncNewChunks column to tell whether the given hash will be the prev_hash of the sync_hash. But the current logic assumes that this will be called only before the sync hash is actually found, because we clear that column after it is found since the contents are not needed anymore.

This means that if we processed the sync header during header sync instead of during normal block processing, then is_sync_prev_hash() will not return true when it should. We can fix it by just checking if the sync hash is already set before looking at the values in StateSyncNewChunks.
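In other words, the check gains an early path that consults the saved sync hash before touching StateSyncNewChunks. A rough sketch of the shape of the fix (the signature and the get_current_epoch_sync_hash() call come from the snippet quoted in the review below; the body of the early return and the fallback are illustrative assumptions, not the exact code that landed):

```rust
pub(crate) fn is_sync_prev_hash(chain_store: &ChainStoreAdapter, tip: &Tip) -> Result<bool, Error> {
    // Once the sync hash has been found and saved, the StateSyncNewChunks
    // column is cleared, so answer directly from the saved sync hash
    // instead of reading that column.
    if let Some(sync_hash) = chain_store.get_current_epoch_sync_hash(&tip.epoch_id)? {
        let sync_header = chain_store.get_block_header(&sync_hash)?;
        return Ok(*sync_header.prev_hash() == tip.last_block_hash);
    }
    // Sync hash not known yet: fall back to the existing logic that reads
    // the StateSyncNewChunks column (elided here).
    todo!("existing StateSyncNewChunks-based check")
}
```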
This gets triggered when the node falls behind before it needs to take a snapshot, and then header-syncs before actually applying the blocks. It can be reproduced with this diff:

```diff
diff --git a/chain/chain/src/chain.rs b/chain/chain/src/chain.rs
index 03d846747..b70d3a554 100644
--- a/chain/chain/src/chain.rs
+++ b/chain/chain/src/chain.rs
@@ -2562,6 +2562,10 @@ impl Chain {
         if !self.epoch_manager.is_next_block_epoch_start(prev_hash)? {
             return Ok((self.prev_block_is_caught_up(prev_prev_hash, prev_hash)?, None));
         }
+        if me.as_ref().map(|a| a.as_str()) == Some("test5") {
+            eprintln!("xxxxxxxx sleep {}", block_hash);
+            std::thread::sleep(std::time::Duration::from_secs(20));
+        }
         if !self.prev_block_is_caught_up(prev_prev_hash, prev_hash)? {
             // The previous block is not caught up for the next epoch relative to the previous
             // block, which is the current epoch for this block, so this block cannot be applied
diff --git a/pytest/tests/sanity/transactions.py b/pytest/tests/sanity/transactions.py
index bdc72f68f..74bae75d6 100755
--- a/pytest/tests/sanity/transactions.py
+++ b/pytest/tests/sanity/transactions.py
@@ -23,9 +23,9 @@ nodes = start_cluster(
     num_shards=4,
     config=None,
     extra_state_dumper=True,
-    genesis_config_changes=[["min_gas_price",
+    genesis_config_changes=[['protocol_version', 73], ["min_gas_price",
                             0], ["max_inflation_rate", [0, 1]],
-                            ["epoch_length", 10],
+                            ["epoch_length", 100],
                             ["block_producer_kickout_threshold", 70]],
     client_config_changes={
         0: {
@@ -111,6 +111,9 @@ nodes = start_cluster(
     }
 })
+for height, hash in utils.poll_blocks(nodes[5], timeout=99999999):
+    if height > 280:
+        exit()
 started = time.time()
 act_to_val = [4, 4, 4, 4, 4, 4]
```
Before this change you will see:

```console
$ python3 tests/sanity/transactions.py
...
$ grep -E "(make_snapshot_callback sends)|(Wrong snapshot hash)" ~/.near/test5_finished/stderr | head -n 3
2025-02-12T21:58:24.014299Z  INFO handle{handler="ShardsManagerResponse" actor="ClientActorInner>"}:NetworkClientMessage:process_blocks_with_missing_chunks:start_process_block_async{provenance=NONE height=102}:start_process_block_impl{block_height=102}: state_snapshot: make_snapshot_callback sends `DeleteAndMaybeCreateSnapshotRequest` to state_snapshot_addr prev_block_hash=coXGCuyPVpTRyuuB9QZBB91hSzwt1qZaoPMgGrsYUy2 shard_indexes_and_uids=[(0, s3.v0), (1, s1.v0), (2, s0.v0), (3, s2.v0)]
2025-02-12T21:59:19.063020Z  WARN state_sync_dump: Failed to obtain state part. Retrying in 200 millis. shard_id=0 epoch_height=2 epoch_id=EpochId(3ZFnzcsMycVWo95x4uhpQ1kTthNBmTLnw6nvTrHtUJpG) part_id=PartId { idx: 0, total: 1 } error=Other("StorageInconsistentState(\"Wrong snapshot hash. Requested: 5YKYpFDMRvptreBnwasH4nFE7r3gqoS8J53QEpSS69an, Available: coXGCuyPVpTRyuuB9QZBB91hSzwt1qZaoPMgGrsYUy2\")")
2025-02-12T21:59:19.063124Z  WARN state_sync_dump: Failed to obtain state part. Retrying in 200 millis. shard_id=3 epoch_height=2 epoch_id=EpochId(3ZFnzcsMycVWo95x4uhpQ1kTthNBmTLnw6nvTrHtUJpG) part_id=PartId { idx: 0, total: 1 } error=Other("StorageInconsistentState(\"Wrong snapshot hash. Requested: 5YKYpFDMRvptreBnwasH4nFE7r3gqoS8J53QEpSS69an, Available: coXGCuyPVpTRyuuB9QZBB91hSzwt1qZaoPMgGrsYUy2\")")
```
Then with this PR:

```console
$ python3 tests/sanity/transactions.py
...
$ grep -E "(make_snapshot_callback sends)|(Wrong snapshot hash)" ~/.near/test5_finished/stderr | head -n 3
2025-02-12T22:08:03.216420Z  INFO handle{handler="ShardsManagerResponse" actor="ClientActorInner>"}:NetworkClientMessage:process_blocks_with_missing_chunks:start_process_block_async{provenance=NONE height=102}:start_process_block_impl{block_height=102}: state_snapshot: make_snapshot_callback sends `DeleteAndMaybeCreateSnapshotRequest` to state_snapshot_addr prev_block_hash=6DNVkZdLdDVvL8bA5G6zhFa7TsLCRzaR7mvCxA7br7v shard_indexes_and_uids=[(0, s3.v0), (1, s1.v0), (2, s0.v0), (3, s2.v0)]
2025-02-12T22:08:54.300846Z  INFO handle{handler="ShardsManagerResponse" actor="ClientActorInner>"}:NetworkClientMessage:process_blocks_with_missing_chunks:start_process_block_async{provenance=NONE height=205}:start_process_block_impl{block_height=205}: state_snapshot: make_snapshot_callback sends `DeleteAndMaybeCreateSnapshotRequest` to state_snapshot_addr prev_block_hash=7ju5AL2YzyRM2NLwva2JAX4FpEYXUBg1z7D62E9aihjX shard_indexes_and_uids=[(0, s3.v0), (1, s1.v0), (2, s0.v0), (3, s2.v0)]
```

Note there is only one …
LGTM, nice find!
```diff
-) -> Result<bool, Error> {
-    let Some(new_chunks) = get_state_sync_new_chunks(store, block_hash)? else {
+pub(crate) fn is_sync_prev_hash(chain_store: &ChainStoreAdapter, tip: &Tip) -> Result<bool, Error> {
+    if let Some(sync_hash) = chain_store.get_current_epoch_sync_hash(&tip.epoch_id)? {
```
Can you add a short comment about why this is important? Otherwise it looks like just an optimization, but it's actually about correctness.
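One possible wording for such a comment (illustrative; not necessarily the text that landed in the PR):

```rust
// Check the saved sync hash first: once the sync hash has been found and
// saved, the StateSyncNewChunks column is cleared, so the lookup below
// would wrongly return false. This is a correctness requirement, not an
// optimization.
if let Some(sync_hash) = chain_store.get_current_epoch_sync_hash(&tip.epoch_id)? {
```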