
fix(state-snapshots): fix skipped state snapshot after header sync #12920

Merged
merged 2 commits into near:master from snapshot-header-sync on Feb 13, 2025

Conversation

marcelo-gonzalez
Contributor

We decide whether to take a snapshot based on the return value of is_sync_prev_hash(), which looks at the StateSyncNewChunks column to tell whether the given hash will be the prev_hash of the sync_hash. But the current logic assumes that this will be called only before the sync hash is actually found, because we clear that column after it is found since the contents are not needed anymore.

This means that if we processed the sync header during header sync instead of during normal block processing, then is_sync_prev_hash() will not return true when it should. We can fix it by just checking if the sync hash is already set before looking at the values in StateSyncNewChunks.
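
For readers skimming the thread, here is a minimal, self-contained Rust sketch of that decision logic. The types and names (MockStore, sync_hash_per_epoch, prev_hash_of, state_sync_new_chunks, and the simplified fallback) are hypothetical stand-ins for nearcore's ChainStoreAdapter, the per-epoch sync hash lookup, and the StateSyncNewChunks column; this is not the actual PR code.

use std::collections::HashMap;

type Hash = u64;
type EpochId = u64;

// Stand-in for the pieces of chain storage this check needs.
struct MockStore {
    // Sync hash per epoch, set once the sync block is known (whether it was
    // found during header sync or during normal block processing).
    sync_hash_per_epoch: HashMap<EpochId, Hash>,
    // prev_hash recorded for every block we know about.
    prev_hash_of: HashMap<Hash, Hash>,
    // Stand-in for the StateSyncNewChunks column; cleared once the sync hash
    // for the epoch has been found.
    state_sync_new_chunks: HashMap<Hash, Vec<u8>>,
}

impl MockStore {
    // Will `block_hash` be the prev_hash of this epoch's sync_hash?
    fn is_sync_prev_hash(&self, epoch_id: &EpochId, block_hash: &Hash) -> bool {
        // The fix: if the sync hash is already set, answer from it directly.
        // At this point StateSyncNewChunks may already be cleared, so the
        // fallback below would wrongly say "no".
        if let Some(sync_hash) = self.sync_hash_per_epoch.get(epoch_id) {
            return self.prev_hash_of.get(sync_hash) == Some(block_hash);
        }
        // Old behavior: consult the still-populated StateSyncNewChunks
        // entries (reduced to a presence check in this sketch).
        self.state_sync_new_chunks.contains_key(block_hash)
    }
}

fn main() {
    // Sync hash already found (e.g. during header sync), column cleared.
    let store = MockStore {
        sync_hash_per_epoch: HashMap::from([(1, 0xB)]),
        prev_hash_of: HashMap::from([(0xB, 0xA)]),
        state_sync_new_chunks: HashMap::new(),
    };
    // Without the early check this would be false and the snapshot would be skipped.
    assert!(store.is_sync_prev_hash(&1, &0xA));
    assert!(!store.is_sync_prev_hash(&1, &0xB));
    println!("take snapshot at 0xA: {}", store.is_sync_prev_hash(&1, &0xA));
}

The real code works over the chain Tip and store adapters rather than raw hashes, but the point of the PR is the ordering of the two checks: consult the stored sync hash first, and only fall back to StateSyncNewChunks while it is still populated.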

@marcelo-gonzalez marcelo-gonzalez requested a review from a team as a code owner February 12, 2025 21:52
@marcelo-gonzalez
Contributor Author

This gets triggered when the node falls behind before it needs to take a snapshot, and then header-syncs before actually applying the blocks. It can be reproduced with this diff:

diff --git a/chain/chain/src/chain.rs b/chain/chain/src/chain.rs
index 03d846747..b70d3a554 100644
--- a/chain/chain/src/chain.rs
+++ b/chain/chain/src/chain.rs
@@ -2562,6 +2562,10 @@ impl Chain {
         if !self.epoch_manager.is_next_block_epoch_start(prev_hash)? {
             return Ok((self.prev_block_is_caught_up(prev_prev_hash, prev_hash)?, None));
         }
+        if me.as_ref().map(|a| a.as_str()) == Some("test5") {
+            eprintln!("xxxxxxxx sleep {}", block_hash);
+            std::thread::sleep(std::time::Duration::from_secs(20));
+        }
         if !self.prev_block_is_caught_up(prev_prev_hash, prev_hash)? {
             // The previous block is not caught up for the next epoch relative to the previous
             // block, which is the current epoch for this block, so this block cannot be applied
diff --git a/pytest/tests/sanity/transactions.py b/pytest/tests/sanity/transactions.py
index bdc72f68f..74bae75d6 100755
--- a/pytest/tests/sanity/transactions.py
+++ b/pytest/tests/sanity/transactions.py
@@ -23,9 +23,9 @@ nodes = start_cluster(
     num_shards=4,
     config=None,
     extra_state_dumper=True,
-    genesis_config_changes=[["min_gas_price",
+    genesis_config_changes=[['protocol_version', 73], ["min_gas_price",
                              0], ["max_inflation_rate", [0, 1]],
-                            ["epoch_length", 10],
+                            ["epoch_length", 100],
                             ["block_producer_kickout_threshold", 70]],
     client_config_changes={
         0: {
@@ -111,6 +111,9 @@ nodes = start_cluster(
         }
     })
 
+for height, hash in utils.poll_blocks(nodes[5], timeout=99999999):
+    if height > 280:
+        exit()
 started = time.time()
 
 act_to_val = [4, 4, 4, 4, 4, 4]

Before this change you will see:

$ python3 tests/sanity/transactions.py
...
$ grep -E "(make_snapshot_callback sends)|(Wrong snapshot hash)" ~/.near/test5_finished/stderr | head -n 3
2025-02-12T21:58:24.014299Z  INFO handle{handler="ShardsManagerResponse" actor="ClientActorInner>"}:NetworkClientMessage:process_blocks_with_missing_chunks:start_process_block_async{provenance=NONE height=102}:start_process_block_impl{block_height=102}: state_snapshot: make_snapshot_callback sends `DeleteAndMaybeCreateSnapshotRequest` to state_snapshot_addr prev_block_hash=coXGCuyPVpTRyuuB9QZBB91hSzwt1qZaoPMgGrsYUy2 shard_indexes_and_uids=[(0, s3.v0), (1, s1.v0), (2, s0.v0), (3, s2.v0)]
2025-02-12T21:59:19.063020Z  WARN state_sync_dump: Failed to obtain state part. Retrying in 200 millis. shard_id=0 epoch_height=2 epoch_id=EpochId(3ZFnzcsMycVWo95x4uhpQ1kTthNBmTLnw6nvTrHtUJpG) part_id=PartId { idx: 0, total: 1 } error=Other("StorageInconsistentState(\"Wrong snapshot hash. Requested: 5YKYpFDMRvptreBnwasH4nFE7r3gqoS8J53QEpSS69an, Available: coXGCuyPVpTRyuuB9QZBB91hSzwt1qZaoPMgGrsYUy2\")")
2025-02-12T21:59:19.063124Z  WARN state_sync_dump: Failed to obtain state part. Retrying in 200 millis. shard_id=3 epoch_height=2 epoch_id=EpochId(3ZFnzcsMycVWo95x4uhpQ1kTthNBmTLnw6nvTrHtUJpG) part_id=PartId { idx: 0, total: 1 } error=Other("StorageInconsistentState(\"Wrong snapshot hash. Requested: 5YKYpFDMRvptreBnwasH4nFE7r3gqoS8J53QEpSS69an, Available: coXGCuyPVpTRyuuB9QZBB91hSzwt1qZaoPMgGrsYUy2\")")

Then with this PR:

$ python3 tests/sanity/transactions.py
...
$ grep -E "(make_snapshot_callback sends)|(Wrong snapshot hash)" ~/.near/test5_finished/stderr | head -n 3
2025-02-12T22:08:03.216420Z  INFO handle{handler="ShardsManagerResponse" actor="ClientActorInner>"}:NetworkClientMessage:process_blocks_with_missing_chunks:start_process_block_async{provenance=NONE height=102}:start_process_block_impl{block_height=102}: state_snapshot: make_snapshot_callback sends `DeleteAndMaybeCreateSnapshotRequest` to state_snapshot_addr prev_block_hash=6DNVkZdLdDVvL8bA5G6zhFa7TsLCRzaR7mvCxA7br7v shard_indexes_and_uids=[(0, s3.v0), (1, s1.v0), (2, s0.v0), (3, s2.v0)]
2025-02-12T22:08:54.300846Z  INFO handle{handler="ShardsManagerResponse" actor="ClientActorInner>"}:NetworkClientMessage:process_blocks_with_missing_chunks:start_process_block_async{provenance=NONE height=205}:start_process_block_impl{block_height=205}: state_snapshot: make_snapshot_callback sends `DeleteAndMaybeCreateSnapshotRequest` to state_snapshot_addr prev_block_hash=7ju5AL2YzyRM2NLwva2JAX4FpEYXUBg1z7D62E9aihjX shard_indexes_and_uids=[(0, s3.v0), (1, s1.v0), (2, s0.v0), (3, s2.v0)]

Note that there is only one make_snapshot_callback sends log line in the run from master, while the run with this PR has two and no Wrong snapshot hash warnings.

@wacban wacban (Contributor) left a comment

LGTM, nice find!

-) -> Result<bool, Error> {
-    let Some(new_chunks) = get_state_sync_new_chunks(store, block_hash)? else {
+pub(crate) fn is_sync_prev_hash(chain_store: &ChainStoreAdapter, tip: &Tip) -> Result<bool, Error> {
+    if let Some(sync_hash) = chain_store.get_current_epoch_sync_hash(&tip.epoch_id)? {

Can you add a short comment about why this is important? Otherwise it looks like just an optimization but it's actually about correctness.
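
Something along these lines (hypothetical wording, not the comment that ultimately landed in the PR) would capture it:

// Important for correctness, not just a shortcut: once the sync hash is known
// (e.g. the sync block was found during header sync), StateSyncNewChunks has
// already been cleared, so falling through to that lookup would wrongly report
// that this block is not the sync prev hash and the snapshot would be skipped.
if let Some(sync_hash) = chain_store.get_current_epoch_sync_hash(&tip.epoch_id)? {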

@marcelo-gonzalez marcelo-gonzalez added this pull request to the merge queue Feb 13, 2025
Merged via the queue into near:master with commit a425867 Feb 13, 2025
23 of 24 checks passed
@marcelo-gonzalez marcelo-gonzalez deleted the snapshot-header-sync branch February 13, 2025 15:06
marcelo-gonzalez added a commit to marcelo-gonzalez/nearcore that referenced this pull request Feb 13, 2025
fix(state-snapshots): fix skipped state snapshot after header sync (near#12920)

marcelo-gonzalez added a commit that referenced this pull request Feb 13, 2025
These cherry-picks fix some issues found in testing:

* #12907
* #12912
* #12920