Ensure that state snapshot works for new shards #11600

Longarithm · 2024-06-17T20:51:51Z

I suspect we want to support state snapshots for shards which were freshly created. However, in this line

nearcore/core/store/src/trie/state_snapshot.rs

Line 100 in 61c67c6

if let Some(chunk) = block.chunks().get(shard_uid.shard_id as usize) {

we assume that shard UIDs correspond to the previous epoch's shard layout only. We may want to support the new shard layout as well.
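
A minimal sketch of the concern, using simplified stand-in types rather than the real nearcore definitions: `ShardUId`, `ChunkHeader`, `Block`, the `layout_version` field and `chunk_for_shard` are all hypothetical here. It shows how indexing chunks by `shard_id` alone can silently return a chunk from the old layout, while a layout-aware lookup makes the mismatch explicit.

```rust
// Hypothetical, simplified stand-ins for nearcore types; not the real API.

/// Identifies a shard together with the version of the shard layout it belongs to.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ShardUId {
    pub version: u32,
    pub shard_id: u32,
}

/// Minimal chunk header: only the shard it was produced for.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct ChunkHeader {
    pub shard_id: u32,
}

/// Minimal block: the layout version it was produced under, plus one chunk per shard.
pub struct Block {
    pub layout_version: u32,
    pub chunks: Vec<ChunkHeader>,
}

/// Looks up the chunk for `shard_uid`, returning `None` when the shard belongs
/// to a different layout than the block (for example, a freshly created shard).
pub fn chunk_for_shard(block: &Block, shard_uid: ShardUId) -> Option<&ChunkHeader> {
    if shard_uid.version != block.layout_version {
        return None;
    }
    block.chunks.get(shard_uid.shard_id as usize)
}
```

With a plain `block.chunks.get(shard_uid.shard_id as usize)`, a shard from a newer layout whose `shard_id` happens to be in range would be matched against an unrelated chunk. How the real code should resolve new shards (for example, via their parent shard) is left open here.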

Solution to #11583.

The current logic to update flat storage for a shard doesn't work for memtrie loading in a rare case. If a shard doesn't contain active validators and didn't get any transactions or receipts, the block at the flat storage head gets garbage-collected, and the attempt to read its state root for the assertion panics. This happens because we have non-strict mode, which itself is used to make `StateSnapshot` work.

But essentially it is enough to **not** move the flat storage head past `epoch_last_block.chunks(shard_id).prev_block_hash()`. The flat storage state **past** this block exactly corresponds to the state we are syncing, see also #11600. So this is exactly the new flat head candidate we compute and pass to `update_flat_head`. Passing tests show that non-strict mode is not needed.

After that, we will have a GC problem if there are no chunks for a shard or no finality in the stored epochs, but we already assume during development that this does not happen.

Nayduck will be at https://nayduck.nearone.org/#/run/149

## Practical example

One edge case in which the state snapshot still works is when the client has just processed the **second** block in an epoch. Then the last final block is not earlier than the last block of the previous epoch, so the new flat head is not earlier than `prev_block_hash` of the last chunk for our shard in it. Then the state snapshot still works.

In the old implementation, this was guaranteed because, while we passed the last final block, we made _two steps back by non-empty state transitions_. The first jump is guaranteed to skip the last block **because it may contain validator updates**; the second jump is guaranteed to skip the last chunk. So the guarantees are the same.

## test_load_memtrie_after_empty_chunks

* Add GCActor to the TestLoop. It clears blocks in the background and doesn't need external control.
* Ensure that shard 0 has no validators and only empty chunks for a long time.
* Unload the memtrie for shard 0 and load it back. I checked that in non-strict mode, as before the fix, it panics.
* Additionally, check that if 2 chunks at the end of an epoch are always missing, and we always move the flat head to the final known block, then snapshotting always fails. So accounting for the latest chunk is actually needed!
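
The rule described above (never move the flat head past the block referenced by the shard's last chunk of the epoch) can be sketched as follows. This is an illustrative model under stated assumptions, not the code from the fix: blocks are represented by their heights on a single chain, `flat_head_candidate` is a hypothetical helper, and the `None` case (no known bound) is an assumption of the sketch.

```rust
// Hypothetical model: every block is identified by its height on one canonical chain.

/// Picks the height to which the flat storage head may be moved.
///
/// `last_final_height`: height of the last final block.
/// `last_chunk_prev_height`: height of the block referenced by `prev_block_hash()`
/// of the shard's chunk in the last block of the epoch, if such a bound is known.
pub fn flat_head_candidate(last_final_height: u64, last_chunk_prev_height: Option<u64>) -> u64 {
    match last_chunk_prev_height {
        // Never move past the block preceding the shard's last chunk of the epoch.
        Some(bound) => last_final_height.min(bound),
        // Assumption of this sketch: with no bound, finality alone limits the head.
        None => last_final_height,
    }
}
```

For example, with the last final block at height 120 and the bound at height 100, the candidate stays at 100; if finality lags at height 90, the candidate is 90.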

Longarithm added the A-storage (Area: storage and databases) and A-resharding (Area: State resharding) labels on Jun 17, 2024

Longarithm mentioned this issue Jun 17, 2024

feat: strict flat storage update to fix GC issue #11599

Merged

github-actions bot mentioned this issue Jul 1, 2024

Monthly issue metrics report #11690

Open
