Description
Goals
Background
The goal of the resharding project is to implement fast and robust resharding.
Why should NEAR One work on this
Currently, the NEAR protocol has four shards. With more partners onboarding, we started seeing that some shards occasionally become overcrowded with respect to total state size and number of transactions. In addition, with state sync and stateless validation, validators will not need to track all shards, and validator hardware requirements can be greatly reduced with smaller shard sizes. With future in-memory tries, it's also important to limit the size of individual shards.
What needs to be accomplished
The implementation should be robust enough so that we can later use it in Phase 2. The implementation should also allow for shard deletion in the future - meaning that any changes to the trie and the storage should support fast deletion.
Main use case
Once the project is completed we should be able to manually schedule a resharding of the largest shards in mainnet and testnet and the resharding should smoothly take place without any disruptions to the network.
Links to external documentation and discussions
Assumptions
- Flat storage is enabled.
- The shard split boundary is predetermined and hardcoded. In other words, the necessity of shard splitting is decided manually.
- For the time being, resharding is only going to happen once as an event, but we would still like to have the infrastructure in place to handle future resharding events with ease.
- The Merkle Patricia Trie is the underlying data structure for the protocol state.
- The epoch is at least 6 hours long, giving resharding enough time to complete.
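The shard layout assigns each account to a shard by lexicographic comparison against an ordered list of boundary accounts, and splitting a shard amounts to inserting one more boundary into that list. A minimal Python sketch of the mapping (the boundary values here are illustrative, not the real mainnet layout):

```python
def account_to_shard(account_id: str, boundaries: list[str]) -> int:
    # Shard i holds accounts in [boundaries[i-1], boundaries[i]);
    # the shard index is the number of boundaries <= the account id.
    return sum(1 for b in boundaries if account_id >= b)

# Illustrative boundaries only -- not the actual mainnet shard layout.
old_boundaries = ["aurora", "kkuuue2akv_1630967379.near"]
new_boundaries = ["aurora", "kkuuue2akv_1630967379.near", "tge-lockup.sweat"]

print(account_to_shard("alice.near", old_boundaries))   # 0
print(account_to_shard("token.sweat", new_boundaries))  # 3
```

Adding one boundary only changes the assignment of accounts that previously lived in the split shard; all other accounts keep their position relative to the existing boundaries.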
Pre-requisites
- Resharding must be fast enough so that both state sync and resharding can happen within one epoch.
- Resharding should work efficiently within the limits of the current hardware requirements for nodes.
- Potential failures in resharding may require intervention from the node operator to recover.
- No transaction or receipt must be lost during resharding.
- Resharding must work regardless of number of existing shards.
- No apps, tools or code should hardcode the number of shards to 4.
Out of scope
- Dynamic resharding
- automatically scheduling resharding based on shard usage/capacity
- automatically determining the shard layout
- Merging shards or boundary adjustments
- Shard reshuffling
- Shard Layout determination logic. Shard boundaries are still determined offline and hardcoded.
Task list
mainnet release preparation
- support resharding on split storage archival nodes
- fix the bug when opening the snapshot
- add split storage support to mocknet (need help from the node team here)
- write a test for this case
- support node restart
- in the building state phase
- in the catch up phase
- in the post catch up phase
- add test coverage for all the above
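Restart support means each of the phases above has to be resumable: on startup the node reads the persisted resharding status and continues from there rather than starting over. A hedged sketch of such a status machine (the names are hypothetical, not nearcore's actual types):

```python
from enum import Enum

class ReshardingStatus(Enum):
    BUILDING_STATE = 1  # splitting the parent state into the child shards
    CATCHING_UP = 2     # applying missed blocks to the child shards
    POST_CATCHUP = 3    # children caught up, waiting for the epoch switch
    DONE = 4

# Allowed forward transitions; a restart re-enters at the persisted status.
NEXT = {
    ReshardingStatus.BUILDING_STATE: ReshardingStatus.CATCHING_UP,
    ReshardingStatus.CATCHING_UP: ReshardingStatus.POST_CATCHUP,
    ReshardingStatus.POST_CATCHUP: ReshardingStatus.DONE,
}

def resume(persisted: ReshardingStatus) -> list[ReshardingStatus]:
    """Return the remaining phases to execute after a node restart."""
    phases = [persisted]
    while phases[-1] in NEXT:
        phases.append(NEXT[phases[-1]])
    return phases

print(resume(ReshardingStatus.CATCHING_UP))
```

Testing restart in each phase then reduces to persisting each status value, killing the node, and checking that `resume` picks up from exactly that phase.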
mainnet release
- ensure lake nodes are upgraded
- pause backups to avoid unnecessary restarts
- check the health of state dumpers zulip
implementation
- [Resharding] Remove the fixed shards field from the shard layout #9202
- [resharding] Implement support for flat storage changes #9418
- Implement scheduling resharding logic - predefined hardcoded boundary
- [resharding] Implement building new tries by using flat storage. #9417
- Use flat storage at correct head
- check if we can enable flat storage snapshots on archival nodes
- yes zulip
- consider using an on-demand snapshot instead of the state sync snapshot
- We decided to re-use the existing state sync snapshot solution due to simplicity
- enable snapshot on archival nodes either permanently or conditionally for resharding
- delete snapshot on archival nodes after resharding is done to minimize storage overhead #10149
- delay resharding until state snapshot is ready
- test it
- ensure flat storage snapshot exists or handle it missing
- Fix outgoing receipts reassignment at shard layout switch
- Fix incoming receipts reassignment at shard layout switch
- Fix transaction pool losing transactions at shard layout switch
- Fix gas and balance reassignment at shard layout switch
- Implement* triggering resharding after catchup
- Implement* applying blocks to both old shard and new shards
- Implement* shard layout switch at epoch boundary
- Clean up old state and flat state after resharding is finished #10140
- @shreyan-gupta is the owner
- Pick a new boundary account to split an existing shard
- feat(resharding): implemented state-stats state-viewer command #10094
- feat(resharding): state-stats improvements #10119
- feat(resharding): added state split statistics to the state stats tool #10197
- feat: add database analyse-gas-usage command #10240
- tge-lockup.sweat is a good boundary from the storage point of view and splits the shard 36:30:32
- token.sweat is a good boundary from the gas usage point of view and splits the shard 27:45:27
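Picking the boundary account amounts to scanning per-account statistics (state size or gas usage, as produced by the state-stats tooling above) and choosing the account that splits the cumulative weight most evenly. A simplified two-way sketch with made-up numbers:

```python
def split_ratio(accounts: list[tuple[str, int]], boundary: str) -> tuple[int, int]:
    """Weight landing left (< boundary) and right (>= boundary) of a candidate split."""
    left = sum(w for a, w in accounts if a < boundary)
    right = sum(w for a, w in accounts if a >= boundary)
    return left, right

# Made-up per-account weights (e.g. bytes of state), sorted by account id.
stats = [("alice.near", 30), ("token.sweat", 50), ("zoo.near", 20)]

print(split_ratio(stats, "token.sweat"))  # (30, 70)
```

The storage-based and gas-based rankings can disagree, as the two candidate boundaries above show, which is why both statistics were collected before picking one.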
- Relevant security vulnerability https://github.com/near/nearcore-private/issues/67
- Fix the vulnerability fix: check shard_id early #10055
- Benchmark and throttle resharding to not impact node operation
- Testing and stabilization
- Test resharding in mocknet or betanet
- small network no traffic
- small network with traffic
- Test resharding on real rpc and archival nodes in mainnet and testnet
- Extend existing resharding tests to cover the V1->V2 migration
- feat: adding an integration test for the V1->V2 resharding #9345
- feat: split the simple resharding test to separately test the V1 and V2 reshardings #9373
- feat: adding cross_contract_calls integration test for the V1->V2 resharding #9402
- test: implemented a test for handling incoming receipts during resharding #9467
- test missing chunk in the first block of new shard layout zulip
- handling error and corner cases
- Querying historical transaction status should be compatible with re-sharding #5047
Operational
- NEP
- Check if a new NEP is needed zulip
- Write the NEP draft
- SME review
- protocol working group meeting
- @walnut-the-cat is the owner
- build monitoring and dashboard
- Handle the increase in expected workload for validators
- Update public documentation on resharding: https://near.github.io/nearcore/architecture/how/resharding.html
- @wacban is the owner
- update the docs
- Once the dynamic config is merged, update the instructions in throttling section.
- Once the NEP is merged update the link to it - currently the PR is linked.
- Make the block production the same in testnet and mainnet
- @posvyatokum is the owner
- Make throttling a dynamic config that can be updated without restarting the node
- [resharding] Add pytest for checking RPC calls after resharding #10296
- fix(resharding): fix dynamic config #10297
- test in mocknet - visible speed up in the batch size graph - https://nearinc.grafana.net/goto/aXE8CeNIR?orgId=1
- Add support for pseudo archival nodes in mocknet and use it in pre-release testings
- @posvyatokum is the owner
- Stress test resharding in mainnet and testnet
- mainnet rpc, mainnet archival, testnet rpc, testnet archival
- resharding takes about 3h00m in mainnet and 5h30min in testnet
- Test failure recovery
- start resharding in mocknet, crash a node in the middle of resharding, restart the node, make sure resharding still completes and all is fine - https://nearinc.grafana.net/goto/aAosJBvSR?orgId=1
- Notify all tools, infra, etc. developers so that they are aware of and prepared to handle the new shard layout.
- Prepare an RPC endpoint for querying the shard layout and the number of shards
- EXPERIMENTAL_protocol_config contains shard layout
- test that it works as expected
- document how to use it
- create a dedicated endpoint just for querying the shard layout
- EXPERIMENTAL_protocol_config contains shard layout
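Until a dedicated endpoint exists, the shard layout can be read from the EXPERIMENTAL_protocol_config JSON-RPC method. A sketch of the request body and of pulling the boundary accounts out of a response (the response snippet is a trimmed, illustrative shape, not a real capture):

```python
import json

# JSON-RPC request body for EXPERIMENTAL_protocol_config at the final block.
request = {
    "jsonrpc": "2.0",
    "id": "dontcare",
    "method": "EXPERIMENTAL_protocol_config",
    "params": {"finality": "final"},
}

def boundary_accounts(protocol_config: dict) -> list[str]:
    """Extract the boundary accounts from the versioned shard layout.

    Assumes a response shape like {"shard_layout": {"V1": {"boundary_accounts": [...]}}}.
    """
    layout = protocol_config["shard_layout"]
    inner = next(iter(layout.values()))  # unwrap the single version key
    return inner["boundary_accounts"]

# Trimmed example of what the "result" field may look like.
sample_result = {"shard_layout": {"V1": {"boundary_accounts": ["aurora", "tge-lockup.sweat"]}}}
print(json.dumps(request))
print(boundary_accounts(sample_result))
```

The number of shards is then `len(boundary_accounts) + 1`, which is what tools should compute instead of hardcoding 4.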
- Notify any other tools, infra, etc., developers and SRE about resharding
- slack general https://pagodaplatform.slack.com/archives/C01R19A6NHX/p1700517714470499
- slack near-releases https://pagodaplatform.slack.com/archives/C02C1NLNVL4/p1701781284466569
- slack sre-support
- zulip general
- devhub
- near social
- telegram NEAR protocol community group
- telegram NEAR Tools community group
- telegram NEAR Dev group
- Discord Dev-updates
- explorer
- databricks
- from slack "from the Databricks side we should be good if the S3 files continues with the same name pattern shard.json"
- query api
- Make resharding interruptible - currently SIGINT is not enough to stop the node in the middle of resharding
- Stabilize
Code Quality improvements
- Fix the shard id type inconsistency between ShardUId and ShardId. A proper fix may require a db migration, but perhaps we can fix it at the db api level.
- Rename from "state split" to "resharding".
- refactor(resharding): renaming resharding code from "state split" to "resharding" #10393
- rename the db column - see this comment
Delayed until after the first rollout
- localize resharding only to relevant shards and improve shard naming
- Change the way shards are identified so that when resharding we only need to touch the relevant shards.
- Today due to having version in the ShardUId we always need to "reshard" all shards even if only one is getting split.
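The issue comes from ShardUId being a (version, shard_id) pair: the layout version is part of every shard's storage key prefix, so bumping it on a split changes the identifier of every shard, including the untouched ones. A toy illustration (the field layout is simplified):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ShardUId:
    version: int   # shard layout version, part of every storage key prefix
    shard_id: int

def reshard_all(uids: list[ShardUId], new_version: int) -> list[ShardUId]:
    # Bumping the layout version rewrites every ShardUId, even for shards
    # that were not split -- this is the behavior the task wants to avoid.
    return [ShardUId(new_version, u.shard_id) for u in uids]

old = [ShardUId(1, 0), ShardUId(1, 1)]
new = reshard_all(old, 2)
print(all(o != n for o, n in zip(old, new)))  # True: every uid changed
```

A version-free (or per-shard-versioned) identifier would let an unsplit shard keep its storage keys across a resharding.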
- state sync and resharding integration
- resharding discussion, Madrid Offsite
- resharding and state sync issue, zulip
- fix resharding of incoming receipt proofs
- fix or pause state dumping during resharding + tests zulip
- stateless validation and resharding integration (zulip)
- in memory trie and resharding integration
- Add a provisional V3 resharding in nightly to have the resharding tests check the stable -> nightly transition.
- Set the trie_cache size based on account rather than shard uids so that we don't need to update it with every resharding code ref
Brainstorming COMPLETED
- Brainstorm ideas.
- Evaluate the available solutions.
- [Resharding] Estimate cost of the Trie Shallow Copying solution #9101
- The results are promising but the type boundary split is expected to be even faster.
- [Resharding] Offline prototype for using Flat Storage to reconstruct trie #9105
- The results are promising but the type boundary split is expected to be even faster.
- [P0][Resharding] Investigate the feasibility of shard deletion by reference counting #9198
- seems possible but quite complex and bug prone
- can't directly delete shard
- best approach is to GC back till the flat storage head and delete remaining data using flat storage
- [Resharding] Investigate the feasibility of the <account id><type> trie structure. #9199
- This trie structure is feasible from the point of view of the current trie usage.
- There is a minor blocker but it can be resolved relatively easy.
- This solution should be evaluated once implemented to ensure that it doesn't introduce a performance regression.
- [P0][Resharding] Investigate the feasibility of the <account id><node hash> storage structure #9200
- Would require a dedicated, separate solution to address deduplication for all trie data types.
- Would be nice for contracts with large data as it would put it close to each other in the trie.
- [Resharding] Benchmark the type boundary split algorithm #9201
- no longer needed
- [P2][Resharding] Investigate the migration feasibility of <account id> <type> #9203
- no longer needed
- [P0][Resharding] Investigate the migration feasibility of <account id> <node hash> #9204
- Migration is not feasible on archival nodes.
- Archival nodes may be getting deprecated by read-rpc but it won't be soon.
- May still be possible by maintaining both implementations and keeping old data in the old format.
- [P1][Resharding] Investigate the migration feasibility of <node hash> #9205
- Likely possible but need to check how long it would take on archival nodes.
- [P2][Resharding] Investigate the resharding feasibility of <shard uid> <node hash> #9206
- no owner
- Investigate what's the best way to represent shards across resharding
- no owner
- Do we want to keep shard_uid or is there anything better?
- Select the best solutions
- Select the best solution for the trie structure and resharding.
- We chose to keep the existing trie structure and use flat storage for resharding.
- Select the best solution for the storage structure and deletion.
- We chose to keep the existing storage structure and use range deletion for deletion.
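Range deletion works because each shard's keys share a common ShardUId prefix, so an entire shard can be dropped with one delete over the key range [prefix, next_prefix) instead of reference-counted per-node deletes. A sketch over a sorted key-value map (RocksDB's DeleteRange behaves analogously):

```python
def delete_range(store: dict, start: bytes, end: bytes) -> None:
    # Delete every key k with start <= k < end, like a storage range delete.
    for k in [k for k in store if start <= k < end]:
        del store[k]

store = {
    b"\x01node-a": b"...",  # keys with shard prefix 0x01 are deleted
    b"\x01node-b": b"...",
    b"\x02node-c": b"...",  # shard prefix 0x02 survives
}
delete_range(store, b"\x01", b"\x02")
print(sorted(store))  # [b'\x02node-c']
```

This is what makes fast shard deletion compatible with the "GC back to the flat storage head, then delete the rest via flat storage" approach noted in the investigation above.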
'*' - implement or verify existing implementation
Metadata
Status: Done