
🔷 [Tracking issue] Resharding v2 #8992

Closed
@robin-near

Description

Goals

Background

The goal of the resharding project is to implement fast and robust resharding.

Why should NEAR One work on this

Currently, the NEAR protocol has four shards. As more partners onboard, some shards occasionally become overcrowded in terms of total state size and number of transactions. In addition, with state sync and stateless validation, validators will not need to track all shards, and smaller shards can greatly reduce validator hardware requirements. With future in-memory tries, it is also important to limit the size of individual shards.

What needs to be accomplished

The implementation should be robust enough so that we can later use it in Phase 2. The implementation should also allow for shard deletion in the future - meaning that any changes to the trie and the storage should support fast deletion.

Main use case

Once the project is completed, we should be able to manually schedule a resharding of the largest shards in mainnet and testnet, and the resharding should take place smoothly without any disruption to the network.

Links to external documentation and discussions

Assumptions

  • Flat storage is enabled.
  • Shard split boundary is predetermined and hardcoded. In other words, the need for a shard split is decided manually.
  • For the time being, resharding as an event is only going to happen once, but we would still like to have the infrastructure in place to handle future resharding events with ease.
  • Merkle Patricia Trie is the underlying data structure for the protocol state.
  • Epoch is at least 6 hours long, so that resharding can complete within it.
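
To illustrate the "predetermined and hardcoded boundary" assumption, here is a minimal sketch of how a fixed, sorted list of boundary accounts could map an account to a shard, and how a split adds one boundary. The function names, the boundary values and the rule that an account equal to a boundary lands in the upper shard are all assumptions made for illustration, not the nearcore implementation.

```python
from bisect import bisect_right
from typing import List


def account_to_shard(account_id: str, boundary_accounts: List[str]) -> int:
    """Return the index of the shard that holds account_id.

    boundary_accounts must be sorted. N boundaries define N + 1 shards.
    Assumption: an account equal to a boundary belongs to the upper shard.
    """
    return bisect_right(boundary_accounts, account_id)


def split_shard(boundary_accounts: List[str], new_boundary: str) -> List[str]:
    """Return a new sorted boundary list with one extra, manually chosen boundary."""
    if new_boundary in boundary_accounts:
        raise ValueError("boundary already exists: " + new_boundary)
    return sorted(boundary_accounts + [new_boundary])


# Hypothetical boundaries giving four shards (not the real mainnet values).
OLD_BOUNDARIES = ["bbb.near", "mmm.near", "ttt.near"]
# A manually decided split of shard 1 at a hypothetical account.
NEW_BOUNDARIES = split_shard(OLD_BOUNDARIES, "ggg.near")
```

In this sketch the number of shards is always derived from the boundary list (`len(boundary_accounts) + 1`) instead of being written down as a constant, which is the property the pre-requisites below ask of apps and tools.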

Pre-requisites

  • Resharding must be fast enough so that both state sync and resharding can happen within one epoch.
  • Resharding should work efficiently within the limits of the current hardware requirements for nodes.
  • Potential failures in resharding may require intervention from the node operator to recover.
  • No transaction or receipt may be lost during resharding.
  • Resharding must work regardless of the number of existing shards.
  • No apps, tools or code should hardcode the number of shards to 4.

Out of scope

  • Dynamic resharding
    • automatically scheduling resharding based on shard usage/capacity
    • automatically determining the shard layout
  • Merging shards or boundary adjustments
  • Shard reshuffling
  • Shard Layout determination logic. Shard boundaries are still determined offline and hardcoded.

Task list

mainnet release preparation

  • support resharding on split storage archival nodes
    • fix the bug when opening the snapshot
    • add split storage support to mocknet (need help from the node team here)
    • write a test for this case
  • support node restart
    • in the building state phase
    • in the catch up phase
    • in the post catch up phase
    • add test coverage for all the above

mainnet release

  • ensure lake nodes are upgraded
  • pause backups to avoid unnecessary restarts
  • check the health of state dumpers (zulip)

implementation

Operational

Code Quality improvements

Delayed until after the first rollout

  • localize resharding only to relevant shards and improve shard naming
    • Change the way shards are identified so that when resharding we only need to touch the relevant shards.
    • Today, because the ShardUId includes a version, we always need to "reshard" all shards even if only one is getting split.
  • state sync and resharding integration
  • stateless validation and resharding integration (zulip)
  • in memory trie and resharding integration
  • Add a provisional V3 resharding in nightly to have the resharding tests check the stable -> nightly transition.
  • Set the trie_cache size based on accounts rather than shard uids so that we don't need to update it with every resharding (code ref)
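
The shard naming item above can be sketched as follows. The first part models a shard identifier as a (version, shard_id) pair and shows that bumping the version changes every identifier, including those of shards that are not split. The second part is a purely hypothetical alternative in which only the children of the split shard receive new identifiers. The class and function names are illustrative assumptions, not the nearcore API or the design that was eventually chosen.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ShardUId:
    """Illustrative shard identifier: layout version plus shard index."""
    version: int
    shard_id: int


def versioned_uids(version: int, num_shards: int) -> List[ShardUId]:
    """All shard identifiers for a layout with the given version."""
    return [ShardUId(version, i) for i in range(num_shards)]


def split_with_stable_ids(shard_ids: List[int], parent: int) -> List[int]:
    """Hypothetical scheme: replace only the parent with two fresh ids."""
    next_id = max(shard_ids) + 1
    result: List[int] = []
    for shard_id in shard_ids:
        if shard_id == parent:
            # Children take the parent's place in the layout order.
            result.extend([next_id, next_id + 1])
        else:
            result.append(shard_id)
    return result


# Versioned scheme: going from 4 shards (v1) to 5 shards (v2) keeps no uid.
unchanged_versioned = set(versioned_uids(1, 4)) & set(versioned_uids(2, 5))

# Stable scheme: splitting shard 3 leaves shards 0, 1 and 2 untouched.
after_split = split_with_stable_ids([0, 1, 2, 3], parent=3)
```

Under the versioned scheme, any data keyed by the identifier has to be rewritten for every shard, while the stable scheme keeps the keys of unaffected shards valid across the resharding.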

Brainstorming COMPLETED

Metadata

Labels

  • A-chain (Area: Chain, client & related)
  • C-tracking-issue (Category: a tracking issue)
  • Epic
  • Near Core
  • T-core (Team: issues relevant to the core team)
