Stateful -> stateless migration preparation #65

Closed · Tracked by #46
walnut-the-cat opened this issue Apr 10, 2024 · 9 comments

walnut-the-cat commented Apr 10, 2024

Identify the right way to enable memtrie during the stateful-to-stateless validation protocol upgrade.
Before the upgrade, all nodes track all shards; after the upgrade, they will start tracking only one or a few shards. Memtrie is expected to be enabled only when tracking a single shard or a few shards, since it requires significantly more memory. However, we also need to think about how to make the transition from (stateful + disk tries) to (stateless + memtries): during the transition, we may need to run nodes with memtries while tracking all shards. Assuming this is the path we follow, we need two follow-ups:

  1. Profile the memory usage needed to do that (a similar task was opened for RPC nodes tracking all shards:
    Profile memory usage requirements of RPC and archival nodes in stateless validation nearcore#11230)

  2. Make sure that after the protocol upgrade, memtries for shards not tracked in the new epoch are unloaded, leaving each node with memtries only for its tracked shards (see the sketch below).

Related thread here.
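As a rough illustration of follow-up 2, here is a minimal sketch with hypothetical types and names; nearcore's actual memtrie container and API differ:

```rust
use std::collections::{HashMap, HashSet};

/// Stand-in for nearcore's in-memory trie; illustrative only.
struct MemTrie;

/// At the first epoch boundary after the upgrade, keep only the memtries
/// for shards tracked in the new epoch; dropping the rest frees their memory.
fn unload_untracked_memtries(
    loaded: &mut HashMap<u64, MemTrie>, // shard id -> loaded memtrie
    tracked_shards: &HashSet<u64>,      // shards tracked in the new epoch
) {
    loaded.retain(|shard_id, _| tracked_shards.contains(shard_id));
}
```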

walnut-the-cat (Author) commented

Need to think about how and when the in-memory trie will be enabled.

robin-near commented May 7, 2024

Here are the options for dealing with the memtrie launch during the stateful -> stateless migration:

  1. (All, then One) Enable memtries first, confirming that memtries are loaded for all shards; then, after the protocol upgrade, one memtrie remains in memory (for the assigned shard) while the others are unloaded. The downside of this approach is that all stateless chunk producers temporarily need high-memory instances before the upgrade. (A policy sketch follows this list.)
  2. (Assigned Shard Only) Enable memtries first, but modify the logic so that even in the stateful case, a node loads only the shard it is assigned to (or, if it is not a validator, all shards). This makes it consistent with the stateless case. However, it may be difficult to implement. Suppose in the stateful case we have epochs E and E + 1, where in E we are assigned shard 1 and in E + 1 we are assigned shard 2. While in epoch E, we need to find some time to load shard 2 into memory in preparation for the next epoch, but since we are still tracking shard 2, its trie keeps changing. So we have two options:
    • (Concurrent Load) Modify the memtrie loading code so that it supports loading a "hot shard", i.e. a shard undergoing concurrent changes; this may or may not be easy, but it at least involves freezing the flat storage, as we do when taking snapshots.
    • (Forced State Sync) Even though we track shard 2, we force it to go through state sync: stop tracking the shard at the beginning of epoch E, pretend to do a state sync (a no-op, since we already have the data), and then enter catchup. During catchup we automatically load the memtrie first, then catch up until the shard is up to date, at which point we track it again and everything continues normally. The problem is that while we wait for the memtrie to load, this validator is not validating the shard, which may become a security issue (though with 6 shards, is that really a problem?).
  3. (None, then One) Enable memtries first, but modify the logic to only load memtries under the stateless protocol. The challenge here is similar to Option 2, but it only exists for the very last epoch before the protocol upgrade, because in preparation for the first stateless epoch we need to load the memtrie for one shard.
  4. (None, then None, then One) Enable memtries first, but modify the logic to only load memtries starting from the second epoch after the protocol upgrade. This makes the implementation easy, but the first epoch after the protocol upgrade will not have memtries and so will have degraded performance.
  5. (None, then Manual) Do not enable memtries before the protocol upgrade; ask node operators to enable memtries after the upgrade has passed. This is like Option 4, except that chunk producers will be degraded until they take action themselves.
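For concreteness, a minimal sketch of the Option 1 policy, assuming a hypothetical stateless protocol version parameter and illustrative names:

```rust
/// Option 1 ("All, then One"): before the stateless protocol version, load
/// memtries for every shard; from the upgrade onward, only for the shards
/// the node is assigned. All identifiers here are assumptions.
fn shards_to_load_memtries(
    protocol_version: u32,
    stateless_version: u32, // hypothetical version that enables stateless validation
    all_shards: &[u64],
    assigned_shards: &[u64],
) -> Vec<u64> {
    if protocol_version < stateless_version {
        // Stateful era: every node tracks all shards, so load all memtries.
        // This is the temporary high-memory period before the upgrade.
        all_shards.to_vec()
    } else {
        // Stateless era: keep only the assigned shard(s); the rest are
        // unloaded at the epoch boundary.
        assigned_shards.to_vec()
    }
}
```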

@bowenwang1996 I heard you're in favor of Option 1; could you confirm that it is still the best option, after considering the other options available?

Regardless of the option picked, we would need to repeatedly test this protocol upgrade.

bowenwang1996 commented

@robin-near yes, I still think Option 1 is the best. Options 2 and 3 are too complex and add not just engineering complexity but also testing burden. Options 4 and 5 would likely degrade performance, and we cannot control whether there will be high load on mainnet when the upgrade happens, so it is best to avoid performance degradation altogether.

tayfunelmas commented

Based on today's discussion, unloading memtries will not be necessary, since validators will need to restart their nodes anyway to downgrade the RAM size.

staffik commented Jun 25, 2024

Unloading memtrie is done: near/nearcore#11657.

staffik commented Jun 27, 2024

It seems we have "shadow tracking" already implemented: https://github.com/near/nearcore/blob/master/core/chain-configs/src/client_config.rs#L412
I will test how it works with memtries and validator key hot swap.
cc @tayfunelmas @wacban
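For reference, a rough sketch of what that part of the config looks like; the field names are from memory and should be treated as assumptions rather than a verbatim copy of client_config.rs:

```rust
// Illustrative sketch of the tracking knobs in
// core/chain-configs/src/client_config.rs; not nearcore's exact struct.
type ShardId = u64;
type AccountId = String;

pub struct ClientConfigSketch {
    /// Shards tracked unconditionally, by id.
    pub tracked_shards: Vec<ShardId>,
    /// Track whichever shard currently contains each of these accounts.
    /// As the next comment notes, this follows the account's shard, not
    /// the shards assigned to a validator with that account id.
    pub tracked_accounts: Vec<AccountId>,
}
```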

wacban commented Jun 27, 2024

I think this may track the shard where the account id is located, not the shards that the validator with this account id would track. Either way, this is a good find and definitely related; perhaps we can reuse some of it for our purpose?

Also the name is way less catchy than shadow tracking ;)

staffik commented Jul 1, 2024

2024-07-01 (Monday) Update

  • We did the migration twice, including RPC and split-storage archival nodes, and everything looks ok.
  • Validator key hot swap + shadow tracking works fine, near-zero missed chunks.
  • Described results in a doc.
  • What's left: we will see how it works with reduced network bandwidth.

github-merge-queue bot pushed a commit to near/nearcore that referenced this issue Jul 5, 2024
Part of: near/near-one-project-tracking#65
An option for a non-validator node to track the shards of a given validator.

During the stateful -> stateless protocol upgrade, a node will track all
shards and will require a lot of RAM. After the migration, we can move
the validator key to a new, smaller node that does not track all shards.
To do this with minimal downtime, the new node needs to have the
appropriate shards in place and memtries loaded in memory; then we hot
swap the validator key without stopping the new node.
But before that happens, the new node is not a validator, and we need a
way to tell it which validator's shards it should track.
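A hedged sketch of the behavior this option adds; all identifiers below are illustrative, not the actual nearcore ones:

```rust
type ShardId = u64;
type AccountId = String;

/// If the node is configured to shadow a validator, track whatever shards
/// that validator is assigned in the current epoch, so state and memtries
/// are already warm when the validator key is hot-swapped in.
fn shards_to_track<F>(
    shadow_validator: Option<&AccountId>,
    validator_assignment: F, // the epoch's validator -> assigned-shards mapping
    default_tracked: Vec<ShardId>,
) -> Vec<ShardId>
where
    F: Fn(&AccountId) -> Vec<ShardId>,
{
    match shadow_validator {
        Some(v) => validator_assignment(v), // shadow the validator's shards
        None => default_tracked,            // normal tracking config
    }
}
```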
VanBarbascu pushed a commit to near/nearcore that referenced this issue Jul 6, 2024
staffik closed this as completed Aug 2, 2024