
Limit memory consumption of flat storage deltas #8436

Closed
Tracked by #7327
exalate-issue-sync bot opened this issue Jan 25, 2023 · 6 comments
exalate-issue-sync bot commented Jan 25, 2023

Summary

With flat storage, we rely on the fact that the size of recent state updates stored in memory is bounded. Otherwise, if blocks are not finalized for a long time, all flat storage deltas will be kept in memory and the node will crash. Moreover, on restart it will try to load the deltas into memory and crash again.

The idea is to store only a limited number of deltas in memory. Based on storage costs, the size of a single delta is bounded by ~50 MB for 4 shards. The main contributor is key length, because we store ValueRefs as values, which have limited size. If we store only 10 deltas, memory usage is at most 500 MB. Other deltas will be read from disk.

The weakness of this solution is that if there is no finality, reading deltas from disk may slow block production down significantly. On the other hand, if there is no finality for >100 blocks, we are likely to have other issues with the chain and some manual intervention will be required. So the plan is to check whether reading 100 deltas from disk introduces a slowdown, and if it does not, we proceed with this approach.

Otherwise we can implement more heuristics, like "if a block is too far from the last final block, set its gas limit to zero", which almost guarantees empty state updates: #8437
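For illustration, here is a minimal sketch of the lookup path this implies: check the bounded set of in-memory deltas first, fall back to disk for older deltas, and finally read the flat state itself. This is not the actual nearcore code; the types and methods (`read_delta_from_disk`, `read_flat_state`, the shape of `ValueRef`) are hypothetical.

```rust
// Minimal sketch of the bounded-deltas lookup path; names are illustrative,
// not the actual nearcore FlatStorage API.
use std::collections::HashMap;

type TrieKey = Vec<u8>;

#[derive(Clone)]
struct ValueRef {
    hash: [u8; 32],
    length: u32,
}

struct FlatStorage {
    /// Deltas for the most recent blocks, newest first, bounded (e.g. 10 entries).
    cached_deltas: Vec<HashMap<TrieKey, Option<ValueRef>>>,
}

impl FlatStorage {
    fn get_ref(&self, key: &TrieKey) -> Option<ValueRef> {
        // 1. Check the bounded in-memory deltas, newest block first.
        //    `Some(None)` in a delta means the key was deleted in that block.
        for delta in &self.cached_deltas {
            if let Some(value) = delta.get(key) {
                return value.clone();
            }
        }
        // 2. Older deltas (when finality lags beyond the cache) are read from disk.
        if let Some(value) = self.read_delta_from_disk(key) {
            return value;
        }
        // 3. Fall back to the flat state at the flat storage head.
        self.read_flat_state(key)
    }

    fn read_delta_from_disk(&self, _key: &TrieKey) -> Option<Option<ValueRef>> {
        unimplemented!("disk-backed delta lookup")
    }

    fn read_flat_state(&self, _key: &TrieKey) -> Option<ValueRef> {
        unimplemented!("flat state lookup at flat storage head")
    }
}
```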

Assumptions

Upper bound on the number of updates in a single chunk is 20k: block_gas_limit / wasm_storage_write_base = 1300 Tgas / 64 Ggas ≈ 20k (calculations from the NEP).

Max key size is 2 KB.

Approaches

Naive

We can cache a HashMap<TrieKey, ValueRef> per block, which results in a max delta size of 50 MB.
It means that storing deltas for 20 such blocks would take 1 GB of RAM.

Use key hash

Let's use the TrieKey hash instead of the whole key, so we store 32 bytes per key regardless of its size. This way the max delta size for a single block is 20k * (32 B + 32 B) = 1.28 MB.
TODO (@pugachAG): check hashmap overhead.
Storing 200 deltas would take 1.28 MB * 200 = 256 MB.
This is the approach we proceed with for the MVP.
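A minimal sketch of this delta representation, assuming a SHA-256 hash of the trie key (nearcore's CryptoHash is SHA-256 based); the names here are illustrative, not the actual implementation.

```rust
// Sketch of a per-block delta keyed by the 32-byte hash of the trie key
// instead of the full key (up to 2 KB), so each entry costs ~64 B in memory.
// Names and types are illustrative, not the nearcore implementation.
use sha2::{Digest, Sha256};
use std::collections::HashMap;

#[derive(Clone)]
struct ValueRef {
    hash: [u8; 32],
    length: u32,
}

#[derive(Default)]
struct HashedDelta(HashMap<[u8; 32], Option<ValueRef>>);

impl HashedDelta {
    fn key_hash(trie_key: &[u8]) -> [u8; 32] {
        let mut out = [0u8; 32];
        out.copy_from_slice(&Sha256::digest(trie_key));
        out
    }

    fn insert(&mut self, trie_key: &[u8], value: Option<ValueRef>) {
        self.0.insert(Self::key_hash(trie_key), value);
    }

    /// `Some(None)` means the key was deleted in this block,
    /// outer `None` means this delta does not touch the key.
    fn get(&self, trie_key: &[u8]) -> Option<Option<ValueRef>> {
        self.0.get(&Self::key_hash(trie_key)).cloned()
    }
}
```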

Use Bloom filter

With this approach we keep a Bloom filter over the keys of each delta (see the Zulip thread). This way we can check whether the value was updated in each chunk and, if so, read the updated value from rocksdb.
Assuming 16 bits = 2 B per key for the Bloom filter, this would require 20k * 2 B = 40 KB per delta, so we can keep deltas for multiple thousands of blocks.
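A rough sketch of what such a per-delta Bloom filter could look like, assuming ~16 bits per key and a simple double-hashing scheme; this is illustrative only, not the implementation discussed in the Zulip thread.

```rust
// Illustrative per-delta Bloom filter: ~16 bits per expected key, so for
// 20k keys roughly 40 KB per delta. Uses std hashers with double hashing.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct DeltaBloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl DeltaBloom {
    fn new(expected_keys: usize) -> Self {
        let num_bits = (expected_keys as u64 * 16).max(64);
        Self { bits: vec![0; (num_bits as usize + 63) / 64], num_bits, num_hashes: 7 }
    }

    fn indices(&self, key: &[u8]) -> impl Iterator<Item = u64> {
        let num_bits = self.num_bits;
        let mut h1 = DefaultHasher::new();
        key.hash(&mut h1);
        let h1 = h1.finish();
        let mut h2 = DefaultHasher::new();
        (key, 0xdead_beef_u64).hash(&mut h2);
        let h2 = h2.finish() | 1;
        (0..self.num_hashes as u64).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % num_bits)
    }

    fn insert(&mut self, key: &[u8]) {
        let idxs: Vec<u64> = self.indices(key).collect();
        for idx in idxs {
            self.bits[(idx / 64) as usize] |= 1 << (idx % 64);
        }
    }

    /// `false` means the key is definitely not in this delta, so the delta can
    /// be skipped without touching rocksdb; `true` may be a false positive.
    fn may_contain(&self, key: &[u8]) -> bool {
        self.indices(key).all(|idx| self.bits[(idx / 64) as usize] & (1 << (idx % 64)) != 0)
    }
}
```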

Rely on rocksdb Bloom filters

With this approach we do not cache anything in memory and call rocksdb for every delta lookup. This relies on rocksdb-provided Bloom filters to quickly determine that a key is absent from the deltas for a given block.
This approach requires careful rocksdb tuning. With our current rocksdb configuration and synthetic data (500 blocks, 10k entries in each block, 300 B key size), checking whether a key is present in the deltas for a fixed block takes 0.8-3 μs on average and depends heavily on block cache size and compaction. So relying on that is too risky.
In the long term this seems like the most reasonable approach, but also the most time-consuming, as it requires a deep understanding of rocksdb internals as well as a lot of experimentation with both real-world and edge-case data.
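For context, a hedged sketch of the kind of tuning involved, using the rust-rocksdb crate; exact option names and signatures differ between crate versions, so this is illustrative rather than the configuration used in the experiment.

```rust
// Illustrative rocksdb tuning for Bloom-filter-backed point lookups.
// Option names/signatures vary across rust-rocksdb versions; treat as a sketch.
use rocksdb::{BlockBasedOptions, Options, DB};

fn open_deltas_db(path: &str) -> Result<DB, rocksdb::Error> {
    let mut block_opts = BlockBasedOptions::default();
    // Per-SST Bloom filters let a point lookup skip files that cannot
    // contain the key; ~10 bits per key is a common default.
    block_opts.set_bloom_filter(10.0, false);
    // Keep index and filter blocks in the block cache so they are not
    // re-read from disk on every lookup.
    block_opts.set_cache_index_and_filter_blocks(true);

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_block_based_table_factory(&block_opts);
    DB::open(&opts, path)
}
```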

@Longarithm changed the title from "Limit memory consumption of deltas" to "Limit memory consumption of flat storage deltas" on Jan 25, 2023
@Longarithm
Member

After experimenting, I realized that we need to get rid of get_deltas_between_blocks and change the DB format of deltas.

The current logic is naive: for each get it reads all deltas between the flat storage head and the current block and checks the existence of the key in each delta. In the extreme case where there is no finality and every state update takes 50 MB, this would occupy too much memory. I think it is fine to give 1 GB of RAM to flat storage, but >20 blocks would exceed that.

Proposal - change Delta format from (block_hash, shard_id) -> HashMap<Key, Value> to (block_hash, shard_id, key) -> Value.

  • Reading delta: DB.iter_prefix((block_hash, shard_id))
  • Writing delta: for k, v in delta: DB.write((block_hash, shard_id, k), v)
  • Removing delta: DB.delete_range((block_hash, shard_id, &[0]), (block_hash, shard_id, &[127]))
  • Get ValueRef from flat storage: iterate over parents and call DB.get

We cache up to ~10 deltas anyway, so for 99.9% of the time there are no additional disk reads.
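A minimal sketch of the proposed per-key layout (illustrative key encoding and function names, assuming the rust-rocksdb crate; not the actual nearcore column format):

```rust
// Sketch of the (block_hash, shard_id, key) -> value delta layout.
// Key encoding and names are illustrative only.
use rocksdb::DB;

type CryptoHash = [u8; 32];

/// DB key: block_hash ++ shard_id ++ trie_key, so all entries of one delta
/// share the (block_hash, shard_id) prefix and can be read or deleted by range.
fn delta_db_key(block_hash: &CryptoHash, shard_id: u64, trie_key: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(32 + 8 + trie_key.len());
    key.extend_from_slice(block_hash);
    key.extend_from_slice(&shard_id.to_le_bytes());
    key.extend_from_slice(trie_key);
    key
}

/// Writing a delta: one rocksdb entry per updated trie key
/// (values here are serialized ValueRefs as opaque bytes).
fn write_delta(
    db: &DB,
    block_hash: &CryptoHash,
    shard_id: u64,
    delta: &[(Vec<u8>, Vec<u8>)],
) -> Result<(), rocksdb::Error> {
    for (trie_key, value) in delta {
        db.put(delta_db_key(block_hash, shard_id, trie_key), value)?;
    }
    Ok(())
}

/// Get ValueRef from flat storage: walk from the current block towards the
/// flat storage head and return the first delta entry that touches the key.
fn get_from_deltas(
    db: &DB,
    blocks_to_head: &[CryptoHash],
    shard_id: u64,
    trie_key: &[u8],
) -> Result<Option<Vec<u8>>, rocksdb::Error> {
    for block_hash in blocks_to_head {
        if let Some(value) = db.get(delta_db_key(block_hash, shard_id, trie_key))? {
            return Ok(Some(value));
        }
    }
    Ok(None)
}
```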

@mzhangmzz
Contributor

@Longarithm My main concern here is that this may lead to drastic performance degradation when there are many unfinalized blocks in the system. That may push the system into a bad feedback loop when it is already in a bad state: the more unfinalized blocks there are, the longer block processing takes, which in turn makes it harder for blocks to be finalized.

The reformatting makes sense, but I still want to combine it with the protocol change that limits the number of unfinalized blocks to 12.

You can also implement this first and do some performance testing later to see how much performance degrades when more than 20 blocks are unfinalized.

@Longarithm
Member

Longarithm commented Feb 15, 2023

I usually post updates in Zulip, but I'll try posting references here as well, since we align on the Core team focus from the GitHub kanban.

The new experiment results are very confusing. For a finality delay of 60 blocks, even if we cache all deltas, get_ref latency jumps from 50 μs to 500 μs: https://near.zulipchat.com/#narrow/stream/345766-pagoda.2Fstorage.2Fflat-storage/topic/limiting.20RAM.20usage/near/327970531

I plan to look into get_ref more closely and see where the latency comes from.

@pugachAG pugachAG self-assigned this Feb 22, 2023
@Longarithm
Member

Longarithm commented Feb 28, 2023

Last week, we agreed on our approach for the MVP ("Use key hash") and updated the summary (see the issue description).

Based on the memory estimation, we can store deltas for around 50 full blocks; beyond that, we will display a message warning that flat storage is likely to run into trouble.

@Longarithm
Member

#8662 merged

@Longarithm
Member

We ran an experiment comparing a node with a 50-block finality lag against a node with zero finality lag, and with cached deltas the results are the same:

It is clearly visible in the metrics that the first node regularly "falls behind" by 50 blocks while the second node does not, but the get_ref time is similar, so we can close the issue.
