
Limit memory consumption of flat storage deltas #8436

Closed
Tracked by #7327
exalate-issue-sync bot opened this issue Jan 25, 2023 · 6 comments
exalate-issue-sync bot commented Jan 25, 2023

Summary

With flat storage, we rely on the fact that the size of recent state updates stored in memory is bounded. Otherwise, if blocks are not finalized for a long time, all flat storage deltas will be kept in memory and the node will crash. Moreover, on restart it will try to load the deltas into memory and crash again.

The idea is to store only a limited number of deltas in memory. Based on storage costs, the size of a single delta is bounded by ~50 MB for 4 shards. The main contributor is key length, because we store ValueRefs as values, which have limited size. If we store only 10 deltas, memory usage is at most 500 MB. Other deltas will be read from disk.

The weakness of this solution is that if there is no finality, reading deltas from disk may slow block production down significantly. On the other hand, if there is no finality for >100 blocks, we are likely to have other issues with the chain and some manual intervention will be required. So the plan is to check whether reading 100 deltas from disk introduces a slowdown, and if it does not, we proceed with this approach.

Otherwise we can implement more heuristics, like "if a block is too far from the last final block, set its gas limit to zero", which almost guarantees empty state updates: #8437
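For illustration, here is a minimal sketch of the lookup path this implies: check the bounded set of in-memory deltas first, fall back to disk for older deltas, and finally read the flat state itself. This is not the actual nearcore code; the types and methods (`read_delta_from_disk`, `read_flat_state`, the shape of `ValueRef`) are hypothetical.

```rust
// Minimal sketch of the bounded-deltas lookup path; names are illustrative,
// not the actual nearcore FlatStorage API.
use std::collections::HashMap;

type TrieKey = Vec<u8>;

#[derive(Clone)]
struct ValueRef {
    hash: [u8; 32],
    length: u32,
}

struct FlatStorage {
    /// Deltas for the most recent blocks, newest first, bounded (e.g. 10 entries).
    cached_deltas: Vec<HashMap<TrieKey, Option<ValueRef>>>,
}

impl FlatStorage {
    fn get_ref(&self, key: &TrieKey) -> Option<ValueRef> {
        // 1. Check the bounded in-memory deltas, newest block first.
        //    `Some(None)` in a delta means the key was deleted in that block.
        for delta in &self.cached_deltas {
            if let Some(value) = delta.get(key) {
                return value.clone();
            }
        }
        // 2. Older deltas (when finality lags beyond the cache) are read from disk.
        if let Some(value) = self.read_delta_from_disk(key) {
            return value;
        }
        // 3. Fall back to the flat state at the flat storage head.
        self.read_flat_state(key)
    }

    fn read_delta_from_disk(&self, _key: &TrieKey) -> Option<Option<ValueRef>> {
        unimplemented!("disk-backed delta lookup")
    }

    fn read_flat_state(&self, _key: &TrieKey) -> Option<ValueRef> {
        unimplemented!("flat state lookup at flat storage head")
    }
}
```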

Assumptions

Upper bound on the number of updates in a single chunk is 20k: block_gas_limit / wasm_storage_write_base = 1300 Tgas / 64 Ggas ≈ 20k (calculations from the NEP).

Max key size is 2 KB.

Approaches

Naive

We can cache a HashMap<TrieKey, ValueRef> per block, which results in a max delta size of 50 MB.
It means that storing deltas for 20 such blocks would take 1 GB of RAM.

Use key hash

Let's use the TrieKey hash instead of the whole key, so we store 32 bytes per key regardless of its size. This way the max delta size for a single block is 20k * (32 B + 32 B) = 1.28 MB.
TODO (@pugachAG): check hashmap overhead.
Storing 200 deltas would take 1.28 MB * 200 = 256 MB.
This is the approach we proceed with for the MVP.
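A minimal sketch of this delta representation, assuming a SHA-256 hash of the trie key (nearcore's CryptoHash is SHA-256 based); the names here are illustrative, not the actual implementation.

```rust
// Sketch of a per-block delta keyed by the 32-byte hash of the trie key
// instead of the full key (up to 2 KB), so each entry costs ~64 B in memory.
// Names and types are illustrative, not the nearcore implementation.
use sha2::{Digest, Sha256};
use std::collections::HashMap;

#[derive(Clone)]
struct ValueRef {
    hash: [u8; 32],
    length: u32,
}

#[derive(Default)]
struct HashedDelta(HashMap<[u8; 32], Option<ValueRef>>);

impl HashedDelta {
    fn key_hash(trie_key: &[u8]) -> [u8; 32] {
        let mut out = [0u8; 32];
        out.copy_from_slice(&Sha256::digest(trie_key));
        out
    }

    fn insert(&mut self, trie_key: &[u8], value: Option<ValueRef>) {
        self.0.insert(Self::key_hash(trie_key), value);
    }

    /// `Some(None)` means the key was deleted in this block,
    /// outer `None` means this delta does not touch the key.
    fn get(&self, trie_key: &[u8]) -> Option<Option<ValueRef>> {
        self.0.get(&Self::key_hash(trie_key)).cloned()
    }
}
```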

Use Bloom filter

With this approach we keep a Bloom filter over the keys of each delta (see the Zulip thread). This way we can check whether the value was updated in each chunk and, if so, read the updated value from rocksdb.
Assuming 16 bits = 2 B per key for the Bloom filter, this would require 20k * 2 B = 40 KB per delta, so we can keep deltas for multiple thousands of blocks.
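A rough sketch of what such a per-delta Bloom filter could look like, assuming ~16 bits per key and a simple double-hashing scheme; this is illustrative only, not the implementation discussed in the Zulip thread.

```rust
// Illustrative per-delta Bloom filter: ~16 bits per expected key, so for
// 20k keys roughly 40 KB per delta. Uses std hashers with double hashing.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

struct DeltaBloom {
    bits: Vec<u64>,
    num_bits: u64,
    num_hashes: u32,
}

impl DeltaBloom {
    fn new(expected_keys: usize) -> Self {
        let num_bits = (expected_keys as u64 * 16).max(64);
        Self { bits: vec![0; (num_bits as usize + 63) / 64], num_bits, num_hashes: 7 }
    }

    fn indices(&self, key: &[u8]) -> impl Iterator<Item = u64> {
        let num_bits = self.num_bits;
        let mut h1 = DefaultHasher::new();
        key.hash(&mut h1);
        let h1 = h1.finish();
        let mut h2 = DefaultHasher::new();
        (key, 0xdead_beef_u64).hash(&mut h2);
        let h2 = h2.finish() | 1;
        (0..self.num_hashes as u64).map(move |i| h1.wrapping_add(i.wrapping_mul(h2)) % num_bits)
    }

    fn insert(&mut self, key: &[u8]) {
        let idxs: Vec<u64> = self.indices(key).collect();
        for idx in idxs {
            self.bits[(idx / 64) as usize] |= 1 << (idx % 64);
        }
    }

    /// `false` means the key is definitely not in this delta, so the delta can
    /// be skipped without touching rocksdb; `true` may be a false positive.
    fn may_contain(&self, key: &[u8]) -> bool {
        self.indices(key).all(|idx| self.bits[(idx / 64) as usize] & (1 << (idx % 64)) != 0)
    }
}
```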

Rely on rocksdb Bloom filters

With this approach we do not cache anything in memory and call rocksdb for every delta lookup. This relies on rocksdb-provided Bloom filters to quickly determine that a key is absent from the deltas for a given block.
This approach requires careful rocksdb tuning. With our current rocksdb configuration and synthetic data (500 blocks, 10k entries in each block, 300 B key size), checking whether a key is present in the deltas for a fixed block takes 0.8-3 μs on average and depends heavily on block cache size and compaction. So relying on that is too risky.
In the long term this seems like the most reasonable approach, but also the most time-consuming, as it requires a deep understanding of rocksdb internals as well as a lot of experimentation with both real-world and edge-case data.
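For context, a hedged sketch of the kind of tuning involved, using the rust-rocksdb crate; exact option names and signatures differ between crate versions, so this is illustrative rather than the configuration used in the experiment.

```rust
// Illustrative rocksdb tuning for Bloom-filter-backed point lookups.
// Option names/signatures vary across rust-rocksdb versions; treat as a sketch.
use rocksdb::{BlockBasedOptions, Options, DB};

fn open_deltas_db(path: &str) -> Result<DB, rocksdb::Error> {
    let mut block_opts = BlockBasedOptions::default();
    // Per-SST Bloom filters let a point lookup skip files that cannot
    // contain the key; ~10 bits per key is a common default.
    block_opts.set_bloom_filter(10.0, false);
    // Keep index and filter blocks in the block cache so they are not
    // re-read from disk on every lookup.
    block_opts.set_cache_index_and_filter_blocks(true);

    let mut opts = Options::default();
    opts.create_if_missing(true);
    opts.set_block_based_table_factory(&block_opts);
    DB::open(&opts, path)
}
```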

@Longarithm changed the title from "Limit memory consumption of deltas" to "Limit memory consumption of flat storage deltas" on Jan 25, 2023
@Longarithm
Member

After experimenting, I realized that we need to get rid of get_deltas_between_blocks and change the DB format of deltas.

The current logic is naive: for each get it reads all deltas between the flat storage head and the current block and checks the existence of the key in each delta. In the extreme case where there is no finality and every state update takes 50 MB, this would occupy too much memory. I think it is fine to give 1 GB of RAM to flat storage, but >20 blocks would exceed that.

Proposal - change Delta format from (block_hash, shard_id) -> HashMap<Key, Value> to (block_hash, shard_id, key) -> Value.

  • Reading delta: DB.iter_prefix((block_hash, shard_id))
  • Writing delta: for k, v in delta: DB.write((block_hash, shard_id, k), v)
  • Removing delta: DB.delete_range((block_hash, shard_id, &[0]), (block_hash, shard_id, &[127]))
  • Get ValueRef from flat storage: iterate over parents and call DB.get

We cache up to ~10 deltas anyway, so for 99.9% of the time there are no additional disk reads.
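A minimal sketch of the proposed per-key layout (illustrative key encoding and function names, assuming the rust-rocksdb crate; not the actual nearcore column format):

```rust
// Sketch of the (block_hash, shard_id, key) -> value delta layout.
// Key encoding and names are illustrative only.
use rocksdb::DB;

type CryptoHash = [u8; 32];

/// DB key: block_hash ++ shard_id ++ trie_key, so all entries of one delta
/// share the (block_hash, shard_id) prefix and can be read or deleted by range.
fn delta_db_key(block_hash: &CryptoHash, shard_id: u64, trie_key: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(32 + 8 + trie_key.len());
    key.extend_from_slice(block_hash);
    key.extend_from_slice(&shard_id.to_le_bytes());
    key.extend_from_slice(trie_key);
    key
}

/// Writing a delta: one rocksdb entry per updated trie key
/// (values here are serialized ValueRefs as opaque bytes).
fn write_delta(
    db: &DB,
    block_hash: &CryptoHash,
    shard_id: u64,
    delta: &[(Vec<u8>, Vec<u8>)],
) -> Result<(), rocksdb::Error> {
    for (trie_key, value) in delta {
        db.put(delta_db_key(block_hash, shard_id, trie_key), value)?;
    }
    Ok(())
}

/// Get ValueRef from flat storage: walk from the current block towards the
/// flat storage head and return the first delta entry that touches the key.
fn get_from_deltas(
    db: &DB,
    blocks_to_head: &[CryptoHash],
    shard_id: u64,
    trie_key: &[u8],
) -> Result<Option<Vec<u8>>, rocksdb::Error> {
    for block_hash in blocks_to_head {
        if let Some(value) = db.get(delta_db_key(block_hash, shard_id, trie_key))? {
            return Ok(Some(value));
        }
    }
    Ok(None)
}
```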

@mzhangmzz
Contributor

@Longarithm My main concern here is that this may lead to drastic performance degradation when there are many unfinalized blocks in the system. That may push the system into a bad feedback loop when it is already in a bad state: the more unfinalized blocks there are, the longer block processing takes, which in turn makes it harder for blocks to be finalized.

The reformatting makes sense, but I still want to combine it with the protocol change that limits the number of unfinalized blocks to 12.

You can also implement this first and do some performance testing later to see how much performance degrades when more than 20 blocks are unfinalized.

@Longarithm
Member

Longarithm commented Feb 15, 2023

I usually post updates in Zulip, but I'll try posting references here as well, since we align on the Core team focus from the GitHub kanban.

The new experiment results are very confusing. For a finality delay of 60 blocks, even if we cache all deltas, get_ref latency jumps from 50 μs to 500 μs: https://near.zulipchat.com/#narrow/stream/345766-pagoda.2Fstorage.2Fflat-storage/topic/limiting.20RAM.20usage/near/327970531

I plan to look into get_ref more closely and see where the latency comes from.

@pugachAG pugachAG self-assigned this Feb 22, 2023
@Longarithm
Member

Longarithm commented Feb 28, 2023

Last week, we agreed on our approach for the MVP ("Use key hash") and updated the summary (see the issue description).

Based on the memory estimation, we can store deltas for around 50 full blocks; beyond that, we will display a message warning that flat storage is likely to run into trouble.

@Longarithm
Member

#8662 merged

@Longarithm
Member

We ran an experiment comparing a node with a 50-block finality lag against a node with zero finality lag, and with cached deltas the results are the same:

It is clearly visible in the metrics that the first node regularly "falls behind" by 50 blocks while the second node does not, but the get_ref time is similar, so we can close the issue.
