docs: more information on trie (#12381)
That should be a go-to page if you have any questions about storage.

There are indeed many pieces missing:
* how memtrie works
* how storage costs are structured
* how storage proof is generated

but it would take much more time to describe it all.
Longarithm authored Nov 5, 2024
1 parent 7294163 commit 587c357
Showing 4 changed files with 186 additions and 31 deletions.
49 changes: 49 additions & 0 deletions docs/architecture/storage/README.md
@@ -0,0 +1,49 @@
# Storage

## Overview

The storage subsystem of nearcore is complex and has many layers. This documentation
provides an overview of the common use cases and explains the most important
implementation details.

The main requirements are:
- Low but predictable latency for reads
- Proof generation for chunk validators

Additionally, operations with contract storage require precise gas cost computation.

## Examples

### Contract Storage

Contract-specific storage is exposed through host functions such as `storage_read` and
`storage_write`, which contracts can call.
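
As an illustration, here is a minimal contract-side sketch using the `near_sdk::env` wrappers around these host functions; the helper name `save_and_read` is made up for this example.

```rust
use near_sdk::env;

// A minimal sketch: persist a raw key-value pair and read it back through
// the storage host functions. `save_and_read` is a hypothetical helper.
pub fn save_and_read(key: &[u8], value: &[u8]) -> Option<Vec<u8>> {
    // `storage_write` returns true if the key was already present.
    let _overwritten = env::storage_write(key, value);
    // `storage_read` returns None if the key is absent.
    env::storage_read(key)
}
```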

### Runtime State Access

When an Account, AccessKey or other state record is read or updated during transaction
and receipt processing, it is always a storage operation. Examples of that are token
transfers and access key management. A more advanced example is the management of
receipt queues: if a receipt cannot be processed in the current block, it is added to
the delayed queue, which is part of the storage.

## RPC Node Operations

When an RPC or validator node receives a transaction, it needs to perform a validity
check. This involves reading Accounts and AccessKeys, which is also a storage operation.

Another query served by RPC nodes is a view call - a state query without modification,
or a contract dry run.

## Focus

In this documentation, we focus most on the **Contract Storage** use case because
it has the strictest requirements.

For the high-level flow, refer to [Flow diagram](./flow.md).
For more detailed information, refer to:

* [Primitives](./primitives.md)
* [Trie Storage](./trie_storage.md)
* [Flat Storage](./flat_storage.md)
* [Database](./database.md)
20 changes: 15 additions & 5 deletions docs/architecture/storage/flow.md
@@ -1,9 +1,19 @@
# Flow

Here we present the flow of a single read or write request from the transaction runtime
all the way to the OS. As you can see, there are many layers of read-caching and
write-buffering involved.

A blue arrow means a call triggered by a read.

A red arrow means a call triggered by a write.

A black arrow means a non-trivial data dependency. For example:
* Nodes which are read on `TrieStorage` go to `TrieRecorder` to generate the proof, so
these two are connected with a black arrow.
* A memtrie lookup needs the current state of the accounting cache to compute costs. When
the query completes, the accounting cache is updated with the accessed memtrie nodes, so
the two are connected with a bidirectional black arrow.

<!-- Editable source: https://docs.google.com/presentation/d/1_iU5GfznFDUMUNi_7szBRd5hDrjqBxr8ap7eTCK-lZA/edit#slide=id.p -->
![Diagram with read and write request flow](https://github.com/user-attachments/assets/232ae746-3f86-4a15-8a3a-08a544a88834)
101 changes: 101 additions & 0 deletions docs/architecture/storage/primitives.md
@@ -0,0 +1,101 @@
# Primitives

## TrieKey

Describes all keys which may be inserted into storage:

* `Account`
* `ContractCode`
* `AccessKey`
* `ReceivedData`
* `PostponedReceiptId`
* `PendingDataCount`
* `PostponedReceipt`
* `DelayedReceiptIndices`
* `DelayedReceipt`
* `ContractData`
* `PromiseYieldIndices`
* `PromiseYieldTimeout`
* `PromiseYieldReceipt`
* `BufferedReceiptIndices`
* `BufferedReceipt`
* `BandwidthSchedulerState`

Each key is uniquely converted to a `Vec<u8>`. Internally, each such vector is
converted to a `NibbleSlice` (a nibble is half of a byte), and each of its items
corresponds to one step down the trie.
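
As a sketch of that byte-to-nibble expansion (not nearcore's actual `NibbleSlice` implementation):

```rust
// Each byte of the encoded trie key contributes two nibbles, i.e. two
// steps down the trie.
fn to_nibbles(key: &[u8]) -> Vec<u8> {
    let mut nibbles = Vec::with_capacity(key.len() * 2);
    for byte in key {
        nibbles.push(byte >> 4);   // high half of the byte first
        nibbles.push(byte & 0x0f); // then the low half
    }
    nibbles
}

// to_nibbles(&[0xab]) == vec![0x0a, 0x0b]: one byte, two steps down the trie.
```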

## ValueRef

```
ValueRef {
    length: u32,
    hash: CryptoHash,
}
```

A reference to the value corresponding to a trie key (Account, AccessKey, ContractData...).
It contains the value length and the value hash. The full value must be read from disk from
`DBCol::State`, where the DB key is the `ValueRef::hash`.

It is needed because reading long values has a significant latency. Therefore, for
such values we read the `ValueRef` first, charge the user for the latency needed to read
a value of the now-known length, and then read the full value.

## OptimizedValueRef

However, the majority of values are short, and it would be a waste of resources to
always spend two reads on them. That's why the result of a read is an `OptimizedValueRef`,
which contains

* `AvailableValue` with the value itself, if the value length is <= `INLINE_DISK_VALUE_THRESHOLD`,
which is 4000;
* `Ref`, which is a `ValueRef`, otherwise.
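
A simplified model of this dichotomy; nearcore's real definitions carry more detail, and `CryptoHash` is stubbed as a plain 32-byte array here.

```rust
type CryptoHash = [u8; 32]; // stub for the real CryptoHash type

const INLINE_DISK_VALUE_THRESHOLD: usize = 4000;

enum OptimizedValueRef {
    // Short values (length <= INLINE_DISK_VALUE_THRESHOLD) come back inline,
    // so a single read is enough.
    AvailableValue(Vec<u8>),
    // Long values come back as (length, hash); the user is charged based on
    // the now-known length before the full value is read from DBCol::State.
    Ref { length: u32, hash: CryptoHash },
}
```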

## get_optimized_ref

Finally, the main method to serve reads is

`Trie::get_optimized_ref(key: &[u8], mode: KeyLookupMode) -> Option<OptimizedValueRef>`.

The key is an encoded trie key, for example `TrieKey::Account { account_id }` to read
account info.

The returned `Option<OptimizedValueRef>` is the reference to the value discussed above.

`mode` is the cost mode used for serving the read.

To get the actual value, the contract runtime calls `Trie::deref_optimized(optimized_value_ref: &OptimizedValueRef)`,
which either makes a full value read or returns the already available value.
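
Putting the two calls together, a read looks roughly like this; in nearcore both methods are fallible (they return `Result<_, StorageError>`), which this sketch omits.

```rust
// A sketch of the two-step read; exact signatures and error handling differ.
fn read_raw(trie: &Trie, key: &[u8]) -> Option<Vec<u8>> {
    // Step 1: walk down to a reference to the value.
    let opt_ref = trie.get_optimized_ref(key, KeyLookupMode::FlatStorage)?;
    // Step 2: inline values are returned as-is; long values trigger one more
    // read from DBCol::State by ValueRef::hash.
    Some(trie.deref_optimized(&opt_ref))
}
```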

## KeyLookupMode

The cost mode used to serve the read.

* `KeyLookupMode::Trie` - the user must pay for every trie node read needed to find the key.
* `KeyLookupMode::FlatStorage` - the user must pay only for dereferencing the value.

The distinction is based on the need to write new nodes to storage. For read queries it is
enough to get the value; for write queries we also have to update the value and the
hashes of all the nodes on the path from the root to the leaf corresponding to the key.

However, this way of distinguishing reads and writes is outdated: in the stateless
validation world, chunk producers have to read the nodes anyway to generate a proof for
chunk validators. But we still maintain it because it helps RPC nodes, which don't
generate proofs. Also, computing the cost for the `Trie` mode is *itself* a challenging
task, because the node has to keep reading nodes until it reaches the leaf.
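
To make the difference concrete, here is a purely illustrative cost model; `TTN_COST` and `VALUE_BYTE_COST` are placeholder constants, not real protocol parameters.

```rust
enum KeyLookupMode { Trie, FlatStorage }

// Illustrative only: real gas accounting in nearcore is more fine-grained.
fn lookup_cost(mode: KeyLookupMode, nodes_on_path: u64, value_len: u64) -> u64 {
    const TTN_COST: u64 = 100;      // placeholder "touching trie node" cost
    const VALUE_BYTE_COST: u64 = 1; // placeholder per-byte dereference cost
    match mode {
        // Every node on the path from the root to the leaf is charged.
        KeyLookupMode::Trie => nodes_on_path * TTN_COST + value_len * VALUE_BYTE_COST,
        // Only the value dereference is charged.
        KeyLookupMode::FlatStorage => value_len * VALUE_BYTE_COST,
    }
}
```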

## lookup

A family of functions accepting `key: &[u8], side_effects: bool`. They find the node
corresponding to the key in the trie and return a value which is later converted
to an `Option<OptimizedValueRef>`. The implementation depends on the main storage source:

* `lookup_from_memory`
* `lookup_from_flat_storage`
* `lookup_from_state_column`

The expected side effects are

* updating accessed-node counters so that the contract runtime can charge the caller
based on them,
* recording the accessed nodes to build the proof for chunk validators.
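
Schematically, the side effects look like this; the types are illustrative stand-ins, not nearcore's.

```rust
type RawTrieNode = Vec<u8>; // stand-in for a serialized trie node

// Stand-in for the accounting attached to a lookup.
struct LookupSideEffects {
    nodes_accessed: u64,        // feeds the gas charged to the caller
    recorded: Vec<RawTrieNode>, // feeds the proof for chunk validators
}

fn on_node_access(effects: &mut Option<LookupSideEffects>, node: &RawTrieNode) {
    // With side_effects == false, `effects` is None and the access
    // leaves no trace.
    if let Some(effects) = effects {
        effects.nodes_accessed += 1;
        effects.recorded.push(node.clone());
    }
}
```
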
47 changes: 21 additions & 26 deletions docs/architecture/storage/trie_storage.md
@@ -13,11 +13,15 @@ Runtime.
### Trie

Trie stores the state - accounts, contract codes, access keys, etc. Each state
item corresponds to a unique trie key. You can read more about this structure on
[Wikipedia](https://en.wikipedia.org/wiki/Trie).

There are two ways to access the trie - from memory and from disk. The first one is
currently the main one: only the loading stage requires disk, and all subsequent
operations are served fully from memory. The latter relies only on disk, with
several layers of caching. Here we describe the disk trie.

The disk trie is stored in RocksDB, which is persistent across node restarts. The trie
communicates with the database using `TrieStorage`. On the database level, data is
stored in key-value format in the `DBCol::State` column. There are two kinds of
records:
@@ -50,6 +54,15 @@ Update is prepared as follows:
* call the `finalize` method, which prepares `TrieChanges` and state changes based on
the `committed` field.

Prospective changes correspond to intermediate state updates, which can be
discarded if the transaction turns out to be invalid (for example, because of
insufficient balance). While they can't be applied yet, they must be cached
this way so that the updated keys are visible when accessed again in the same
transaction.

Committed changes are kept in memory across transactions and receipts.
Similarly, they must be cached so that the updated keys are visible across
transactions. They can be discarded only if the whole chunk is discarded.
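
A condensed model of this two-layer buffering; nearcore's actual `TrieUpdate` is more involved, but the commit/rollback shape is the same.

```rust
use std::collections::HashMap;

// Values are Option<Vec<u8>> so that deletions can be buffered too.
struct TrieUpdateSketch {
    committed: HashMap<Vec<u8>, Option<Vec<u8>>>,   // survives across receipts
    prospective: HashMap<Vec<u8>, Option<Vec<u8>>>, // current change set only
}

impl TrieUpdateSketch {
    fn set(&mut self, key: Vec<u8>, value: Option<Vec<u8>>) {
        self.prospective.insert(key, value);
    }

    // Reads must see the freshest layer first.
    fn get(&self, key: &[u8]) -> Option<&Option<Vec<u8>>> {
        self.prospective.get(key).or_else(|| self.committed.get(key))
    }

    // Called when the transaction or receipt is successfully processed.
    fn commit(&mut self) {
        let staged: Vec<_> = self.prospective.drain().collect();
        self.committed.extend(staged);
    }

    // Called when the transaction turns out to be invalid.
    fn rollback(&mut self) {
        self.prospective.clear();
    }
}
```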

Note that `finalize`, `Trie::insert` and `Trie::update` do not update the
database storage. These functions only modify trie nodes in memory. Instead,
these functions prepare the `TrieChanges` object, and `Trie` is actually updated
@@ -77,28 +90,6 @@ Each shard within `ShardTries` has their own `cache` and `view_cache`.

## Primitives

### TrieChanges

Stores the result of updating `Trie`.
Expand All @@ -109,6 +100,10 @@ Stores result of updating `Trie`.
* `insertions`, `deletions`: vectors of `TrieRefcountChange`, describing all
inserted and deleted nodes.

This way of updating the trie allows adding new nodes to storage and removing old ones
separately. The former corresponds to saving a new block, the latter - to garbage
collection of old block data which is no longer needed.

### TrieRefcountChange

Because we remove unused nodes during garbage collection, we need to track
@@ -125,4 +120,4 @@ This structure is used to update `rc` in the database:

Note that for all reference-counted records, the actual value stored in DB is
the concatenation of `trie_node_or_value` and `rc`. The reference count is
updated using a custom merge operation `refcount_merge`.
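
A sketch of such a merge under the described encoding, assuming a little-endian signed 64-bit refcount at the tail (an assumption of this sketch); the real `refcount_merge` handles more edge cases.

```rust
// The stored value is modeled as `body ++ rc`, rc being a little-endian i64.
fn refcount_merge_sketch(existing: Option<&[u8]>, update: &[u8]) -> Option<Vec<u8>> {
    fn split(record: &[u8]) -> (&[u8], i64) {
        let (body, rc_bytes) = record.split_at(record.len() - 8);
        (body, i64::from_le_bytes(rc_bytes.try_into().unwrap()))
    }

    let (update_body, update_rc) = split(update);
    let (body, rc) = match existing {
        None => (update_body, update_rc),
        Some(existing) => {
            let (existing_body, existing_rc) = split(existing);
            // Bodies for the same hash key are identical; a pure decrement
            // record may carry an empty body, so prefer the non-empty one.
            let body = if existing_body.is_empty() { update_body } else { existing_body };
            (body, existing_rc + update_rc)
        }
    };

    if rc <= 0 {
        None // refcount dropped to zero: the record is removed entirely
    } else {
        let mut merged = body.to_vec();
        merged.extend_from_slice(&rc.to_le_bytes());
        Some(merged)
    }
}
```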
