docs: more information on trie (#12381)
That should be a go-to page if you have any questions about storage.

There are indeed many pieces missing:
* how memtrie works
* how storage costs are structured
* how storage proof is generated

but it would take much more time to describe it all.
Longarithm authored Nov 5, 2024
1 parent 7294163 commit 587c357
Showing 4 changed files with 186 additions and 31 deletions.
49 changes: 49 additions & 0 deletions docs/architecture/storage/README.md
@@ -0,0 +1,49 @@
# Storage

## Overview

The storage subsystem of nearcore is complex and has many layers. This documentation
provides an overview of the common use cases and explains the most important
implementation details.

The main requirements are:
- Low but predictable latency for reads
- Proof generation for chunk validators

Additionally, operations with contract storage require precise gas cost computation.

## Examples

### Contract Storage

Contract-specific storage is exposed through host functions such as `storage_read` and
`storage_write`, which contracts can call.
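
As an illustration, here is a minimal contract-side sketch using the `near_sdk::env` wrappers around these host functions; the helper name `save_and_read` is made up for this example.

```rust
use near_sdk::env;

// A minimal sketch: persist a raw key-value pair and read it back through
// the storage host functions. `save_and_read` is a hypothetical helper.
pub fn save_and_read(key: &[u8], value: &[u8]) -> Option<Vec<u8>> {
    // `storage_write` returns true if the key was already present.
    let _overwritten = env::storage_write(key, value);
    // `storage_read` returns None if the key is absent.
    env::storage_read(key)
}
```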

### Runtime State Access

When an Account, AccessKey or other state record is read or updated during transaction
and receipt processing, it is always a storage operation. Examples of that are token
transfers and access key management. A more advanced example is the management of
receipt queues: if a receipt cannot be processed in the current block, it is added to
the delayed queue, which is part of the storage.

## RPC Node Operations

When an RPC or validator node receives a transaction, it needs to perform a validity
check. This involves reading Accounts and AccessKeys, which is also a storage operation.

Another query served by RPC nodes is a view call - a state query without modification,
or a contract dry run.

## Focus

In this documentation, we focus most on the **Contract Storage** use case because
it has the strictest requirements.

For the high-level flow, refer to [Flow diagram](./flow.md).
For more detailed information, refer to:

* [Primitives](./primitives.md)
* [Trie Storage](./trie_storage.md)
* [Flat Storage](./flat_storage.md)
* [Database](./database.md)
20 changes: 15 additions & 5 deletions docs/architecture/storage/flow.md
@@ -1,9 +1,19 @@
# Flow

Here we present the flow of a single read or write request from the transaction runtime
all the way to the OS. As you can see, there are many layers of read-caching and
write-buffering involved.

A blue arrow means a call triggered by a read.

A red arrow means a call triggered by a write.

A black arrow means a non-trivial data dependency. For example:
* Nodes which are read on `TrieStorage` go to `TrieRecorder` to generate the proof, so
these two are connected with a black arrow.
* A memtrie lookup needs the current state of the accounting cache to compute costs. When
the query completes, the accounting cache is updated with the accessed memtrie nodes, so
the two are connected with a bidirectional black arrow.

<!-- Editable source: https://docs.google.com/presentation/d/1_iU5GfznFDUMUNi_7szBRd5hDrjqBxr8ap7eTCK-lZA/edit#slide=id.p -->
![Diagram with read and write request flow](https://github.com/user-attachments/assets/232ae746-3f86-4a15-8a3a-08a544a88834)
101 changes: 101 additions & 0 deletions docs/architecture/storage/primitives.md
@@ -0,0 +1,101 @@
# Primitives

## TrieKey

Describes all keys which may be inserted into storage:

* `Account`
* `ContractCode`
* `AccessKey`
* `ReceivedData`
* `PostponedReceiptId`
* `PendingDataCount`
* `PostponedReceipt`
* `DelayedReceiptIndices`
* `DelayedReceipt`
* `ContractData`
* `PromiseYieldIndices`
* `PromiseYieldTimeout`
* `PromiseYieldReceipt`
* `BufferedReceiptIndices`
* `BufferedReceipt`
* `BandwidthSchedulerState`

Each key is uniquely converted to a `Vec<u8>`. Internally, each such vector is
converted to a `NibbleSlice` (a nibble is half of a byte), and each of its items
corresponds to one step down the trie.
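
As a sketch of that byte-to-nibble expansion (not nearcore's actual `NibbleSlice` implementation):

```rust
// Each byte of the encoded trie key contributes two nibbles, i.e. two
// steps down the trie.
fn to_nibbles(key: &[u8]) -> Vec<u8> {
    let mut nibbles = Vec::with_capacity(key.len() * 2);
    for byte in key {
        nibbles.push(byte >> 4);   // high half of the byte first
        nibbles.push(byte & 0x0f); // then the low half
    }
    nibbles
}

// to_nibbles(&[0xab]) == vec![0x0a, 0x0b]: one byte, two steps down the trie.
```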

## ValueRef

```
ValueRef {
    length: u32,
    hash: CryptoHash,
}
```

A reference to the value corresponding to a trie key (Account, AccessKey, ContractData...).
It contains the value length and the value hash. The full value must be read from disk from
`DBCol::State`, where the DB key is the `ValueRef::hash`.

It is needed because reading long values has a significant latency. Therefore, for
such values we read the `ValueRef` first, charge the user for the latency needed to read
a value of the now-known length, and then read the full value.

## OptimizedValueRef

However, the majority of values are short, and it would be a waste of resources to
always spend two reads on them. That's why the result of a read is an `OptimizedValueRef`,
which contains

* `AvailableValue` with the value itself, if the value length is <= `INLINE_DISK_VALUE_THRESHOLD`,
which is 4000;
* `Ref`, which is a `ValueRef`, otherwise.
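
A simplified model of this dichotomy; nearcore's real definitions carry more detail, and `CryptoHash` is stubbed as a plain 32-byte array here.

```rust
type CryptoHash = [u8; 32]; // stub for the real CryptoHash type

const INLINE_DISK_VALUE_THRESHOLD: usize = 4000;

enum OptimizedValueRef {
    // Short values (length <= INLINE_DISK_VALUE_THRESHOLD) come back inline,
    // so a single read is enough.
    AvailableValue(Vec<u8>),
    // Long values come back as (length, hash); the user is charged based on
    // the now-known length before the full value is read from DBCol::State.
    Ref { length: u32, hash: CryptoHash },
}
```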

## get_optimized_ref

Finally, the main method to serve reads is

`Trie::get_optimized_ref(key: &[u8], mode: KeyLookupMode) -> Option<OptimizedValueRef>`.

The key is an encoded trie key, for example `TrieKey::Account { account_id }` to read
account info.

The returned `Option<OptimizedValueRef>` is the reference to the value discussed above.

`mode` is the cost mode used for serving the read.

To get the actual value, the contract runtime calls `Trie::deref_optimized(optimized_value_ref: &OptimizedValueRef)`,
which either makes a full value read or returns the already available value.
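
Putting the two calls together, a read looks roughly like this; in nearcore both methods are fallible (they return `Result<_, StorageError>`), which this sketch omits.

```rust
// A sketch of the two-step read; exact signatures and error handling differ.
fn read_raw(trie: &Trie, key: &[u8]) -> Option<Vec<u8>> {
    // Step 1: walk down to a reference to the value.
    let opt_ref = trie.get_optimized_ref(key, KeyLookupMode::FlatStorage)?;
    // Step 2: inline values are returned as-is; long values trigger one more
    // read from DBCol::State by ValueRef::hash.
    Some(trie.deref_optimized(&opt_ref))
}
```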

## KeyLookupMode

The cost mode used to serve the read.

* `KeyLookupMode::Trie` - the user must pay for every trie node read needed to find the key.
* `KeyLookupMode::FlatStorage` - the user must pay only for dereferencing the value.

The distinction is based on the need to write new nodes to storage. For read queries it is
enough to get the value; for write queries we also have to update the value and the
hashes of all the nodes on the path from the root to the leaf corresponding to the key.

However, this way of distinguishing reads and writes is outdated: in the stateless
validation world, chunk producers have to read the nodes anyway to generate a proof for
chunk validators. But we still maintain it because it helps RPC nodes, which don't
generate proofs. Also, computing the cost for the `Trie` mode is *itself* a challenging
task, because the node has to keep reading nodes until it reaches the leaf.
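
To make the difference concrete, here is a purely illustrative cost model; `TTN_COST` and `VALUE_BYTE_COST` are placeholder constants, not real protocol parameters.

```rust
enum KeyLookupMode { Trie, FlatStorage }

// Illustrative only: real gas accounting in nearcore is more fine-grained.
fn lookup_cost(mode: KeyLookupMode, nodes_on_path: u64, value_len: u64) -> u64 {
    const TTN_COST: u64 = 100;      // placeholder "touching trie node" cost
    const VALUE_BYTE_COST: u64 = 1; // placeholder per-byte dereference cost
    match mode {
        // Every node on the path from the root to the leaf is charged.
        KeyLookupMode::Trie => nodes_on_path * TTN_COST + value_len * VALUE_BYTE_COST,
        // Only the value dereference is charged.
        KeyLookupMode::FlatStorage => value_len * VALUE_BYTE_COST,
    }
}
```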

## lookup

A family of functions accepting `key: &[u8], side_effects: bool`. They find the node
corresponding to the key in the trie and return a value which is later converted
to an `Option<OptimizedValueRef>`. The implementation depends on the main storage source:

* `lookup_from_memory`
* `lookup_from_flat_storage`
* `lookup_from_state_column`

The expected side effects are

* updating accessed-node counters so that the contract runtime can charge the caller
based on them,
* recording the accessed nodes to build the proof for chunk validators.
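
Schematically, the side effects look like this; the types are illustrative stand-ins, not nearcore's.

```rust
type RawTrieNode = Vec<u8>; // stand-in for a serialized trie node

// Stand-in for the accounting attached to a lookup.
struct LookupSideEffects {
    nodes_accessed: u64,        // feeds the gas charged to the caller
    recorded: Vec<RawTrieNode>, // feeds the proof for chunk validators
}

fn on_node_access(effects: &mut Option<LookupSideEffects>, node: &RawTrieNode) {
    // With side_effects == false, `effects` is None and the access
    // leaves no trace.
    if let Some(effects) = effects {
        effects.nodes_accessed += 1;
        effects.recorded.push(node.clone());
    }
}
```
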
47 changes: 21 additions & 26 deletions docs/architecture/storage/trie_storage.md
@@ -13,11 +13,15 @@ Runtime.
### Trie

Trie stores the state - accounts, contract codes, access keys, etc. Each state
item corresponds to a unique trie key. You can read more about this structure on
[Wikipedia](https://en.wikipedia.org/wiki/Trie).

There are two ways to access the trie - from memory and from disk. The first one is
currently the main one: only the loading stage requires disk, and all subsequent
operations are served fully from memory. The latter relies only on disk, with
several layers of caching. Here we describe the disk trie.

The disk trie is stored in RocksDB, which is persistent across node restarts. The trie
communicates with the database using `TrieStorage`. On the database level, data is
stored in key-value format in the `DBCol::State` column. There are two kinds of
records:
@@ -50,6 +54,15 @@ Update is prepared as follows:
* call the `finalize` method, which prepares `TrieChanges` and state changes based on
the `committed` field.

Prospective changes correspond to intermediate state updates, which can be
discarded if the transaction turns out to be invalid (for example, because of
insufficient balance). While they can't be applied yet, they must be cached
this way so that the updated keys are visible when accessed again in the same
transaction.

Committed changes are kept in memory across transactions and receipts.
Similarly, they must be cached so that the updated keys are visible across
transactions. They can be discarded only if the whole chunk is discarded.
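
A condensed model of this two-layer buffering; nearcore's actual `TrieUpdate` is more involved, but the commit/rollback shape is the same.

```rust
use std::collections::HashMap;

// Values are Option<Vec<u8>> so that deletions can be buffered too.
struct TrieUpdateSketch {
    committed: HashMap<Vec<u8>, Option<Vec<u8>>>,   // survives across receipts
    prospective: HashMap<Vec<u8>, Option<Vec<u8>>>, // current change set only
}

impl TrieUpdateSketch {
    fn set(&mut self, key: Vec<u8>, value: Option<Vec<u8>>) {
        self.prospective.insert(key, value);
    }

    // Reads must see the freshest layer first.
    fn get(&self, key: &[u8]) -> Option<&Option<Vec<u8>>> {
        self.prospective.get(key).or_else(|| self.committed.get(key))
    }

    // Called when the transaction or receipt is successfully processed.
    fn commit(&mut self) {
        let staged: Vec<_> = self.prospective.drain().collect();
        self.committed.extend(staged);
    }

    // Called when the transaction turns out to be invalid.
    fn rollback(&mut self) {
        self.prospective.clear();
    }
}
```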

Note that `finalize`, `Trie::insert` and `Trie::update` do not update the
database storage. These functions only modify trie nodes in memory. Instead,
these functions prepare the `TrieChanges` object, and `Trie` is actually updated
@@ -77,28 +90,6 @@ Each shard within `ShardTries` has their own `cache` and `view_cache`.

## Primitives

### TrieChanges

Stores the result of updating `Trie`.
Expand All @@ -109,6 +100,10 @@ Stores result of updating `Trie`.
* `insertions`, `deletions`: vectors of `TrieRefcountChange`, describing all
inserted and deleted nodes.

This way of updating the trie allows adding new nodes to storage and removing old ones
separately. The former corresponds to saving a new block, the latter - to garbage
collection of old block data which is no longer needed.

### TrieRefcountChange

Because we remove unused nodes during garbage collection, we need to track
@@ -125,4 +120,4 @@ This structure is used to update `rc` in the database:

Note that for all reference-counted records, the actual value stored in DB is
the concatenation of `trie_node_or_value` and `rc`. The reference count is
updated using a custom merge operation `refcount_merge`.
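
A sketch of such a merge under the described encoding, assuming a little-endian signed 64-bit refcount at the tail (an assumption of this sketch); the real `refcount_merge` handles more edge cases.

```rust
// The stored value is modeled as `body ++ rc`, rc being a little-endian i64.
fn refcount_merge_sketch(existing: Option<&[u8]>, update: &[u8]) -> Option<Vec<u8>> {
    fn split(record: &[u8]) -> (&[u8], i64) {
        let (body, rc_bytes) = record.split_at(record.len() - 8);
        (body, i64::from_le_bytes(rc_bytes.try_into().unwrap()))
    }

    let (update_body, update_rc) = split(update);
    let (body, rc) = match existing {
        None => (update_body, update_rc),
        Some(existing) => {
            let (existing_body, existing_rc) = split(existing);
            // Bodies for the same hash key are identical; a pure decrement
            // record may carry an empty body, so prefer the non-empty one.
            let body = if existing_body.is_empty() { update_body } else { existing_body };
            (body, existing_rc + update_rc)
        }
    };

    if rc <= 0 {
        None // refcount dropped to zero: the record is removed entirely
    } else {
        let mut merged = body.to_vec();
        merged.extend_from_slice(&rc.to_le_bytes());
        Some(merged)
    }
}
```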
