docs: more information on trie (#12381)
That should be a go-to page if you have any questions about storage. There are indeed many pieces missing:
* how memtrie works
* how storage costs are structured
* how storage proof is generated

but it would take much more time to describe it all.
1 parent 7294163, commit 587c357
Showing 4 changed files with 186 additions and 31 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
# Storage | ||
|
||
## Overview | ||
|
||
The storage subsystem of nearcore is complex and has many layers. This documentation | ||
provides an overview of the common use cases and explains the most important | ||
implementation details. | ||
|
||
The main requirements are: | ||
- Low but predictable latency for reads | ||
- Proof generation for chunk validators | ||
|
||
Additionally, operations with contract storage require precise gas cost computation. | ||
|
||
## Examples | ||
|
||
### Contract Storage | ||
|
||
Contract-specific storage is exposed through methods such as `storage_read` and | ||
`storage_write`, which contracts can call. | ||
|
||
### Runtime State Access | ||
|
||
Whenever an Account, AccessKey, etc. is read or updated during transaction and receipt | ||
processing, it is a storage operation. Examples include token transfers and access | ||
key management. A more advanced example is the management of receipt queues: if a receipt | ||
cannot be processed in the current block, it is added to the delayed queue, which | ||
is part of the storage. | ||
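
To make the idea of a queue living inside key-value storage more concrete, here is a minimal sketch. It is a toy model only: the struct, its field names, and the use of a `HashMap` as a stand-in for storage are assumptions made for illustration, not nearcore's actual implementation.

```rust
use std::collections::HashMap;

/// Toy model of a queue stored as key-value pairs: one record with indices,
/// plus one entry per queued receipt keyed by its index.
/// All names here are hypothetical.
#[derive(Default)]
struct DelayedQueue {
    first_index: u64,
    next_available_index: u64,
    /// Stand-in for storage entries keyed by receipt index.
    receipts: HashMap<u64, String>,
}

impl DelayedQueue {
    /// Appends a receipt that could not be processed in the current block.
    fn push(&mut self, receipt: String) {
        self.receipts.insert(self.next_available_index, receipt);
        self.next_available_index += 1;
    }

    /// Removes the oldest delayed receipt, if any.
    fn pop(&mut self) -> Option<String> {
        if self.first_index == self.next_available_index {
            return None;
        }
        let receipt = self.receipts.remove(&self.first_index);
        self.first_index += 1;
        receipt
    }
}
```

Every `push` and `pop` in this model touches both the index record and one receipt entry, which is why queue management counts as a storage operation.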
|
||
## RPC Node Operations | ||
|
||
When an RPC or validator node receives a transaction, it needs to check its validity. | ||
This involves reading Accounts and AccessKeys, which is also a storage operation. | ||
|
||
Another query to an RPC node is a view call: a state query without modification, or a | ||
contract dry run. | ||
|
||
## Focus | ||
|
||
In this documentation, we focus mostly on the **Contract Storage** use case because | ||
it has the strictest requirements. | ||
|
||
For the high-level flow, refer to [Flow diagram](./flow.md). | ||
For more detailed information, refer to: | ||
|
||
* [Primitives](./primitives.md) | ||
* [Trie Storage](./trie_storage.md) | ||
* [Flat Storage](./flat_storage.md) | ||
* [Database](./database.md) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,19 @@ | ||
# Read and Write Flow for Storage Requests | ||
# Flow | ||
|
||
The storage subsystem of nearcore is complex and has many layers. Here we | ||
present the flow of a single read or write request from the transaction runtime | ||
Here we present the flow of a single read or write request from the transaction runtime | ||
all the way to the OS. As you can see, there are many layers of read-caching and | ||
write-buffering involved. | ||
|
||
<!-- https://docs.google.com/presentation/d/1kHR8ONffUaCaBiJ4KM23h1tcfe4Z-_yKn2gaqlExaiY/edit#slide=id.p --> | ||
![Diagram with read and write request flow](https://user-images.githubusercontent.com/6342444/215088748-028b754f-16be-4f56-9edd-6ce58ff1c9ef.svg) | ||
A blue arrow means a call triggered by a read. | ||
|
||
A red arrow means a call triggered by a write. | ||
|
||
A black arrow means a non-trivial data dependency. For example: | ||
* Nodes which are read from TrieStorage go to TrieRecorder to generate a proof, so they | ||
are connected with a black arrow. | ||
* A memtrie lookup needs the current state of the accounting cache to compute costs. When | ||
the query completes, the accounting cache is updated with memtrie nodes. So they are connected | ||
with a bidirectional black arrow. | ||
|
||
<!-- Editable source: https://docs.google.com/presentation/d/1_iU5GfznFDUMUNi_7szBRd5hDrjqBxr8ap7eTCK-lZA/edit#slide=id.p --> | ||
![Diagram with read and write request flow](https://github.com/user-attachments/assets/232ae746-3f86-4a15-8a3a-08a544a88834) |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,101 @@ | ||
# Primitives | ||
|
||
## TrieKey | ||
|
||
Describes all keys which may be inserted into storage: | ||
|
||
* `Account` | ||
* `ContractCode` | ||
* `AccessKey` | ||
* `ReceivedData` | ||
* `PostponedReceiptId` | ||
* `PendingDataCount` | ||
* `PostponedReceipt` | ||
* `DelayedReceiptIndices` | ||
* `DelayedReceipt` | ||
* `ContractData` | ||
* `PromiseYieldIndices` | ||
* `PromiseYieldTimeout` | ||
* `PromiseYieldReceipt` | ||
* `BufferedReceiptIndices` | ||
* `BufferedReceipt` | ||
* `BandwidthSchedulerState` | ||
|
||
Each key is uniquely converted to `Vec<u8>`. Internally, each such vector is | ||
converted to a `NibbleSlice` (a nibble is half a byte), and each of its items | ||
corresponds to one step down in the trie. | ||
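
The byte-to-nibble conversion can be sketched as follows. This is an illustration only: the function name `to_nibbles` is hypothetical, the high-half-first order is an assumption, and a real `NibbleSlice` is presumably a view over the bytes rather than a freshly allocated vector.

```rust
/// Splits each byte into two nibbles (4-bit halves), high half first.
/// Each resulting nibble selects one of up to 16 children on the way
/// down the trie. Hypothetical helper, not part of nearcore.
fn to_nibbles(key: &[u8]) -> Vec<u8> {
    key.iter()
        .flat_map(|&byte| [byte >> 4, byte & 0x0f])
        .collect()
}
```

Under these assumptions, a key of `n` bytes corresponds to a path of at most `2 * n` steps in the trie.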
|
||
## ValueRef | ||
|
||
``` | ||
ValueRef { | ||
length: u32, | ||
hash: CryptoHash, | ||
} | ||
``` | ||
|
||
A reference to the value corresponding to a trie key (Account, AccessKey, ContractData...). | ||
It contains the value length and the value hash. The full value must be read from disk from | ||
`DBCol::State`, where the DB key is the `ValueRef::hash`. | ||
|
||
It is needed because reading long values has significant latency. Therefore, for | ||
such values we read the `ValueRef` first, charge the user for the latency needed to read a value | ||
of the now-known length, and then read the full value. | ||
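
The "charge first, read second" order can be sketched like this. The cost formula, the gas budget, and the `HashMap` standing in for `DBCol::State` are all assumptions made for illustration; only the `ValueRef` fields come from the definition above.

```rust
use std::collections::HashMap;

/// Stand-in for `CryptoHash` in this sketch.
type Hash = [u8; 32];

#[derive(Clone, Copy)]
struct ValueRef {
    length: u32,
    hash: Hash,
}

/// Hypothetical cost model: a base fee plus a per-byte fee.
fn read_cost(length: u32) -> u64 {
    const BASE: u64 = 100;
    const PER_BYTE: u64 = 2;
    BASE + PER_BYTE * u64::from(length)
}

/// Charges for the read using only the `ValueRef`, then fetches the value.
/// Returns `None` without touching the value store if the budget is too small.
fn read_value(
    value_ref: ValueRef,
    state_column: &HashMap<Hash, Vec<u8>>,
    gas_left: &mut u64,
) -> Option<Vec<u8>> {
    let cost = read_cost(value_ref.length);
    if *gas_left < cost {
        return None;
    }
    *gas_left -= cost;
    state_column.get(&value_ref.hash).cloned()
}
```

Because the length is known before the expensive read, a caller that cannot afford the read is rejected before any long value is loaded.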
|
||
## OptimizedValueRef | ||
|
||
However, the majority of values are short, and it would be a waste of resources to | ||
always spend two reads on them. That's why the read result is an `OptimizedValueRef`, | ||
which contains | ||
|
||
* `AvailableValue` if the value length is <= `INLINE_DISK_VALUE_THRESHOLD`, which | ||
is 4000; | ||
* `Ref`, which is a `ValueRef`, otherwise. | ||
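
As a rough sketch of this rule, the two variants and the threshold check could look as follows. The payload of `AvailableValue`, the `classify` helper, and passing the hash in from outside are assumptions for illustration; the real types may differ.

```rust
/// Threshold quoted in the text above.
const INLINE_DISK_VALUE_THRESHOLD: usize = 4000;

/// Stand-in for `CryptoHash` in this sketch.
type Hash = [u8; 32];

#[derive(Debug, PartialEq)]
struct ValueRef {
    length: u32,
    hash: Hash,
}

#[derive(Debug, PartialEq)]
enum OptimizedValueRef {
    /// Short value, returned together with the lookup result.
    AvailableValue(Vec<u8>),
    /// Long value, which needs a second read by hash.
    Ref(ValueRef),
}

/// Picks the variant for a stored value. `hash` is supplied by the caller
/// because this sketch does not implement hashing. Hypothetical helper.
fn classify(value: &[u8], hash: Hash) -> OptimizedValueRef {
    if value.len() <= INLINE_DISK_VALUE_THRESHOLD {
        OptimizedValueRef::AvailableValue(value.to_vec())
    } else {
        OptimizedValueRef::Ref(ValueRef { length: value.len() as u32, hash })
    }
}
```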
|
||
## get_optimized_ref | ||
|
||
Finally, the main method to serve reads is | ||
|
||
`Trie::get_optimized_ref(key: &[u8], mode: KeyLookupMode) -> Option<OptimizedValueRef>`. | ||
|
||
`key` is the encoded trie key. For example, `TrieKey::Account { account_id }` is used to read | ||
account info. | ||
|
||
`Option<OptimizedValueRef>` is the reference to the value, as discussed above. | ||
|
||
`mode` is the cost mode used for serving the read. | ||
|
||
To get the actual value, the contract runtime calls `Trie::deref_optimized(optimized_value_ref: &OptimizedValueRef)`, | ||
which either makes a full value read or returns the already stored value. | ||
|
||
## KeyLookupMode | ||
|
||
The cost mode used to serve the read. | ||
|
||
* `KeyLookupMode::Trie` - the user must pay for every trie node read needed to find the key. | ||
* `KeyLookupMode::FlatStorage` - the user must pay only for dereferencing the value. | ||
|
||
The distinction is based on the need to write new nodes to storage. For read queries it is | ||
enough to get the value; for write queries we also have to update the value and the | ||
hashes of all the nodes on the path from the root to the leaf corresponding to the key. | ||
|
||
However, this way of distinguishing reads and writes is outdated, because in the | ||
stateless validation world, chunk producers have to read nodes anyway to generate | ||
a proof for chunk validators. But we still maintain it because it helps RPC nodes, | ||
which don't generate proofs. Also, getting the cost for `Trie` mode is *itself* a | ||
challenging task because the node has to keep reading trie nodes until it reaches the leaf. | ||
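
The difference between the two modes can be sketched as a cost function. The fee constants and the `lookup_cost` helper are hypothetical and are not nearcore's real gas parameters; only the two mode names come from the text above.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum KeyLookupMode {
    /// Pay for every trie node on the path to the key.
    Trie,
    /// Pay only for dereferencing the value.
    FlatStorage,
}

/// Hypothetical fees, chosen only to make the arithmetic easy to follow.
const NODE_TOUCH_FEE: u64 = 10;
const VALUE_BYTE_FEE: u64 = 1;

/// Cost of a read that visits `nodes_on_path` trie nodes and returns a
/// value of `value_len` bytes.
fn lookup_cost(mode: KeyLookupMode, nodes_on_path: u64, value_len: u64) -> u64 {
    let deref = VALUE_BYTE_FEE * value_len;
    match mode {
        KeyLookupMode::Trie => NODE_TOUCH_FEE * nodes_on_path + deref,
        KeyLookupMode::FlatStorage => deref,
    }
}
```

Note that in `Trie` mode the cost depends on `nodes_on_path`, which is only known after the path has actually been walked.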
|
||
## lookup | ||
|
||
A family of functions accepting `key: &[u8], side_effects: bool`. They find the node | ||
corresponding to the key in the trie and return a value which is later converted | ||
to `Option<OptimizedValueRef>`. The implementation depends on the main storage source: | ||
|
||
* `lookup_from_memory` | ||
* `lookup_from_flat_storage` | ||
* `lookup_from_state_column` | ||
|
||
The expected side effects are | ||
|
||
* changing accessed node counters so that the contract runtime can charge the caller based on them, | ||
* recording accessed nodes for a proof for chunk validators. |
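
The role of the `side_effects` flag can be sketched with a toy lookup. Everything here is hypothetical: `NodeId`, `LookupContext`, and the idea of passing the path in as a slice are simplifications for illustration, not the actual signatures of the `lookup_*` functions.

```rust
use std::collections::HashSet;

/// Hypothetical identifier of a trie node (stand-in for its hash).
type NodeId = u64;

#[derive(Default)]
struct LookupContext {
    /// Number of nodes touched; the runtime would charge based on it.
    touched_nodes: u64,
    /// Nodes seen so far; they would form the proof for chunk validators.
    recorded: HashSet<NodeId>,
}

/// Walks the given root-to-leaf path. The counter and the recorder are
/// updated only when `side_effects` is true. Returns the leaf, if any.
fn lookup(path: &[NodeId], side_effects: bool, ctx: &mut LookupContext) -> Option<NodeId> {
    for &node in path {
        if side_effects {
            ctx.touched_nodes += 1;
            ctx.recorded.insert(node);
        }
    }
    path.last().copied()
}
```

In this model a lookup with `side_effects` set to `false` returns the same result but leaves the counter and the recorded set untouched.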