perf(l1): use Bytes for trie values to enable O(1) clones#6057
pablodeymo wants to merge 38 commits into main from perf/avoid-copying-trie-leaves
Conversation
…tory Reorganize state_healing.rs and storage_healing.rs into a shared sync/healing/ module structure with clearer naming conventions:
- Create sync/healing/ directory with mod.rs, types.rs, state.rs, storage.rs
- Rename MembatchEntryValue to HealingQueueEntry
- Rename MembatchEntry to StorageHealingQueueEntry
- Rename Membatch type to StorageHealingQueue
- Rename children_not_in_storage_count to missing_children_count
- Rename membatch variables to healing_queue throughout
- Extract shared HealingQueueEntry and StateHealingQueue types to types.rs
- Update sync.rs imports to use new healing module
Reorganize snap protocol code for better maintainability:
- Split rlpx/snap.rs into rlpx/snap/ directory:
  - codec.rs: RLP encoding/decoding for snap messages
  - messages.rs: Snap protocol message types
  - mod.rs: Module re-exports
- Split snap.rs into snap/ directory:
  - constants.rs: Snap sync constants and configuration
  - server.rs: Snap protocol server implementation
  - mod.rs: Module re-exports
- Move snap server tests to dedicated tests/ directory
- Update imports in p2p.rs, peer_handler.rs, and code_collector.rs
Document the phased approach for reorganizing snap sync code:
- Phase 1: rlpx/snap module split
- Phase 2: snap module split with server extraction
- Phase 3: healing module unification
Split the large sync.rs (1631 lines) into focused modules:
- sync/full.rs (~260 lines): Full sync implementation
  - sync_cycle_full(), add_blocks_in_batch(), add_blocks()
- sync/snap_sync.rs (~1100 lines): Snap sync implementation
  - sync_cycle_snap(), snap_sync(), SnapBlockSyncState
  - store_block_bodies(), update_pivot(), block_is_stale()
  - validate_state_root(), validate_storage_root(), validate_bytecodes()
  - insert_accounts(), insert_storages() (both rocksdb and non-rocksdb)
- sync.rs (~285 lines): Orchestration layer
  - Syncer struct with start_sync() and sync_cycle()
  - SyncMode, SyncError, AccountStorageRoots types
  - Re-exports for public API
…p/client.rs Move all snap protocol client-side request methods from peer_handler.rs to a dedicated snap/client.rs module:
- request_account_range and request_account_range_worker
- request_bytecodes
- request_storage_ranges and request_storage_ranges_worker
- request_state_trienodes
- request_storage_trienodes

Also moves related types: DumpError, RequestMetadata, SnapClientError, RequestStateTrieNodesError, RequestStorageTrieNodes. This reduces peer_handler.rs from 2,060 to 670 lines (~68% reduction), leaving it focused on ETH protocol methods (block headers/bodies). Added SnapClientError variant to SyncError for proper error handling. Updated plan_snap_sync.md to mark Phase 4 as complete.
…napError type Implement Phase 5 of snap sync refactoring plan - Error Handling.
- Create snap/error.rs with unified SnapError enum covering all snap protocol errors
- Update server functions (process_account_range_request, process_storage_ranges_request, process_byte_codes_request, process_trie_nodes_request) to return Result<T, SnapError>
- Remove SnapClientError and RequestStateTrieNodesError, consolidate into SnapError
- Keep RequestStorageTrieNodesError struct for request ID tracking in storage healing
- Add From<SnapError> for PeerConnectionError to support error propagation in message handlers
- Update sync module to use SyncError::Snap variant
- Update healing modules (state.rs, storage.rs) to use new error types
- Move DumpError struct to error.rs module
- Update test return types to use SnapError
- Mark Phase 5 as completed in plan document

All phases of the snap sync refactoring are now complete.
Change missing_children_count from u64 to usize in HealingQueueEntry and node_missing_children function to match StorageHealingQueueEntry and be consistent with memory structure counting conventions.
Resolve conflicts from #5977 and #6018 merge to main:
- Keep modular sync structure (sync.rs delegates to full.rs and snap_sync.rs)
- Keep snap client code in snap/client.rs (removed from peer_handler.rs)
- Add InsertingAccountRanges metric from #6018 to snap_sync.rs
- Remove unused info import from peer_handler.rs
Greptile Overview

Greptile Summary

This PR refactors the trie value type from `Vec<u8>` to `Bytes` to enable O(1) clones.

Key changes:
Performance impact:
Confidence Score: 5/5
| Filename | Overview |
|---|---|
| crates/common/trie/node.rs | Updated ValueRLP type from Vec&lt;u8&gt; to Bytes, added Vec&lt;u8&gt; to ValueOrHash conversion |
| crates/common/trie/rkyv_utils.rs | Added BytesWrapper for rkyv serialization of the Bytes type |
| crates/common/trie/trie.rs | Changed ValueRLP type alias from Vec&lt;u8&gt; to Bytes |
| crates/storage/api/mod.rs | Changed storage API to return Bytes and accept slice references for put operations |
| crates/storage/backend/rocksdb.rs | Updated RocksDB backend to return Bytes from get operations and accept slice references for put_batch |
| crates/storage/backend/in_memory.rs | Changed in-memory storage to use Bytes keys/values; copies on write, as expected for testing |
| crates/storage/store.rs | Updated all storage calls to handle the Bytes return type and convert put_batch calls to use slice references |
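The O(1)-clone claim in the summary above comes from `Bytes` being a reference-counted buffer. A minimal sketch of the same idea using only the standard library's `Arc<[u8]>` (since `bytes::Bytes` is an external crate, this is an analogy, not the crate's API):

```rust
use std::sync::Arc;

fn main() {
    // bytes::Bytes is conceptually a ref-counted byte buffer; Arc<[u8]>
    // demonstrates the same property using only the standard library.
    let value: Arc<[u8]> = Arc::from(vec![0u8; 1024]);

    // Cloning bumps a reference count instead of copying 1024 bytes.
    let cached = value.clone();
    assert_eq!(Arc::strong_count(&value), 2);

    // Both handles point at the same allocation: O(1), not O(n).
    assert!(value.as_ptr() == cached.as_ptr());
    println!("clone shares the buffer");
}
```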
Sequence Diagram
sequenceDiagram
participant App as Application
participant Trie as Trie Layer
participant TrieDB as TrieDB Interface
participant Storage as Storage API
participant RocksDB as RocksDB Backend
Note over App,RocksDB: Read Path (Zero-Copy)
App->>Trie: get(key) -> Option<ValueRLP>
Trie->>TrieDB: get(Nibbles) -> Option<Vec<u8>>
TrieDB->>Storage: get(table, key) -> Option<Bytes>
Storage->>RocksDB: get_cf(cf, key)
RocksDB-->>Storage: Vec<u8>
Storage-->>TrieDB: Bytes::from(Vec) (zero-copy)
TrieDB-->>Trie: Vec<u8>
Trie-->>App: Bytes (O(1) clone)
Note over App,RocksDB: Write Path (Reference-Based)
App->>Trie: insert(key, value)
Trie->>TrieDB: put_batch(Vec<(Nibbles, Bytes)>)
TrieDB->>Storage: put_batch(table, &[(&[u8], &[u8])])
Storage->>RocksDB: put_cf(cf, key_slice, value_slice)
Note over RocksDB: RocksDB copies slices internally
RocksDB-->>Storage: Ok
Storage-->>TrieDB: Ok
TrieDB-->>Trie: Ok
Trie-->>App: Ok
Note over Trie: Cache Lookup (O(1) Clone)
Trie->>Trie: Clone Bytes (ref count increment)
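The write path in the diagram can be sketched as a small trait whose batched puts take borrowed slices, so the caller keeps ownership and the backend copies only where it must (all names here are illustrative stand-ins, not the crate's actual API):

```rust
use std::collections::BTreeMap;

// Illustrative stand-in for the storage write API: batched puts take
// borrowed slices, and the backend decides when to copy (RocksDB copies
// internally; this in-memory version copies into owned Vecs).
trait WriteBatch {
    fn put_batch(&mut self, table: &'static str, pairs: &[(&[u8], &[u8])]);
}

#[derive(Default)]
struct InMemoryBackend {
    tables: BTreeMap<&'static str, BTreeMap<Vec<u8>, Vec<u8>>>,
}

impl WriteBatch for InMemoryBackend {
    fn put_batch(&mut self, table: &'static str, pairs: &[(&[u8], &[u8])]) {
        let t = self.tables.entry(table).or_default();
        for (k, v) in pairs {
            t.insert(k.to_vec(), v.to_vec()); // the one copy happens here
        }
    }
}

fn main() {
    let mut db = InMemoryBackend::default();
    let (key, val) = (vec![1u8, 2], vec![9u8]);
    // The caller keeps ownership; only borrowed slices cross the API.
    db.put_batch("nodes", &[(key.as_slice(), val.as_slice())]);
    assert_eq!(db.tables["nodes"][&key], val);
    println!("ok");
}
```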
Kimi AI Code Review

This PR aims to improve the performance of trie operations by using `Bytes`.

Code Correctness and Potential Bugs
Security Vulnerabilities
Performance Implications
Rust Best Practices and Idiomatic Patterns
Memory Safety and Proper Error Handling
Code Readability and Maintainability
Ethereum-Specific Considerations
Overall, the PR looks good and addresses the performance optimization goal effectively. The changes are well-implemented and do not introduce any new issues. The use of `Bytes` here is appropriate.

Suggested Change:
// Use zero-copy conversion to avoid unnecessary data copies
let value = Bytes::copy_from_slice(value);

Final Thoughts: The PR is well-executed and aligns with the goals of improving trie performance. It is ready to be merged after addressing the suggested change.

Automated review by Kimi (Moonshot AI)
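One nit on the suggestion above: `Bytes::copy_from_slice` does copy the slice, so the "zero-copy" comment is misleading; zero-copy construction needs to take ownership of the buffer (e.g. `Bytes::from(Vec<u8>)`). The distinction, sketched with a std-only `Arc<[u8]>` analogy:

```rust
use std::sync::Arc;

fn main() {
    let data: Vec<u8> = vec![1, 2, 3];
    let original_ptr = data.as_ptr();

    // Copy-on-construction (analogous to Bytes::copy_from_slice):
    // the bytes land in a fresh allocation.
    let copied: Arc<[u8]> = Arc::from(&data[..]);
    assert!(copied.as_ptr() != original_ptr);

    // But once constructed, every clone is O(1) and shares that buffer.
    let shared = copied.clone();
    assert!(copied.as_ptr() == shared.as_ptr());
    assert_eq!(Arc::strong_count(&copied), 2);
    println!("copy on construction, share on clone");
}
```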
Regarding the suggested comment for …
Kimi AI Code Review

This PR aims to improve trie value handling by using `Bytes`.

Code Correctness and Potential Bugs
Security Vulnerabilities
Performance Implications
Rust Best Practices and Idiomatic Patterns
Memory Safety and Proper Error Handling
Code Readability and Maintainability
Ethereum-Specific Considerations
Conclusion

The PR is well-executed, focusing on improving performance through the use of `Bytes`.

Suggested Change: Consider adding comments explaining the use of `Bytes` where conversions are non-obvious.

Overall, the PR is ready to be merged after incorporating the suggested change.

Automated review by Kimi (Moonshot AI)
Benchmark Block Execution Results Comparison Against Main
…copying-trie-leaves
in the multisync monitoring script (docker_monitor.py). The sync completion logs already contain per-phase completion markers (e.g. "✓ BLOCK HEADERS complete: 25,693,009 headers in 0:29:00") but this data was not surfaced in the Slack messages or run summaries.

This adds a parse_phase_timings() function that reads saved container logs and extracts timing, count, and duration for all 8 snap sync phases: Block Headers, Account Ranges, Account Insertion, Storage Ranges, Storage Insertion, State Healing, Storage Healing, and Bytecodes.

The breakdown is appended to both the Slack notification (as a code block per network instance) and the text-based run log (run_history.log and per-run summary.txt). When a phase did not complete (e.g. on a failed run), it is simply omitted from the breakdown.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR refactors trie/storage value handling to use reference-counted bytes::Bytes instead of Vec<u8> to make clones O(1) and enable more zero-copy paths (especially for RocksDB reads/writes).
Changes:
- Switch trie value RLP type to `Bytes` and update trie node structures + serialization to support it.
- Update the storage backend API: reads return `Bytes`, and batched writes accept slices of `(&[u8], &[u8])` to reduce allocations.
- Add snap server tests and update various storage read/decode call sites to work with `Bytes`.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| crates/storage/api/mod.rs | Changes storage API to return Bytes and take slice-based batches. |
| crates/storage/backend/rocksdb.rs | Makes RocksDB reads return Bytes and updates batch write signature. |
| crates/storage/backend/in_memory.rs | Switches in-memory tables to Bytes and updates read/write methods (but currently has trait-signature mismatches). |
| crates/storage/trie.rs | Adapts trie DB wrapper to Bytes values and slice-based put_batch. |
| crates/storage/store.rs | Updates decode paths and write batching to accommodate Bytes-returning storage reads. |
| crates/storage/layering.rs | Updates trie wrapper put_batch signature to use Bytes. |
| crates/common/trie/trie.rs | Changes ValueRLP to Bytes and adjusts insert/commit plumbing. |
| crates/common/trie/node/*.rs | Updates leaf/branch node value types + constructors and rkyv serialization for Bytes. |
| crates/common/trie/rlp.rs | Decodes trie node values into Bytes. |
| crates/common/trie/rkyv_utils.rs | Adds BytesWrapper remote for rkyv serialization. |
| crates/common/trie/node.rs | Adjusts commit accumulation for Bytes values (currently reintroduces copies in one spot). |
| crates/common/trie/verify_range.rs | Generalizes verification to accept value types as AsRef<[u8]>. |
| crates/common/trie/trie_sorted.rs | Adjusts empty/default values and conversions for Bytes. |
| crates/common/trie/logger.rs | Updates trie logger put/put_batch signatures to use Bytes. |
| crates/common/trie/db.rs | Updates TrieDB::put_batch to take Bytes and adapts helpers. |
| crates/networking/p2p/sync/healing/state.rs | Converts trie writes to Bytes when calling put_batch. |
| crates/networking/p2p/tests/snap_server_tests.rs | Adds snap protocol server tests (Hive-based AccountRange vectors). |
| CHANGELOG.md | Notes the performance change to Bytes trie values. |
 let header_value = txn.get(HEADERS, hash_key.as_slice())?;
 let mut header = header_value
-    .map(|bytes| BlockHeaderRLP::from_bytes(bytes).to())
+    .map(|bytes| BlockHeaderRLP::from_bytes(bytes.to_vec()).to())
txn.get(HEADERS, ...) returns Bytes, but bytes.to_vec() allocates just to decode the header. This can be avoided by decoding directly from the Bytes slice (e.g., BlockHeader::decode(bytes.as_ref())) or by changing the RLP wrapper to accept Bytes/&[u8] without cloning.
Suggested change:
-    .map(|bytes| BlockHeaderRLP::from_bytes(bytes.to_vec()).to())
+    .map(|bytes| BlockHeaderRLP::from_bytes(bytes.as_ref()).to())
 fn put_batch(&self, key_values: Vec<(Nibbles, Bytes)>) -> Result<(), TrieError> {
     let mut tx = self.db.begin_write().map_err(|e| {
         TrieError::DbError(anyhow::anyhow!("Failed to begin write transaction: {}", e))
     })?;
     for (key, value) in key_values {
         let prefixed_key = self.make_key(key);
         let table = self.table_for_key(&prefixed_key);
-        tx.put_batch(table, vec![(prefixed_key, value)])
+        tx.put_batch(table, &[(&prefixed_key, value.as_ref())])
             .map_err(|e| TrieError::DbError(anyhow::anyhow!("Failed to write batch: {}", e)))?;
     }
BackendTrieDB::put_batch loops and calls tx.put_batch(...) once per key/value pair (with a single-element slice), which largely defeats the purpose of batching and adds per-call overhead. Since table_for_key only routes to either nodes_table or fkv_table, consider grouping the incoming key_values by table and issuing one put_batch call per table.
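The grouping the reviewer suggests could look like the sketch below, where `table_for_key` is a hypothetical stand-in for the real nodes/fkv routing: pairs are bucketed per table first, so the backend sees one large batch per table instead of one single-element batch per pair.

```rust
use std::collections::HashMap;

// Hypothetical routing stand-in: the real code picks between a nodes
// table and a flat-key-value table based on the prefixed key.
fn table_for_key(key: &[u8]) -> &'static str {
    if key.first().copied().unwrap_or(0) % 2 == 0 { "nodes" } else { "fkv" }
}

// Bucket pairs by destination table before writing.
fn group_by_table(
    pairs: Vec<(Vec<u8>, Vec<u8>)>,
) -> HashMap<&'static str, Vec<(Vec<u8>, Vec<u8>)>> {
    let mut groups: HashMap<&'static str, Vec<(Vec<u8>, Vec<u8>)>> = HashMap::new();
    for (key, value) in pairs {
        groups.entry(table_for_key(&key)).or_default().push((key, value));
    }
    groups
}

fn main() {
    let pairs = vec![
        (vec![0u8], vec![10u8]),
        (vec![1u8], vec![11u8]),
        (vec![2u8], vec![12u8]),
    ];
    let groups = group_by_table(pairs);
    // Issue one put_batch per table rather than three single-pair calls.
    assert_eq!(groups["nodes"].len(), 2);
    assert_eq!(groups["fkv"].len(), 1);
    println!("{} batched writes instead of 3", groups.len());
}
```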
@@ -122,12 +123,11 @@ impl<'a> StorageReadView for InMemoryReadTx<'a> {
     .map_err(|_| StoreError::Custom("Failed to acquire read lock".to_string()))?;

 let table_data = db.get(table).cloned().unwrap_or_default();
This StorageReadView::prefix_iterator impl still takes table: &str, but the trait requires table: &'static str (see crates/storage/api/mod.rs). Update the method signature accordingly (and ensure any other StorageReadView/StorageWriteBatch methods in this file match the trait after the API change).
 impl<'a> StorageReadView for InMemoryReadTx<'a> {
-    fn get(&self, table: &str, key: &[u8]) -> Result<Option<Vec<u8>>, StoreError> {
+    fn get(&self, table: &str, key: &[u8]) -> Result<Option<Bytes>, StoreError> {
         let db = self
StorageReadView::get requires table: &'static str (per crates/storage/api/mod.rs), but this impl uses table: &str, which doesn't match the trait method signature and will not compile. Update the parameter type here (and keep other StorageReadView method signatures consistent).
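A minimal illustration of the mismatch (types simplified here; the real trait returns `Result<Option<Bytes>, StoreError>`): in Rust, an impl's method signature must match the trait's exactly, including the `'static` lifetime on the table name.

```rust
// Simplified stand-ins for the trait/impl signature rule the reviewer
// points out: the impl must repeat `&'static str` exactly.
trait StorageReadView {
    fn get(&self, table: &'static str, key: &[u8]) -> Option<Vec<u8>>;
}

struct InMemoryReadTx;

impl StorageReadView for InMemoryReadTx {
    // Writing `table: &str` here would give the parameter a fresh,
    // shorter lifetime and fail to compile ("method not compatible
    // with trait"); `&'static str` matches the trait.
    fn get(&self, table: &'static str, key: &[u8]) -> Option<Vec<u8>> {
        if table == "headers" && key == b"k" {
            Some(vec![1])
        } else {
            None
        }
    }
}

fn main() {
    let tx = InMemoryReadTx;
    assert_eq!(tx.get("headers", b"k"), Some(vec![1]));
    assert_eq!(tx.get("bodies", b"k"), None);
    println!("signatures match");
}
```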
 node.encode(&mut buf);
 let hash = *hash.get_or_init(|| NodeHash::from_encoded(&buf));
 if let Node::Leaf(leaf) = node.as_ref() {
-    acc.push((path.concat(&leaf.partial), leaf.value.clone()));
+    acc.push((path.concat(&leaf.partial), leaf.value.to_vec()));
 }
 acc.push((path, buf));
ValueRLP is now Bytes (cheap to clone), but NodeRef::commit converts leaf values with leaf.value.to_vec(), reintroducing an O(n) copy per leaf. If these leaf values are intended to be written as flat-key-value entries, consider changing the accumulator to store Bytes (or otherwise avoid copying) so commits preserve the intended O(1) clone behavior.
 let block_body_opt = txn
     .get(BODIES, &hash_key)?
-    .map(|bytes| BlockBodyRLP::from_bytes(bytes).to())
+    .map(|bytes| BlockBodyRLP::from_bytes(bytes.to_vec()).to())
txn.get(BODIES, ...) now returns Bytes, but this code does bytes.to_vec() just to decode the body. This adds an avoidable allocation/copy per body. Prefer decoding directly from the Bytes buffer (e.g., BlockBody::decode(bytes.as_ref())) or adjusting the RLP wrapper to accept Bytes/&[u8] without cloning.
Suggested change:
-    .map(|bytes| BlockBodyRLP::from_bytes(bytes.to_vec()).to())
+    .map(|bytes| BlockBodyRLP::from_bytes(bytes.as_ref()).to())
 let Some(block_body) = txn
     .get(BODIES, &hash_key)?
-    .map(|bytes| BlockBodyRLP::from_bytes(bytes).to())
+    .map(|bytes| BlockBodyRLP::from_bytes(bytes.to_vec()).to())
     .transpose()
     .map_err(StoreError::from)?
Same issue as above: bytes.to_vec() is only used to feed BlockBodyRLP::from_bytes, causing an unnecessary allocation/copy. Decode directly from the returned Bytes (or extend the wrapper API) to keep reads zero-copy.
Phase Breakdown (from runs 103–113, perf/avoid-copying-trie-leaves)

Key Observations
ElFantasma left a comment
The PR description claims "Trie cache lookups: O(1) Bytes clones instead of O(n) Vec copies" — but TrieLayerCache (in layering.rs) still stores FxHashMap<Vec<u8>, Vec<u8>> and does value.clone() on reads (line 84). The layer cache was not updated, so that claimed benefit is not realized. The actual benefit is within the trie node structures themselves (cloning LeafNode/BranchNode values is cheaper). If updating the layer cache is planned as a follow-up, might be worth noting that.
Also, BackendTrieDB::get and BackendTrieDBLocked::get add .to_vec() to convert Bytes back to Vec<u8> (since TrieDB::get returns Vec<u8>). This means the RocksDB zero-copy read benefit doesn't flow through to the trie — it would require changing TrieDB::get to return Bytes as well, which is a larger change.
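If changing `TrieDB::get` to return `Bytes` everywhere is too large a step, one intermediate shape (purely a sketch, not the crate's API) is `Cow<'_, [u8]>`, which lets cache hits stay zero-copy while misses return owned data:

```rust
use std::borrow::Cow;

// Hypothetical lookup: borrow from an in-memory layer when possible,
// fall back to an owned buffer (e.g. decoded from disk) otherwise.
fn get<'a>(layer_cache: &'a [u8], hit: bool) -> Cow<'a, [u8]> {
    if hit {
        Cow::Borrowed(layer_cache) // zero-copy view into the cache
    } else {
        Cow::Owned(vec![0u8; 4]) // owned fallback, one allocation
    }
}

fn main() {
    let cache = [1u8, 2, 3];
    let value = get(&cache, true);
    assert!(matches!(value, Cow::Borrowed(_)));
    assert_eq!(value.as_ref(), &[1, 2, 3]);

    let fallback = get(&cache, false);
    assert!(matches!(fallback, Cow::Owned(_)));
    println!("hit borrows, miss owns");
}
```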
 tokio::task::spawn_blocking(move || {
     let txn = backend.begin_read()?;
-    txn.get(table, &key)
+    txn.get(table, &key).map(|opt| opt.map(|b| b.to_vec()))
This .to_vec() introduces an extra copy for every non-trie read compared to the previous code. Before: RocksDB returned Vec<u8> directly. Now: RocksDB returns Vec<u8>, wraps in Bytes (zero-copy), then .to_vec() copies it back. Net effect is one additional allocation for all callers of read_async.
Same applies to read() below at line 1238.
This is fine as a transitional step if the plan is to eventually propagate Bytes to callers, but worth acknowledging as a minor regression on the non-trie read path.
 use std::sync::{Arc, RwLock};

-type Table = BTreeMap<Vec<u8>, Vec<u8>>;
+type Table = BTreeMap<Bytes, Bytes>;
nit: Using Bytes for BTreeMap keys adds atomic refcount overhead that's never useful (keys are never shared/cloned for zero-copy benefit). Consider keeping keys as Vec<u8> and only using Bytes for values: BTreeMap<Vec<u8>, Bytes>. Very minor since in-memory is testing-only.
 pub fn commit(&mut self) -> Result<(), TrieError> {
     let acc = self.commit_without_storing();
+    // Zero-copy conversion: Bytes::from(Vec<u8>) takes ownership
+    let acc = acc.into_iter().map(|(k, v)| (k, v.into())).collect();
nit: This .collect() allocates a new Vec just to convert Vec<u8> → Bytes. Since commit_without_storing() already builds the vec internally, you could have it return Vec<(Nibbles, Bytes)> directly (doing the .into() inside), avoiding this intermediate collection. Alternatively, put_batch could accept an iterator.
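An iterator-accepting `put_batch`, as the reviewer suggests, could be sketched like this (illustrative signature only, not the crate's API): the `Vec<u8>` → `Bytes` conversion then happens lazily inside the iterator, with no intermediate collection.

```rust
// Illustrative sketch: a put_batch that accepts any iterator of pairs,
// so callers can map/convert lazily instead of collecting a new Vec.
fn put_batch<I, K, V>(pairs: I) -> usize
where
    I: IntoIterator<Item = (K, V)>,
    K: AsRef<[u8]>,
    V: AsRef<[u8]>,
{
    // Stand-in for the real write: just count the bytes that would be
    // handed to the backend.
    pairs
        .into_iter()
        .map(|(k, v)| k.as_ref().len() + v.as_ref().len())
        .sum()
}

fn main() {
    // The conversion is fused into iteration; no intermediate Vec is built.
    let written = put_batch((0u8..3).map(|i| (vec![i], vec![i, i])));
    assert_eq!(written, 9);
    println!("{written} bytes staged");
}
```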
 TrieError::DbError(anyhow::anyhow!("Failed to begin read transaction: {}", e))
 })?;
 tx.get(table, prefixed_key.as_ref())
+    .map(|opt| opt.map(|b| b.to_vec()))
The .to_vec() here converts the Bytes from StorageReadView::get back to Vec<u8> because TrieDB::get returns Option<Vec<u8>>. This means the zero-copy RocksDB read benefit doesn't reach the trie — every trie node read still allocates. To realize the "zero-copy from RocksDB" benefit claimed in the PR description, TrieDB::get would need to return Bytes (or Cow<[u8]>), which is a bigger refactor but would be the logical next step.
Summary
- Change the ValueRLP type from Vec&lt;u8&gt; to Bytes (reference-counted) for O(1) clones in trie cache lookups
- Storage reads return Bytes (zero-copy from RocksDB)
- Writes pass slice references straight to RocksDB (put_cf)

Key Changes

Trie layer:

- ValueRLP = Bytes instead of Vec&lt;u8&gt;
- LeafNode and BranchNode store Bytes values
- BytesWrapper for rkyv serialization

Storage API:

- get() returns Option&lt;Bytes&gt;: zero-copy from RocksDB's Vec&lt;u8&gt;
- put_batch(&[(&[u8], &[u8])]): references passed directly to RocksDB

Performance focus:

- In-memory backend stores Bytes, copies on write
- Trie cache lookups: O(1) Bytes clones instead of O(n) Vec copies

Test plan

- cargo check --workspace passes