## Description

### Rationale
Currently, when go-ethereum builds/executes a block it processes transactions in the EVM (while using the O(1) snapshot to perform DB reads as needed) and then needs to apply all of the state changes to the trie and compute a new state root.
Typically, this state shuffling is significantly more expensive than the actual EVM processing time (although for the new path-based scheme with enough RAM this is close to no longer being the case: https://twitter.com/peter_szilagyi/status/1708013385558671662).
When executing transactions in the EVM, the statedb uses prefetching logic to pre-warm the trie's cache with the intermediate nodes that will need to be shuffled around when it's time to commit the statedb and compute a new state root. If those intermediate nodes are not already in memory, prefetching incurs a large number of DB reads, which can account for a large share of the time spent on state shuffling.
For large storage tries, prefetching performance can be much worse, since the trie implementation is not safe for concurrent use (https://github.com/ethereum/go-ethereum/blob/v1.13.1/trie/trie.go#L37). Although each storage trie gets its own prefetching goroutine (https://github.com/ethereum/go-ethereum/blob/master/core/state/trie_prefetcher.go#L153), if the majority of the workload comes from a single storage trie, then prefetching can be extremely inefficient.
Ideally, we can parallelize the DB reads to cache all of the intermediate nodes in memory prior to committing the statedb.
### Implementation
There are two different approaches that I've played around with, and I'm happy to complete either implementation. I wanted to put this issue up first to get feedback on whether this change makes sense to the geth team and, if so, which implementation would be preferred.
The difficult part is working around the fact that the trie implementation is not safe for concurrent use; refactoring it to support concurrent operations from an external caller would be a very non-trivial rewrite, so it's much better to work around that problem entirely.
#### Fetch in Parallel Using Independent Tries
Each individual trie is not safe for concurrent use, but we can create multiple instances of the same trie to fetch all of the requested keys in parallel, pulling the necessary intermediate trie nodes from disk into the trie database's cache.
Although the trie instance held by the statedb may not have fully expanded every node, the parallel fetch would load all of the intermediate nodes into the trie database's cache, so that when it comes time to commit the trie, every node is already in memory.
I put up a rough implementation of what it would look like to add this into `updateTrie` within each state object here: master...aaronbuchwald:go-ethereum:parallel-fetch-update-trie. It may make more sense to move this into the prefetcher code, but this was much simpler as a proof of concept.
#### Support `BatchCacheKeys` / `BatchPut` Internally to the Trie
Much easier than parallelizing the trie for external callers is to support parallel operations internal to the trie!
Instead of re-writing the trie to support concurrency, we can construct a trie out of the get/put operations that we want to apply.
For `BatchCacheKeys`, we can construct a trie from the requested keys and arbitrary non-empty values (as required by the current trie code, since an empty value is treated as a delete).
Then we can traverse the actual trie and our batch trie together. By walking both tries in lockstep, we can serve all of the requests encoded in the batch trie and parallelize across subtries that do not depend on each other.
For example:
`t`:

```mermaid
graph TD;
    r-->0;
    r-->1;
    0-->A;
    1-->B;
```

`t'`:

```mermaid
graph TD;
    r'-->0;
    r'-->1;
    0-->A';
    1-->B';
```
In this case, we traverse r and r' and see they both have children at nibbles 0 and 1. Therefore, we can apply A' to A and B' to B in parallel since they are completely independent. Once both subtries have completed, we can apply the resulting update to get `r''`.
The same logic can be applied to either `BatchGet` or `BatchPut` operations, either to warm the cache or to directly parallelize the put operation.
I have a WIP of this implementation here: https://github.com/aaronbuchwald/go-ethereum/blob/trie-batch/trie/trie_parallel.go#L43, but since there are four different node types to deal with, my current implementation is much more complex than I'd like, and I'm still running into a bug in my fuzz test when dealing with values of length 0 (which should be treated as deletes, but still pass the test unless I'm missing something).
At least as a first change, I think the first solution is much better: it's significantly simpler, and since DB reads will be the dominating factor, it should perform very similarly. Also, huge thanks to @dboehm-avalabs for coming up with this much simpler approach here: ava-labs/avalanchego#2128.