Low IOPS during phase 5 due to synchronous reads #1516
Comments
Hi. Here is one more thing to consider for you:
And here are some common architecture decisions made by TurboGeth:
What is your block execution speed now (what do you see in the logs)? |
I think the whole answer should be in the Q&A section of the readme. The questions are really good and common. |
I get 0.1 blocks per second (syncing blocks 11M+) on a dedicated local 7200rpm SAS HDD (not a cloud machine). I got about the same in the cloud.
My experience is only with Linux, so I am not sure how intelligently other OSes manage this. It really depends on which scheduler has been chosen for the device. It turns out that Linux chooses one automatically. When using an HDD, the default scheduler will take advantage of the elevator algorithm, which will considerably increase throughput. In any case, I am convinced that looking into async IO can only benefit here. But the best would be to get a real kernel IO guy to give his 2 cents... |
Interesting. Thank you.
|
Another option: make smart-contract execution more "sequential reads friendly", instead of the random get/set mindset. |
Yes, currently the way Ethereum is designed makes it hard to parallelise, because there is a large global state and all accesses to it from transactions require the "Serialisable" isolation level. There are things we can do that will work well in many cases, but as @AskAlexSharov said, it is not going to be ready soon. |
What initially convinced me to post here is that I noticed that even though the execution is progressing at ~0.1 blk/s, iostat shows me ~27 reads/s. So that would mean ~270 individual reads for the execution of a single block. Isn't it possible to issue those ~270 reads asynchronously, instead of looking at parallelizing block execution? Sorry if my suggestions are naive, I am trying to interpret what's happening on my hardware. Here's a snapshot of what I was observing:
|
@asasmoyo, hi. I got an idea! Can you enable a filesystem compression feature? The logic: our DB reads in 4 KB blocks, but the ZFS default compression block is 128 KB. This means reads will stay synchronous, but they will happen in “bigger batches”. “Isn't it possible to issue those ~270 reads asynchronously, instead of looking at parallelizing block execution?” Sorry, but these are the same problem. The next read can depend on information which the smart contract got from the previous read. Doing it would need either support from Solidity or some kind of “EVM branch prediction” (the kind of thing which led to security issues in CPUs last year). Both are hardly doable. |
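To make the dependency argument concrete, here is a minimal sketch of a read chain in which each key is only known once the previous read has returned. The storage layout and the names (`storage`, `follow_chain`) are made up for illustration and have nothing to do with TG's actual data model.

```python
# Hypothetical slot -> value store; the layout is illustrative only.
storage = {
    "head": "node_7",
    "node_7": "node_3",
    "node_3": "balance_slot",
    "balance_slot": 42,
}


def follow_chain(store, start, hops):
    """Follow a pointer chain through the store.

    The key of read N+1 is the value returned by read N, so the reads
    cannot be issued up front or in parallel.
    """
    reads = []
    key = start
    for _ in range(hops):
        reads.append(key)   # record which key we had to read
        key = store[key]    # the result decides the next key
    return key, reads


value, reads = follow_chain(storage, "head", 4)
print(value, reads)
```

Not every contract reads like this, but the node cannot tell in advance which ones do without executing them.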
Just as another data-point — and maybe food-for-thought for @aasseman: I'm using TG (with the
My run of
How I originally formatted the device, following GCE guidance on use of local SSDs:
How I mounted the device:
|
There is a way around this that we're discussing implementing internally at our company. Our target architecture is very different than TG (we're aiming for a node that executes synchronously once, and then N stateless trace nodes that re-execute in parallel without needing the whole blockchain to be available locally), but I think the technique can be adapted. The idea is that any node that is executing a block "blind", can also be configured to capture, during execution, a listing of all the origin-storage that was read during the block's execution — accounts, storage slots, and code. Such nodes can put this "read-hint data" into some deterministic encoding, and make the "hint" available alongside the block in some way. This hint doesn't help the node that executes the block "blind", of course—it's going to be just as slow as before, if not slower (because now it's persisting hits to the StateDB's origin-storage to an in-memory log.) But for any block that enters consensus, the vast majority of its executions will be re-executions, done with the block already deep in consensus history (if not deep in the individual node's chain), where it's fully possible for the node that's executing the block to grab some hint objects that other nodes have computed about that block after they executed it, before it actually executes the block itself. So the read-hint object can help all the other nodes in the network. (And not every node needs to generate such read-hints. As long as some nodes generate them, other nodes can just mirror them.) Our own strategy for achieving stateless tracing, is to export both the origin-storage keys and values — i.e. accounts in [1] Okay, you also need to capture 256 previous block-hashes to make the Anyway, in TG's case, you don't need to capture the values. You just need to capture the keys that are read. 
This would form an "execution-time state-read hint" object—a list of keys that can all be prefetched into memory at the start of the block's execution. TG can make this hint object available on the P2P protocol (introducing an extension to the eth protocol, some kind of "fetch read-hint for block of hash H" message); and other TG nodes can fetch these read-hints during phased sync, before the execution phase, to accelerate the execution phase. (Presumably, the node could allow the user to configure it to throw the execution read-hints away after it uses them; but if the user doesn't do that, then the node would become a mirror for these read-hints.) Since this read-hint data doesn't specify the values to be read, there's no real trust required in the generator of the read-hints; the worst they can do is to waste other node's time by telling nodes to prefetch stuff that isn't actually part of the block's execution. And nodes could also verify the read-hint after-the-fact (i.e. after the block is done executing), just by generating a read-hint themselves during execution, and comparing the read-hint they generated to the one they relied upon at the beginning. Inaccurate read-hints would be discarded and replaced with the node's own computed one. In this way, accurate read-hints would propagate in the network. |
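A rough sketch of the capture, prefetch and verify cycle described above, using a toy key-value store and a toy "block" of transfers. Everything here (`RecordingState`, `execute_block`, `prefetch`, `hint_is_accurate`) is a hypothetical name for illustration; it is not TG's StateDB or any existing protocol message.

```python
from concurrent.futures import ThreadPoolExecutor


class RecordingState:
    """Wraps a key-value store and records every key that gets read."""

    def __init__(self, backing, cache=None):
        self.backing = backing
        self.cache = cache if cache is not None else {}
        self.read_keys = []   # ordered, de-duplicated: this is the read-hint
        self._seen = set()

    def get(self, key):
        if key not in self._seen:
            self._seen.add(key)
            self.read_keys.append(key)
        if key in self.cache:
            return self.cache[key]
        return self.backing[key]   # stands in for the slow synchronous read


def execute_block(state, transfers):
    """Toy stand-in for block execution: reads each touched balance once."""
    balances = {}
    for sender, recipient, amount in transfers:
        for account in (sender, recipient):
            if account not in balances:
                balances[account] = state.get(account)
        balances[sender] -= amount
        balances[recipient] += amount
    return balances


def prefetch(backing, hint, workers=8):
    """Issue all hinted reads concurrently; unknown keys are ignored."""
    def load(key):
        return key, backing.get(key)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return {k: v for k, v in pool.map(load, hint) if v is not None}


def hint_is_accurate(hint, recorded):
    """A hint is accurate if it names exactly the keys that were read."""
    return set(hint) == set(recorded)


backing = {"A": 10, "B": 0, "C": 5}
block = [("A", "B", 10), ("C", "A", 1)]

# First ("blind") execution produces the hint.
blind = RecordingState(backing)
execute_block(blind, block)
hint = blind.read_keys

# A later re-execution prefetches from the hint, then verifies it.
warm = RecordingState(backing, cache=prefetch(backing, hint))
result = execute_block(warm, block)
print(result, hint_is_accurate(hint, warm.read_keys))
```

Because the hint carries only keys, a bad hint can waste prefetch work but cannot change the execution result, which is the trust property described above.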
Note also that EIP-2930 transactions also include data that can be used as a "read-hint" in this sense (this being much of the motivation behind the EIP.) If the chain accepts, and then transitions to, EIP-2930-style transactions, then external read-hint objects would mostly no longer be needed past that point — the "read-hint" for a block could be computed just-in-time by combining the EIP-2930 tx read-lists with a few pieces of block metadata, e.g. the miner/validator account address. External read-hint objects would still provide a benefit for historical txs below the transition point, however; and would also act as a fallback for non-EIP-2930 txs, if transition to them is less-than-total. The interesting thing is that the same internal machinery within the StateDB would be necessary to drive EIP-2930 read-list generation, as would be needed to drive the generation of external read-hints. So if TG expects to support EIP-2930, it would need to include this logic anyway. Which would make an external read-hint feature already half-complete. 🙂 |
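As a sketch of the just-in-time variant, the function below merges per-transaction access lists with the block's miner address into one set of keys to prefetch. The input shape (dicts with `sender`, `to`, `access_list`) and the tagged-tuple key format are assumptions made for this example; only the idea of an access list as pairs of an address and its storage keys comes from EIP-2930.

```python
def block_read_hint(coinbase, txs):
    """Build a set of keys to prefetch for one block.

    txs: iterable of dicts with 'sender', 'to' (None for contract creation)
    and 'access_list', a list of (address, [storage_key, ...]) pairs.
    """
    keys = {("account", coinbase)}          # block metadata: miner/validator
    for tx in txs:
        keys.add(("account", tx["sender"]))
        if tx["to"] is not None:
            keys.add(("account", tx["to"]))
        for address, storage_keys in tx["access_list"]:
            keys.add(("account", address))
            for slot in storage_keys:
                keys.add(("storage", address, slot))
    return keys


txs = [
    {"sender": "0xaa", "to": "0xbb", "access_list": [("0xbb", ["0x01", "0x02"])]},
    {"sender": "0xcc", "to": None, "access_list": []},
]
hint = block_read_hint("0xminer", txs)
print(sorted(hint))
```

Access lists are declared by the transaction sender and need not be complete, so a hint built this way is best treated as a lower bound on what the block will read.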
Thanks all for your suggestions! Sorry if I am asking a naive question: |
@aasseman here is an example: “Account A has 10ETH, account B has 0ETH. Block 1 transfers 10ETH from account A to B, block 2 transfers 10ETH from B to A.” Now imagine you wanna validate blocks 1 and 2 in parallel and you start from block 2. The transaction in block 2 will fail because account B has 0ETH. This is of course incorrect, because after block 1 account B has 10ETH. |
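The same example as a small runnable sketch. The function and variable names are illustrative only; real transaction validation involves much more than a balance check.

```python
def apply_transfer(balances, sender, recipient, amount):
    """Return the balances after a transfer, or raise if the sender cannot pay."""
    if balances[sender] < amount:
        raise ValueError(f"{sender} has insufficient balance")
    new_balances = dict(balances)
    new_balances[sender] -= amount
    new_balances[recipient] += amount
    return new_balances


genesis = {"A": 10, "B": 0}
block1 = ("A", "B", 10)   # A sends 10 to B
block2 = ("B", "A", 10)   # B sends 10 back to A

# In order: block 2 sees the state produced by block 1 and succeeds.
in_order = apply_transfer(apply_transfer(genesis, *block1), *block2)

# Out of order: block 2 is applied to the pre-block-1 state and fails.
try:
    apply_transfer(genesis, *block2)
    out_of_order_ok = True
except ValueError:
    out_of_order_ok = False

print(in_order, out_of_order_ok)
```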
@AskAlexSharov I see, what I missed is that blocks contain only tx information, but no state... I think that now I understand the constraints a little better. Thanks for taking the time to educate me! So, the process is that you update the state of the network as transactions are replayed in sequential order since tx 0 of block 0. |
All you say is correct. State is 10s of GB. But you also need the blocks themselves: that's 100 GB. TG works this way: it automagically (but lazily) keeps the hot part of the DB in RAM; if your machine has a lot of RAM, then the entire state will be stored in RAM. See how Linux mmap and the page cache work. But we can't use the ReadAhead feature of mmap, because the entire TG DB is 1 TB. FYI: on a 256 GB machine with an SSD, genesis sync takes 36 hours. You can try the flag --batchSize=8000M, but it only applies to state, not to block info. I see only one low-hanging fruit here: warm up block info in the background (like reading 1K blocks ahead). But I think it will not be enough for your case. In short: we always targeted real hardware. To support network drives well, we would need servers and some set of users who need it or are ready to pay for it. So far, you are the first who requested it. Thank you. Good news: you can sync on one machine, and then copy the TG datadir to another machine. Or you can try resizing the machine to 256 GB of RAM and, after genesis sync, resizing it back down. JSON-RPC load is more parallel, and maintaining the chain head will probably work well enough. |
Closing for now, because it is unlikely we will do something to support network disks in the near future. But your PRs are welcome. |
Aha, so a big trick is simply to rent a "fully decked out" VM at like Google for 2 days and then downgrade it, or download the entire DB down to your PC and stop the paid VM from there on. That would be a fast & pretty cheap solution. But it would take like a week to download that 1.3 TB DB :)
|
«download that 1.3 TB» Erigon’s DB is about 2x compressible with lz4. We continue working on snapshot sync, which will speed up genesis sync, but no ETA yet. |
I'm running into the same issue with Polygon; specifically, RAID0 and EBS seem to be performing equivalently. My hunch is that Polygon block sizes are generally so large that a majority of the block write latency is spent in the kernel taking apart the massive blocks that Erigon is writing, so that they align with what can be consumed by the underlying pages and SSD hardware. Can you comment on any tuning that can be done in Erigon to validate that? |
“integration warmup --table=PlainState” can load one table from disk into the page cache, using multiple threads. May be good for a cold start. |
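For readers unfamiliar with the idea of warming the page cache, the general technique can be sketched as reading a file in parallel chunks and throwing the data away, so that the OS keeps the pages cached. This is only an illustration of the technique, not how the `integration warmup` command is implemented; `warm_file` and its parameters are hypothetical, and `os.pread` is Unix-only.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def warm_file(path, chunk_size=1 << 20, workers=4):
    """Read a whole file in parallel chunks and return the byte count.

    The data is discarded; the point is the side effect of the OS
    pulling the file's pages into its page cache.
    """
    size = os.path.getsize(path)
    fd = os.open(path, os.O_RDONLY)
    try:
        def read_chunk(offset):
            # pread takes an explicit offset, so threads share one descriptor
            return len(os.pread(fd, chunk_size, offset))

        with ThreadPoolExecutor(max_workers=workers) as pool:
            return sum(pool.map(read_chunk, range(0, size, chunk_size)))
    finally:
        os.close(fd)
```

Usage would be something like `warm_file("/path/to/datafile")`. It only helps if the file, or at least its hot part, fits in free RAM.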
Hi!
TG, during the execution phase, seems to do all of its I/O synchronously (i.e. one by one), which prevents the OS and disk from optimizing seek times and reducing overall random I/O latency.
The issue is particularly visible when using a cloud instance, where generally disks are actually network block storage, which increases latency a lot even when requesting SSDs (high random IOPS, but still high latency because of the network layer).
Check this documentation for example: https://cloud.google.com/compute/docs/disks/performance. Note that it still applies on bare metal, where latency still exists.
The solution (if the logic permits) would be to have multiple light threads requesting (prefetching) the different pieces of data simultaneously, such that the OS I/O queue could optimize the disk access.
I am running the execution step right now locally on an HDD, and I am getting a throughput of ~100kB/s, read IOPS ~27, and an IO queue depth of ~1.9. Previous steps were able to fill the IO queue to ~128. (Measured through
iostat -xt 10
on a drive used exclusively for TG's data)
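A minimal sketch of the effect being described, with each disk read simulated by a fixed sleep. The 10 ms latency, the worker count and the function names are assumptions chosen for illustration, not measurements of TG. Note that this only works when the keys are known before the reads are issued.

```python
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_S = 0.01  # assumed per-read latency, purely illustrative


def slow_read(key):
    """Stand-in for one random disk read."""
    time.sleep(LATENCY_S)
    return key * 2


def read_sequential(keys):
    """One read in flight at a time (queue depth of about 1)."""
    return [slow_read(k) for k in keys]


def read_concurrent(keys, workers=8):
    """Several reads in flight at once, so the device queue can fill up."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(slow_read, keys))


keys = list(range(20))

t0 = time.perf_counter()
seq = read_sequential(keys)
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
con = read_concurrent(keys)
t_con = time.perf_counter() - t0

print(f"sequential: {t_seq:.3f}s, concurrent: {t_con:.3f}s")
```

With one read in flight at a time, throughput is bounded by roughly 1 / latency, which is consistent with a low queue depth going together with low IOPS.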