Avoid potential stalls in the Eth1 monitor code due to missing timeouts #3905
Conversation
```diff
@@ -425,7 +425,8 @@ proc close(p: Web3DataProviderRef): Future[void] {.async.} =
   except CatchableError:
     debug "Failed to clean up block headers subscription properly"

-  await p.web3.close()
+  awaitWithTimeout(p.web3.close(), 30.seconds):
```
In general, I think what's missing here is a layer that specifically deals with timeouts and round-robin of web3 providers. Today this is done ad hoc via `awaitWithRetries` and `awaitWithTimeout`, which contributes to the code being convoluted and difficult to work with: everyone has to remember to call the right function, and the responsibilities are mixed up because `Eth1Monitor` does everything. Instead, a "manager" responsible for this could be introduced. `Eth1Monitor` would then call the manager with a "simple" `await`, and the manager would be responsible for routing the request to the correct web3 in the pool of web3 URLs, keeping track of which endpoints are working, which have timed out, and so on.
Fundamentally, all networking code must be carefully guarded with timeouts in order to prevent the possibility of an "async thread" ending up locked forever. What you seem to be arguing is that request/response APIs like the ones exposed by the web3 library could benefit from handling timeouts on their own and treating them as errors (a similar treatment is present in the LibP2P request/response APIs, for example). I don't disagree, but I've avoided introducing large-scale rewrites of the underlying libraries until now. With the planned migration to nim-json-serialization, I'll have to spend time on these lower-layer libraries, and that would be a good opportunity to introduce reforms.
> ones exposed by the web3 library could benefit from handling timeouts on their own
Not quite. We have several responsibilities: one is "map Nim to web3 JSON-RPC", which is more or less what nim-web3 does; another is "manage multiple web3 connections"; and yet another is "interpret the data coming from web3". The missing piece in our code is the middle one. `Eth1Monitor` tries to be both a connection manager and an interpreter, and this is where the current design goes off the rails. A connection manager could live in nim-web3 as a separate module (though I find it appropriate to write it in nimbus-eth2 and, iff another project can benefit, move it at that point), and it seems appropriate that such a multiplexer would handle timeouts as well. But the main point is to give this responsibility to a component separate from `Eth1Monitor`.
For example, how long the timeouts should be might vary between requests and between different users of nim-web3, so putting the policy in nim-web3 itself would be difficult. It's the consumer of nim-web3 that knows how long it is meaningful to wait for a response, and this will differ between procedures (a global timeout is inappropriate) and between calls (the timeout for getting eth1 execution data to serve `getblocksbyrange` depends on a timeout budget that nim-web3 cannot be aware of).
> a connection manager
This line of reasoning is based on an assumption that the web3 providers are interchangeable, i.e. that the logic in the `Eth1Monitor` can be oblivious to which provider is being used behind the scenes.

This is not quite true, because the providers may be serving different data: they may be synced to different points in time, for example. So, when the Eth1 monitor changes providers, it goes through some sanity checks that try to ensure it's safe to continue from where the previous provider left us. If these checks fail, we backtrack to an earlier, safer position: the last block confirmed by the majority of validators in the network. These checks are required because some providers are accidentally malicious: they fail to deliver deposit log events without indicating any error, and our client ends up with a wrong view of the chain.
The first and foremost change needed here is to split out the logic of building the chain into a spec-like module with that level of testing and well-defined failure modes. This is entirely synchronous processing, and it should have tests covering the kinds of errors that "maliciously broken" web3 endpoints could possibly produce (out-of-order blocks, forks, reorgs, etc.).

Once that's done, there will be very little actual logic left in `Eth1Monitor`, and what remains can focus on juggling web3 connections. Although this requires knowledge about errors flowing back and forth between the layers, that knowledge can be simplified down to an API between the two worlds: one side manages a pool of web3 connections in different states (ready/broken/etc.), and the other validates their interactions. It's doable, but it takes some refactoring.