
Conversation

Contributor

@zah zah commented Jul 23, 2022

No description provided.

@github-actions

Unit Test Results

    12 files ±0  ·  860 suites ±0  ·  1h 1m 16s ⏱️ (−4m 36s)
 1 914 tests ±0:  1 766 passed ✔️ ±0,  148 skipped 💤 ±0,  0 failed ±0
10 373 runs  ±0: 10 167 passed ✔️ ±0,  206 skipped 💤 ±0,  0 failed ±0

Results for commit 3234875. ± Comparison against base commit 1a1553f.

@@ -425,7 +425,8 @@ proc close(p: Web3DataProviderRef): Future[void] {.async.} =
   except CatchableError:
     debug "Failed to clean up block headers subscription properly"

-  await p.web3.close()
+  awaitWithTimeout(p.web3.close(), 30.seconds):
Member

In general, I think what's missing here is a layer that specifically deals with timeouts and round-robin of web3 providers. This is now done in awaitWithRetries and awaitWithTimeout, which contributes to the code being convoluted and difficult to work with: everyone has to remember to call the right function, and the responsibilities are mixed up because Eth1Monitor does everything. Instead, a "manager" responsible for this could be introduced: Eth1Monitor would call the manager with a "simple" await, and the manager would be responsible for routing the request to the correct web3 in the pool of web3 URLs, keeping track of which endpoints are working, which have timed out, and so on.
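To make the proposed split concrete, here is a hedged Python asyncio sketch of such a manager layer. Everything here (Web3PoolManager, fake_call, the endpoint URLs) is hypothetical — the point is only the shape: callers use a plain await, while timeout handling and endpoint rotation live inside the manager:

```python
import asyncio
from typing import Awaitable, Callable, List

class Web3PoolManager:
    """Hypothetical manager layer: owns a pool of web3 endpoints and applies
    timeouts and rotation internally, so callers can use a plain await."""

    def __init__(self, endpoints: List[str], timeout_s: float = 5.0):
        self.endpoints = endpoints
        self.timeout_s = timeout_s
        self.current = 0  # index of the endpoint believed to be working

    async def request(self, call: Callable[[str], Awaitable[object]]) -> object:
        # Try each endpoint at most once, starting from the current one.
        last_err = None
        for _attempt in range(len(self.endpoints)):
            url = self.endpoints[self.current]
            try:
                return await asyncio.wait_for(call(url), self.timeout_s)
            except (asyncio.TimeoutError, ConnectionError) as err:
                last_err = err
                # Mark this endpoint as failed and rotate to the next one.
                self.current = (self.current + 1) % len(self.endpoints)
        raise RuntimeError("all web3 endpoints failed") from last_err

# Demo: the first endpoint always hangs, the second answers immediately.
async def fake_call(url: str) -> str:
    if url == "http://slow":
        await asyncio.sleep(60)
    return f"ok from {url}"

async def demo() -> str:
    mgr = Web3PoolManager(["http://slow", "http://fast"], timeout_s=0.1)
    return await mgr.request(fake_call)

print(asyncio.run(demo()))  # rotates past the hung endpoint
```

The caller never sees awaitWithTimeout or retry logic; it simply awaits `mgr.request(...)`, which is the division of responsibility the comment argues for.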

Contributor Author

Fundamentally, all networking code must be carefully guarded with timeouts in order to prevent the possibility of an "async thread" ending up locked forever. What you seem to be arguing is that request/response APIs like the ones exposed by the web3 library could benefit from handling timeouts on their own and treating them as errors (a similar treatment is present in the LibP2P request/response APIs, for example). I don't disagree, but I've avoided introducing large-scale rewrites of the underlying libraries until now. With the planned migration to nim-json-serialization, I'll have to spend time on these lower-layer libraries, and that would be a good opportunity to introduce such reforms.

Member

> ones exposed by the web3 library could benefit from handling timeouts on their own

Not quite - we have several responsibilities. One is "map Nim to web3 JSON-RPC" - this is nim-web3, more or less. Another is "manage multiple web3 connections", and yet another is "interpret the data coming from web3". The missing piece in our code is the middle one: Eth1Monitor tries to be both a connection manager and an interpreter, and this is where the current design goes off the rails. A connection manager could live in nim-web3 as a separate module (though I find it appropriate to write it in nimbus-eth2 and, iff another project can benefit, move it at that point), and it seems appropriate that such a multiplexer would handle timeouts as well - but the main point is to give this responsibility to a component separate from Eth1Monitor.

Member

For example, how long the timeouts should be might vary between requests and between different users of nim-web3, so putting them in nim-web3 itself would be difficult. It's the consumer of nim-web3 that knows how long it is meaningful to wait for a response, and this will differ between procedures (a global timeout is inappropriate) and between calls (the timeout for getting eth1 execution data to serve getblocksbyrange depends on a timeout budget that nim-web3 cannot be aware of).
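The "timeout budget" idea can be sketched in a few lines of Python. Everything here is illustrative (fetch_block and serve_blocks_by_range are made-up names): the low-level call knows nothing about deadlines, while the consumer computes the remaining budget before each call and stops when it runs out:

```python
import asyncio
import time

async def fetch_block(n: int) -> dict:
    """Stand-in for a low-level nim-web3-style call: it is deliberately
    unaware of timeouts, which is the layering the comment argues for."""
    await asyncio.sleep(0.01)
    return {"number": n}

async def serve_blocks_by_range(start: int, count: int, budget_s: float) -> list:
    """The caller owns the timeout budget: each request gets only the time
    remaining, since only the consumer knows how long waiting is useful."""
    deadline = time.monotonic() + budget_s
    blocks = []
    for n in range(start, start + count):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # budget exhausted; return what we have so far
        try:
            blocks.append(await asyncio.wait_for(fetch_block(n), remaining))
        except asyncio.TimeoutError:
            break  # this call ate the rest of the budget
    return blocks

# With a generous budget all five blocks arrive; with none, zero do.
print(len(asyncio.run(serve_blocks_by_range(0, 5, budget_s=1.0))))
print(len(asyncio.run(serve_blocks_by_range(0, 5, budget_s=0.0))))
```

A per-call `remaining` computed from one deadline is exactly what a global timeout constant inside nim-web3 could not express.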

Contributor Author

> a connection manager

This line of reasoning is based on an assumption that the web3 providers are interchangeable - that the logic in the Eth1Monitor can be oblivious to which provider is being used behind the scenes.

This is not quite true, because the providers may be serving different data - they may be synced to different points in time, for example. So, when the Eth1 monitor changes providers, it goes through some sanity checks that try to ensure it's safe to continue from where the previous provider left off. If these checks fail, we backtrack to an earlier, safer position: the last block confirmed by the majority of validators in the network. These checks are required because some providers are accidentally malicious - they fail to deliver deposit log events without indicating any error, and our client ends up with a wrong view of the chain.
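The switchover logic described above can be reduced to a small pure decision. This Python sketch is hypothetical (Checkpoint and resume_point are invented names, not the Eth1Monitor code): continue from the last processed block only if the new provider reports the same hash for it, otherwise backtrack to the majority-confirmed block:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Checkpoint:
    block_number: int
    block_hash: str

def resume_point(new_providers_view: Optional[str],
                 last_seen: Checkpoint,
                 majority_confirmed: Checkpoint) -> Checkpoint:
    """Decide where to resume after switching web3 providers.

    new_providers_view is the hash the *new* provider reports for the block
    number we last processed (None if it doesn't have that block at all).
    If it disagrees with what we saw, the old view may be wrong or the new
    provider may be lagging, so we fall back to the majority-confirmed block.
    """
    if new_providers_view == last_seen.block_hash:
        return last_seen           # safe to continue where we left off
    return majority_confirmed      # backtrack to the validated position

last = Checkpoint(100, "0xabc")
final = Checkpoint(90, "0xdef")
print(resume_point("0xabc", last, final))  # providers agree: keep going
print(resume_point("0x999", last, final))  # mismatch: backtrack
print(resume_point(None, last, final))     # new provider lags: backtrack
```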

Member

The first and foremost change needed here is to split out the logic of building the chain into a spec-like module with that level of testing and failure modes. This is entirely synchronous processing, and it should have tests covering the kinds of errors that "maliciously broken" web3 endpoints could possibly produce (out-of-order blocks, forks, reorgs, etc.). Once that's done, there will be very little actual logic left in Eth1Monitor, and what remains can focus on juggling web3 connections. Although this requires knowledge about errors flowing back and forth between the layers, that knowledge can be simplified down to an API between the two worlds: one component that manages a pool of web3 connections in different states (ready/broken/etc.) and another that validates their interactions. It's doable, but it takes some refactoring.
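A sense of what such a spec-like, synchronous module buys you: because it has no I/O, the failure modes of a broken endpoint become plain unit tests. This Python sketch is illustrative only (Eth1Block and validate_chain_segment are invented names):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Eth1Block:
    number: int
    hash: str
    parent_hash: str

def validate_chain_segment(blocks: List[Eth1Block]) -> bool:
    """Pure, synchronous check of a block sequence as delivered by a web3
    endpoint: numbers must increase by one and parent hashes must link.
    Being side-effect free, it can be unit-tested directly against the
    failure modes a broken endpoint can produce (gaps, out-of-order
    delivery, bad parent links), with no mocked network in sight."""
    for prev, cur in zip(blocks, blocks[1:]):
        if cur.number != prev.number + 1:
            return False  # gap or out-of-order delivery
        if cur.parent_hash != prev.hash:
            return False  # fork/reorg or corrupted data
    return True

good = [Eth1Block(1, "a", "0"), Eth1Block(2, "b", "a"), Eth1Block(3, "c", "b")]
reordered = [good[0], good[2], good[1]]
print(validate_chain_segment(good))       # True
print(validate_chain_segment(reordered))  # False
```

Under this split, the connection-juggling layer only needs a yes/no (or error) from the validator, which is the narrow API between the two worlds the comment describes.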

@zah zah merged commit 7d27e10 into testing Jul 25, 2022
@zah zah deleted the more-web3-retries-20220722 branch July 25, 2022 19:23