Avoid potential stalls in the Eth1 monitor code due to missing timeouts #3905
Conversation
```diff
@@ -425,7 +425,8 @@ proc close(p: Web3DataProviderRef): Future[void] {.async.} =
   except CatchableError:
     debug "Failed to clean up block headers subscription properly"

-  await p.web3.close()
+  awaitWithTimeout(p.web3.close(), 30.seconds):
```
In general, I think what's missing here is a layer that specifically deals with timeouts and round-robin of web3 providers. Today this is done ad hoc via `awaitWithRetries` and `awaitWithTimeout`, which contributes to the code being convoluted and difficult to work with: everyone has to remember to call the right function, and the responsibilities are mixed up because `Eth1Monitor` does everything. Instead, a "manager" responsible for this could be introduced. `Eth1Monitor` would then call the manager with a "simple" `await`, and the manager would be responsible for routing the request to the correct web3 in the pool of web3 URLs, keeping track of which endpoints are working, which have timed out, and so on.
Fundamentally, all networking code must be carefully guarded with timeouts in order to prevent the possibility of an "async thread" ending up locked forever. What you seem to be arguing is that request/response APIs like the ones exposed by the web3 library could benefit from handling timeouts on their own and treating them as errors (a similar treatment is present in the LibP2P request/response APIs, for example). I don't disagree, but I've avoided introducing large-scale rewrites of the underlying libraries until now. With the planned migration to nim-json-serialization, I'll have to spend time on these lower-layer libraries, and that would be a good opportunity to introduce reforms.
> ones exposed by the web3 library could benefit from handling timeouts on their own
Not quite. We have several responsibilities: one is "map Nim to web3 JSON-RPC", which is more or less what nim-web3 does; another is "manage multiple web3 connections"; and yet another is "interpret the data coming from web3". The missing piece in our code is the middle one. `Eth1Monitor` tries to be both a connection manager and an interpreter, and this is where the current design goes off the rails. A connection manager could live in nim-web3 as a separate module (though I find it appropriate to write it in nimbus-eth2 and, iff another project can benefit, move it at that point), and it seems appropriate that such a multiplexer would handle timeouts as well. But the main point is to give this responsibility to a component separate from `Eth1Monitor`.
For example, how long the timeouts should be might vary between requests and between different users of nim-web3, so putting the policy in nim-web3 itself would be difficult. It's the consumer of nim-web3 that knows how long it is meaningful to wait for a response, and this will differ between procedures (a global timeout is inappropriate) and between calls (the timeout for getting eth1 execution data to serve `getblocksbyrange` depends on a timeout budget that nim-web3 cannot be aware of).
> a connection manager
This line of reasoning is based on an assumption that the web3 providers are interchangeable, i.e. that the logic in the `Eth1Monitor` can be oblivious to which provider is being used behind the scenes.

This is not quite true, because the providers may be serving different data: they may be synced to different points in time, for example. So, when the Eth1 monitor changes providers, it goes through some sanity checks that try to ensure it's safe to continue from where the previous provider left us. If these checks fail, we backtrack to an earlier, safer position: the last block confirmed by the majority of validators in the network. These checks are required because some providers are accidentally malicious: they fail to deliver deposit log events without indicating any error, and our client ends up with a wrong view of the chain.
The first and foremost change needed here is to split out the logic of building the chain into a spec-like module with that level of testing and well-defined failure modes. This is entirely synchronous processing, and it should have tests covering the kinds of errors that "maliciously broken" web3 endpoints could possibly produce (out-of-order blocks, forks, reorgs, etc.).

Once that's done, there will be very little actual logic left in `Eth1Monitor`, and what remains can focus on juggling web3 connections. Although this requires knowledge about errors flowing back and forth between the layers, that knowledge can be simplified down to an API between the two worlds: one side manages a pool of web3 connections in different states (ready/broken/etc.), and the other validates their interactions. It's doable, but it takes some refactoring.