Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

specify liveness failsafe for builder network #47

Closed
wants to merge 1 commit into from

Conversation

ralexstokes
Copy link
Member

@ralexstokes ralexstokes commented Aug 10, 2022

this PR specifies one version of a "circuit breaker" to reduce the scope of the failure domain in the event of liveness faults on chain

note to implementers: you can just count the distinct number of block roots from the BeaconState to get a stream of the proposed inputs ("M slots over a window of N slots")

Copy link
Contributor

@terencechain terencechain left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it's worth mentioning that "missing slot" is not explicit only to the canonical chain, otherwise, a reorg’ing adversary could trivially disable mev-boost

@terencechain
Copy link
Contributor

note to implementers: you can just count the distinct number of block roots from the BeaconState to get a stream of the proposed inputs ("M slots over a window of N slots")

This will count orphaned blocks as missing, no?

@ralexstokes
Copy link
Member Author

I feel like the outcome on the call was that we would just stick to the canonical chain as it would be simpler to implement -- just look at the canonical beacon state

and to mitigate any sort of reorg'ing attacks we now want to widen the window to trigger the breaker

and if the current suggested range for ALLOWED_FAULTS feels too small we can bump it up (and possibly also widen the rolling window)

@terencechain
Copy link
Contributor

I feel like the outcome on the call was that we would just stick to the canonical chain as it would be simpler to implement -- just look at the canonical beacon state

I don't have an issue with this. For prysm it's equally simple looking from the canonical chain's perspective or combining canonical and forked chain's perspectives. We'll stick with counting missing slots as they are truly missed. I think client diversity is a nice to have here

@ralexstokes
Copy link
Member Author

yes, I think the first goal is: "clients have something implemented" if only to dissuade potential actors who would abuse this attack vector

the second goal is: "do clients in aggregate implement something that is hard to attack?" -- this is where it matters a bit who is doing what bc if someone has a more sensitive trigger then it could selectively be used to take the builder network offline

Beacon clients select randomized values from the following ranges when initializing the circuit breaker (so at boot time
and once for each unique boot).

* `ALLOWED_FAULTS`: between `1` and `SLOTS_PER_EPOCH // 2`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't 1 be too low?

In combination with 2 * SLOT_PER_EPOCH it will activate the circuit breaker after two epochs with missed slots. Although, it seems unlikely on mainnet but not unrealistic

Comment on lines +157 to +158
The circuit breaker becomes active if a liveness fault is observed locally which is determined by considering the beacon
chain over a rolling window of `FAULT_INSPECTION_WINDOW` slots.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so basically once the node is 'in sync', it will start counting missed slots, and missed slots exceeding ALLOWED_FAULTS within a defined window deem that we should stop using the builder network...

Rather than random, it may be more appropriate to have actual configured values to use, that way it could be tweaked so that builder can work on 'test' networks that have more missed blocks...

eg mainnet it might be unusual to see 2 missed blocks in 64 slots, but that's basically business as usual on several testnets...

Doesn't this also suggest that the builder can be taken out through no fault of it's own? If a couple of validators were offline at an inopportune moment, then the builder network could be completely shut down for the period defined by the inspection network?

Edge case would be allowed_faults = 1, fault_inspection_window = 64, then 2 missed slots in 2 epochs would stop you using builders until the first one clears... Then another happens, and you're still locked out, which could result in builder not being used for a prolonged period? So 1 missed block in 32 slots is enough in this case to ensure builder is never used?

#### Failsafe initialization

When a beacon client first boots, the circuit breaker is instantiated with the liveness parameters and is **not**
active.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if I am reading it correctly then: circuit breaker is recommended to be "NOT" active on a start/restart, if thats the case then i feel it should be opposite because its highly likely that in an unhealthy network the nodes will be rebooted and should follow the safest/most conservative route.
This could be client specific but still the recommendation should be reverse.

@tersec
Copy link
Contributor

tersec commented Sep 21, 2022

In terms of reducing the scope of the failure domain in the event of liveness faults on chain, given that https://github.com/remyroy/ethstaker/blob/main/MEV-relay-list.md documents multiple relay networks, it's not obvious that it's more effectively centralized than a still-near-Geth-monoculture on the EL side.

The implied calculation is that the most likely inference from a bunch of missed slots is that (a) a large portion of the network is using the builder API; and (b) the builder API is less reliable in a correlated-across-CL-nodes way than the engine API. (a) hasn't yet proven as true as some earlier predictions, and I'm not sure (b) is useful to bet on -- even if one relay network has problems, it's reasonable to try another, and mev-boost can already be configured to try multiple relay networks.

While local EL infrastructure for the engine API is definitely less centralized per se, that doesn't protect against already-witnessed situations where all or most EL instances of a certain type across a network fail to propose good blocks in some situation.

Rather, the builder and engine APIs act a kind of diversity in themselves, and given that the former isn't simply a FB interface anymore, responding to network trouble by decreasing decentralization seems risky.

@ralexstokes
Copy link
Member Author

closing in lieu of #95

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants