Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle EL offline/unavailable in the Sync API #290

Merged
merged 6 commits into from
Feb 2, 2023

Conversation

mehdi-aouadi
Copy link
Contributor

@mehdi-aouadi mehdi-aouadi commented Jan 20, 2023

When the execution client if offline/unavailable, it can't be properly reported by the sync API response.
Added an el_offline field to the GetSyncingStatusResponse.

apis/node/syncing.yaml Outdated Show resolved Hide resolved
@rolfyone
Copy link
Collaborator

There's a fairly long discussion in APIs channel on discord re. this ticket. I believe the preference would be to not return a 503, but instead have a el_offline or similar attribute that indicates the el is not responding... it does sound like a good suggestion from the outside, so posting here - thanks @mcdee for the suggestion.

rolfyone
rolfyone previously approved these changes Feb 2, 2023
Copy link
Collaborator

@rolfyone rolfyone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just PR comment now.

@rolfyone
Copy link
Collaborator

rolfyone commented Feb 2, 2023

oh - can we add an entry in changes.md to indicate that the endpoint has been updated? Cheers!

@mehdi-aouadi mehdi-aouadi dismissed stale reviews from rolfyone and ghost via 4c61b18 February 2, 2023 08:15
@mehdi-aouadi
Copy link
Contributor Author

@rolfyone I added an entry in the changes.md as requested

CHANGES.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@rolfyone rolfyone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rolfyone rolfyone merged commit 1741560 into ethereum:master Feb 2, 2023
etan-status added a commit to status-im/nimbus-eth2 that referenced this pull request Apr 26, 2023
Add compatibility with ethereum/beacon-APIs#290
to the beacon node. Behaviour when configured with multiple ELs is not
specified; intention suggests to indicate whether all ELs are offline.
etan-status added a commit to status-im/nimbus-eth2 that referenced this pull request Apr 27, 2023
Add compatibility with ethereum/beacon-APIs#290
to the beacon node. Behaviour when configured with multiple ELs is not
specified; intention suggests to indicate whether all ELs are offline.
etan-status added a commit to status-im/nimbus-eth2 that referenced this pull request Apr 27, 2023
Updates the validator client to track the new `el_offline` field in the
`/eth/v1/node/syncing` response.
ethereum/beacon-APIs#290

If `el_offline` is present, the BN will be treated same as if it was
optimistically synced. Note that a BN could report that it is in sync,
but may still have intermittent connection issues to the EL, making it
impossible to `getPayload` from them since Capella.
etan-status added a commit to status-im/nimbus-eth2 that referenced this pull request Apr 27, 2023
Updates the validator client to track the new `el_offline` field in the
`/eth/v1/node/syncing` response.
ethereum/beacon-APIs#290

If `el_offline` is present, the BN will be treated same as if it was
optimistically synced. Note that a BN could report that it is in sync,
but may still have intermittent connection issues to the EL, making it
impossible to `getPayload` from them since Capella.
@arnetheduck
Copy link
Contributor

is there a rationale for this field? ie what actually useful information does it bring, on top of is_optimistic (which already requires an "online" el)?

@michaelsproul
Copy link
Collaborator

@arnetheduck The rationale for us in Lighthouse is:

  • We don't want to import blocks optimistically when the EL errors, because optimistic import necessarily disables attestation and block production. Our concern is that a "bomb" block which makes the EL timeout or error across the whole network could pose a liveness risk if it throws all the CL clients into optimistic sync.
  • Instead, we want to give a weak indication to any connected VC that "hey maybe you shouldn't use this node if you've got other options". If the EL is not responding, or has errored/timed out on the latest newPayload call, then we mark it as el_offline: true.
  • The BN stays live (keeps producing attestations & blocks) while el_offline: true.

Credit to @paulhauner for the bomb block argument.

@paulhauner
Copy link
Contributor

I appreciate that Nimbus has decided that they feel it's safe to go ahead and optimistically import against the optimistic sync spec (I mean that genuinely, since the spec may be over-restrictive now the merge transition has passed). However I don't think that there's been enough evidence presented about the liveness-safety of that approach to say that everyone should be optimistically importing on errors and that the API should only recognize that approach.

Deciding to optimistically import blocks with an offline/erroring EE is a non-trivial decision which, in my opinion, is neither objectively good or objectively bad. I see that giving CL clients the ability to differentiate between optimistic and offline allows for greater client diversity.

@rolfyone
Copy link
Collaborator

I appreciate that Nimbus has decided that they feel it's safe to go ahead and optimistically import against the optimistic sync spec (I mean that genuinely, since the spec may be over-restrictive now the merge transition has passed)

Is this something we should consider raising an issue on?

@michaelsproul
Copy link
Collaborator

michaelsproul commented May 24, 2023

Is this something we should consider raising an issue on?

We could, although I think @paulhauner's argument implies that we shouldn't expect every impl to do as Nimbus does

@arnetheduck
Copy link
Contributor

arnetheduck commented May 24, 2023

I'm asking for the rationale simply because the PR fails to include one: I can however see at least a few problems with it:

  • the el_offline flag represents a leaky abstraction: it seems that we're reporting to the validator client something that the validator should not have to be concerned with because it sits in a separate abstraction layer
  • the structure of this API invites race conditions / TOCTOU issues - just because the EL is online or offline in this sync request does not mean that in executing duties, the problem has or hasn't been resolved - the non-racy place for this information would have been the request that produces data for the validator client to sign
  • offline is not really well-defined: what happens if you have two EL:s? should it be set to offline when a single request times out or several? is it offline if the "configuration" request fails but newpayload succeeds?

We don't want to import blocks optimistically when the EL errors, because optimistic import necessarily disables attestation and block production.

This seems unrelated to el_offline and/or is a manifestation of the mentioned abstraction leak - ie the validator client shouldn't really have to know/care about whether the EL is offline or not: it should merely be given something reasonable it can sign - as an example, returning the latest valid block (similar to the latest-valid-hash of the execution API) avoids both the abstraction leak and gives the validator client something to work with.

I can see the argument about liveness safety, but this seems to me more of an implementation detail in the BN more than anything else: it's something that should be solved at the optimistic sync spec / execution api spec level rather than leaking into the VC.

The internal Nimbus implementation "logically" considers itself to be one of many EL:s connected (we support driving multiple EL:s with a single BN) and performs the checks that an execution client is mandated to perform according to the execution API specification in order to be allowed to return SYNCING which it then does, and therefore "should" satisfy the optimistic sync spec as well.

Long story short, because of these differences in implementations and the implied reliance on internal implementation details in how it should be implemented, it risks becoming an obstacle to diversity rather than a boon: we suddenly need to know how VC:s respond to it / what they do with the information in order to "balance" the implementation well.

@mehdi-aouadi mehdi-aouadi deleted the sync-api-el-unavailable branch May 24, 2024 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants