Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Merged by Bors] - Implement el_offline and use it in the VC #4295

Closed
wants to merge 6 commits into from

Conversation

michaelsproul
Copy link
Member

Issue Addressed

Closes #4291, part of #3613.

Proposed Changes

  • Implement the el_offline field on /eth/v1/node/syncing. We set el_offline=true if:

    • The EL's internal status is Offline or AuthFailed, or
    • The most recent call to newPayload resulted in an error (more on this in a moment).
  • Use the el_offline field in the VC to mark nodes with offline ELs as unsynced. These nodes will still be used, but only after synced nodes.

  • Overhaul the usage of RequireSynced so that ::No is used almost everywhere. The --allow-unsynced flag was broken and had the opposite effect to intended, so it has been deprecated.

  • Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.

Why track newPayload errors?

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

  • If the EL is timing out to some calls, it's unlikely to timeout on the upcheck call, which is just eth_syncing. Every failed call is followed by an upcheck here, which would have the effect of masking the failure and keeping the status online.
  • The newPayload call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with forkchoiceUpdated usually returning much faster (<50ms).
  • If newPayload is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients should prefer other BNs if they have one available. In the second case, all of their BNs will likely report el_offline and they'll just have to proceed with trying to use them.

Additional Changes

  • Add utility method ForkName::latest which is quite convenient for test writing, but probably other things too.
  • Delete some stale comments from when we used to support multiple execution nodes.

@michaelsproul michaelsproul added val-client Relates to the validator client binary ready-for-review The code is ready for review v4.2.0 Q2 2023 labels May 16, 2023
Copy link
Member

@pawanjay176 pawanjay176 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Just had a small question

beacon_node/execution_layer/src/lib.rs Outdated Show resolved Hide resolved
@michaelsproul
Copy link
Member Author

Ready for final review and merge. I've added more aggressive polling to the VC, after noticing that the previous lazy polling strategy didn't work on Goerli

@paulhauner paulhauner self-requested a review May 17, 2023 01:59
@paulhauner paulhauner added the backwards-incompat Backwards-incompatible API change label May 17, 2023
@paulhauner
Copy link
Member

paulhauner commented May 17, 2023

Flagging as backwards-incompat so we mention the addition of ee_offline to node/version in the release notes.

I believe v4.2.0 will be backwards compatible with earlier VCs since we don't have #[serde(deny_unknown_fields)] on SyncingData.

Copy link
Member

@paulhauner paulhauner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Approved with a few questions/comments.

@@ -238,6 +238,11 @@ impl Engine {
**self.state.read().await == EngineStateInternal::Synced
}

/// Returns `true` if the engine has a status other than synced or syncing.
pub async fn is_offline(&self) -> bool {
EngineState::from(**self.state.read().await) == EngineState::Offline
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have thoughts on catching EngineState::AuthFailed as well? I think the newPayload failures would catch an auth failure in practice. I don't feel strongly either way.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's accurate. If the auth is wrong, then the EL is effectively offline.

I imagine this would only be an issue during initial setup, or if resyncing the EE and deleting the JWT secret.

///
/// This is used *only* in the informational sync status endpoint, so that a VC using this
/// node can prefer another node with a healthier EL.
last_new_payload_errored: RwLock<bool>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had originally expected that we'd track the last fcU error rather than the last newPayload error.

My reasoning was that it's the last call we'd do in the block import process so it has the most up-to-date information about EE state. However I see now that if we fail a newPayload then we won't end up calling fcU and we'd end up kinda stuck.

If we're choosing just one, then I now prefer newPayload. No changes suggested, just sharing my reasoning.

@@ -1116,18 +1131,6 @@ impl<T: EthSpec> ExecutionLayer<T> {
}

/// Maps to the `engine_newPayload` JSON-RPC call.
///
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice comment cleanups 👍

@michaelsproul
Copy link
Member Author

I believe v4.2.0 will be backwards compatible with earlier VCs since we don't have #[serde(deny_unknown_fields)] on SyncingData.

Yeah I tested this when deploying to Goerli as well. The old VC doesn't mind the new field, and the new VC doesn't mind if it's not there

@michaelsproul michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels May 17, 2023
@michaelsproul
Copy link
Member Author

The doppelganger protection tests are failing because the EL is offline, but I think this is fixed by #3807. Maybe we could (ambitiously) try batching them together, or failing that, mergin 3807 and then updating this PR

@michaelsproul
Copy link
Member Author

bors r+

bors bot pushed a commit that referenced this pull request May 17, 2023
## Issue Addressed

Closes #4291, part of #3613.

## Proposed Changes

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).

- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.


## Why track `newPayload` errors?

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.

## Additional Changes

- Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
@bors
Copy link

bors bot commented May 17, 2023

@bors bors bot changed the title Implement el_offline and use it in the VC [Merged by Bors] - Implement el_offline and use it in the VC May 17, 2023
@bors bors bot closed this May 17, 2023
This was referenced May 17, 2023
bors bot pushed a commit that referenced this pull request May 22, 2023
## Issue Addressed

#4309 (comment)

## Proposed Changes

Log the `Connected to beacon node` message only if the node was previously offline. This avoids a regression in logging after #4295, whereby the `Connected to beacon node` message would be logged every slot.

The new reduced logging is _slightly different_ from what we had prior to my changes in #4295. The main difference is that we used to log the `Connected` message whenever a node was online and subject to a health check (for being unhealthy in some other way). I think the new behaviour is reasonable, as the `Connected` message isn't particularly helpful if the BN is unhealthy, and the specific reason for unhealthiness will be logged by the warnings for `is_compatible`/`is_synced`.
@michaelsproul michaelsproul deleted the el-offline branch May 31, 2023 00:42
ghost pushed a commit to oone-world/lighthouse that referenced this pull request Jul 13, 2023
## Issue Addressed

Closes sigp#4291, part of sigp#3613.

## Proposed Changes

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).

- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.


## Why track `newPayload` errors?

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.

## Additional Changes

- Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
ghost pushed a commit to oone-world/lighthouse that referenced this pull request Jul 13, 2023
## Issue Addressed

sigp#4309 (comment)

## Proposed Changes

Log the `Connected to beacon node` message only if the node was previously offline. This avoids a regression in logging after sigp#4295, whereby the `Connected to beacon node` message would be logged every slot.

The new reduced logging is _slightly different_ from what we had prior to my changes in sigp#4295. The main difference is that we used to log the `Connected` message whenever a node was online and subject to a health check (for being unhealthy in some other way). I think the new behaviour is reasonable, as the `Connected` message isn't particularly helpful if the BN is unhealthy, and the specific reason for unhealthiness will be logged by the warnings for `is_compatible`/`is_synced`.
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
Closes sigp#4291, part of sigp#3613.

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).

- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.

- Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
## Issue Addressed

sigp#4309 (comment)

## Proposed Changes

Log the `Connected to beacon node` message only if the node was previously offline. This avoids a regression in logging after sigp#4295, whereby the `Connected to beacon node` message would be logged every slot.

The new reduced logging is _slightly different_ from what we had prior to my changes in sigp#4295. The main difference is that we used to log the `Connected` message whenever a node was online and subject to a health check (for being unhealthy in some other way). I think the new behaviour is reasonable, as the `Connected` message isn't particularly helpful if the BN is unhealthy, and the specific reason for unhealthiness will be logged by the warnings for `is_compatible`/`is_synced`.
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
Closes sigp#4291, part of sigp#3613.

- Implement the `el_offline` field on `/eth/v1/node/syncing`. We set `el_offline=true` if:
  - The EL's internal status is `Offline` or `AuthFailed`, _or_
  - The most recent call to `newPayload` resulted in an error (more on this in a moment).

- Use the `el_offline` field in the VC to mark nodes with offline ELs as _unsynced_. These nodes will still be used, but only after synced nodes.
- Overhaul the usage of `RequireSynced` so that `::No` is used almost everywhere. The `--allow-unsynced` flag was broken and had the opposite effect to intended, so it has been deprecated.
- Add tests for the EL being offline on the upcheck call, and being offline due to the newPayload check.

Tracking the EL's online/offline status is too coarse-grained to be useful in practice, because:

- If the EL is timing out to some calls, it's unlikely to timeout on the `upcheck` call, which is _just_ `eth_syncing`. Every failed call is followed by an upcheck [here](https://github.com/sigp/lighthouse/blob/693886b94176faa4cb450f024696cb69cda2fe58/beacon_node/execution_layer/src/engines.rs#L372-L380), which would have the effect of masking the failure and keeping the status _online_.
- The `newPayload` call is the most likely to time out. It's the call in which ELs tend to do most of their work (often 1-2 seconds), with `forkchoiceUpdated` usually returning much faster (<50ms).
- If `newPayload` is failing consistently (e.g. timing out) then this is a good indication that either the node's EL is in trouble, or the network as a whole is. In the first case validator clients _should_ prefer other BNs if they have one available. In the second case, all of their BNs will likely report `el_offline` and they'll just have to proceed with trying to use them.

- Add utility method `ForkName::latest` which is quite convenient for test writing, but probably other things too.
- Delete some stale comments from when we used to support multiple execution nodes.
Woodpile37 pushed a commit to Woodpile37/lighthouse that referenced this pull request Jan 6, 2024
## Issue Addressed

sigp#4309 (comment)

## Proposed Changes

Log the `Connected to beacon node` message only if the node was previously offline. This avoids a regression in logging after sigp#4295, whereby the `Connected to beacon node` message would be logged every slot.

The new reduced logging is _slightly different_ from what we had prior to my changes in sigp#4295. The main difference is that we used to log the `Connected` message whenever a node was online and subject to a health check (for being unhealthy in some other way). I think the new behaviour is reasonable, as the `Connected` message isn't particularly helpful if the BN is unhealthy, and the specific reason for unhealthiness will be logged by the warnings for `is_compatible`/`is_synced`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backwards-incompat Backwards-incompatible API change ready-for-merge This PR is ready to merge. v4.2.0 Q2 2023 val-client Relates to the validator client binary
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants