[Merged by Bors] - Avoid hogging the fallback status lock in the VC #3022

Closed

paulhauner wants to merge 4 commits
Conversation

@paulhauner (Member) commented Feb 14, 2022

Issue Addressed

Addresses #2926

Proposed Changes

Appropriated from #2926 (comment):

When a node returns *any* error we call [`CandidateBeaconNode::set_offline`](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L424), which sets its `status` to `CandidateError::Offline`. That node will then be ignored until the routine [`fallback_updater_service`](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L44) manages to reconnect to it.

However, I believe there was an issue in the [`CandidateBeaconNode::refresh_status`](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L157-L178) method, which is used by the updater service to see if the node has come good again. It was holding a [write lock on the `status` field](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L165) whilst it polled the node status. This means a long timeout would hog the write lock and starve other processes.
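
To make the locking issue concrete, here is a minimal sketch of the pattern in question. It is not the real `beacon_node_fallback.rs` code: `check_node_health` is a hypothetical placeholder for the actual node poll, and the types are cut down to just the `status` field, assuming `tokio`'s `RwLock`:

```rust
use std::time::Duration;
use tokio::sync::RwLock;

// Hypothetical stand-ins for the real candidate type and its status.
enum CandidateError {
    Offline,
}

struct CandidateBeaconNode {
    status: RwLock<Result<(), CandidateError>>,
}

impl CandidateBeaconNode {
    // Problematic pattern: the write lock is held across the (possibly slow)
    // network call, so any reader of `status` is blocked until the health
    // check returns or times out.
    async fn refresh_status_hogging(&self) {
        let mut status = self.status.write().await;
        *status = check_node_health().await;
    }

    // Non-hogging pattern: poll the node first, then take the write lock only
    // long enough to store the result.
    async fn refresh_status(&self) {
        let new_status = check_node_health().await; // no lock held here
        *self.status.write().await = new_status;
    }
}

// Placeholder for the real "is this node healthy?" request, which may take up
// to the full HTTP timeout to complete.
async fn check_node_health() -> Result<(), CandidateError> {
    tokio::time::sleep(Duration::from_millis(10)).await;
    Ok(())
}

#[tokio::main]
async fn main() {
    let node = CandidateBeaconNode {
        status: RwLock::new(Err(CandidateError::Offline)),
    };
    // Both variants store the same result; the difference is how long the
    // write lock is held while they run.
    node.refresh_status_hogging().await;
    node.refresh_status().await;
    assert!(node.status.read().await.is_ok());
}
```

Holding the write guard across the `.await` in the first variant is what lets a slow or unresponsive node block every other reader of `status` for the duration of the request timeout.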

When a VC is trying to access a beacon node for whatever purpose (getting duties, posting blocks, etc.), it performs [three passes](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L432-L482) through the list of nodes, trying to run some generic `function` (closure, lambda, etc.) on each node (sketched after the list below):

  • 1st pass: only try running `function` on all nodes which are both synced and online.
  • 2nd pass: try running `function` on all nodes that are online, but not necessarily synced.
  • 3rd pass: for each offline node, try refreshing its status and then running `function` on it.
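
The sketch below illustrates that three-pass ordering with hypothetical `Candidate`/`Status` types and a synchronous `first_success` helper; the real implementation in `beacon_node_fallback.rs` is async and tracks more state, but the pass ordering is the relevant part here:

```rust
// Hypothetical, simplified stand-ins for the real candidate list.
#[derive(Clone, Copy, PartialEq)]
enum Status {
    SyncedAndOnline,
    OnlineNotSynced,
    Offline,
}

struct Candidate {
    name: &'static str,
    status: Status,
}

impl Candidate {
    // Stand-in for the 3rd-pass status refresh; the real version performs a
    // network health check. Here we just pretend the node has recovered.
    fn refresh_status(&mut self) {
        self.status = Status::OnlineNotSynced;
    }
}

// Try `function` against the candidates in three passes, returning the first
// successful result.
fn first_success<T>(
    candidates: &mut [Candidate],
    mut function: impl FnMut(&Candidate) -> Result<T, ()>,
) -> Option<T> {
    // 1st pass: nodes that are both synced and online.
    for c in candidates.iter().filter(|c| c.status == Status::SyncedAndOnline) {
        if let Ok(v) = function(c) {
            return Some(v);
        }
    }
    // 2nd pass: nodes that are online, but not necessarily synced.
    for c in candidates.iter().filter(|c| c.status == Status::OnlineNotSynced) {
        if let Ok(v) = function(c) {
            return Some(v);
        }
    }
    // 3rd pass: refresh each offline node's status, then try it.
    for c in candidates.iter_mut().filter(|c| c.status == Status::Offline) {
        c.refresh_status();
        if let Ok(v) = function(&*c) {
            return Some(v);
        }
    }
    None
}

fn main() {
    let mut nodes = vec![
        Candidate { name: "offline-bn", status: Status::Offline },
        Candidate { name: "healthy-bn", status: Status::SyncedAndOnline },
    ];
    // The synced-and-online node wins in the 1st pass, so the offline node is
    // never touched.
    let picked = first_success(&mut nodes, |c| Ok::<_, ()>(c.name));
    assert_eq!(picked, Some("healthy-bn"));
}
```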

So, it turns out that if the `CandidateBeaconNode::refresh_status` function from the routine update service is hogging the write lock, the 1st pass gets blocked whilst trying to read the status of the first node. Nodes that should be left until the 3rd pass are therefore blocking the progress of the 1st and 2nd passes, hence the behaviour described in #2926.
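
And here is a tiny self-contained demo of the resulting starvation (again just an illustration assuming `tokio`, not the real types): one task holds the `status` write guard across a slow poll, and a would-be 1st-pass reader cannot even complete a plain `read()` in the meantime:

```rust
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::RwLock;
use tokio::time::{sleep, timeout};

#[tokio::main]
async fn main() {
    // Stand-in for the candidate's `status` field.
    let status = Arc::new(RwLock::new("offline"));

    // Simulates the hogging `refresh_status`: the write guard is held across
    // a slow "health check".
    let hog = {
        let status = Arc::clone(&status);
        tokio::spawn(async move {
            let mut guard = status.write().await;
            sleep(Duration::from_secs(1)).await; // slow poll with the lock held
            *guard = "online";
        })
    };

    // Give the hogging task a moment to take the write lock first.
    sleep(Duration::from_millis(50)).await;

    // Simulates the 1st pass trying to read the node's status: it cannot make
    // progress until the slow poll above finishes.
    let first_pass_read = timeout(Duration::from_millis(200), status.read()).await;
    assert!(
        first_pass_read.is_err(),
        "expected the read to time out while the write guard is held"
    );

    hog.await.unwrap();
}
```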

Additional Info

NA

@paulhauner paulhauner added the bug Something isn't working label Feb 14, 2022
@paulhauner paulhauner changed the base branch from stable to unstable February 14, 2022 23:28
@paulhauner paulhauner added the ready-for-review The code is ready for review label Feb 14, 2022
@michaelsproul (Member) left a comment


LGTM, nice find

@michaelsproul michaelsproul added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Feb 22, 2022
@paulhauner (Member Author) commented

bors r+

bors bot pushed a commit that referenced this pull request Feb 22, 2022
@bors bors bot changed the title from "Avoid hogging the fallback status lock in the VC" to "[Merged by Bors] - Avoid hogging the fallback status lock in the VC" Feb 22, 2022
@bors bors bot closed this Feb 22, 2022