# [Merged by Bors] - Avoid hogging the fallback status lock in the VC (#3022)

Status: Closed (merged)

## Conversation
**michaelsproul** approved these changes on Feb 22, 2022:

> LGTM, nice find
**michaelsproul** added the `ready-for-merge` label (This PR is ready to merge.) and removed the `ready-for-review` label (The code is ready for review) on Feb 22, 2022
> bors r+
**bors** bot pushed a commit that referenced this pull request on Feb 22, 2022:
## Issue Addressed

Addresses #2926

## Proposed Changes

Appropriated from #2926 (comment):

When a node returns *any* error we call [`CandidateBeaconNode::set_offline`](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L424), which sets its `status` to `CandidateError::Offline`. That node will then be ignored until the routine [`fallback_updater_service`](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L44) manages to reconnect to it.

However, I believe there was an issue in the [`CandidateBeaconNode::refresh_status`](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L157-L178) method, which the updater service uses to check whether the node has come good again. It held a [write lock on the `status` field](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L165) whilst it polled the node status. This means a long timeout would hog the write lock and starve other processes.

When a VC is trying to access a beacon node for whatever purpose (getting duties, posting blocks, etc.), it performs [three passes](https://github.com/sigp/lighthouse/blob/c3a793fd73a3b11b130b82032904d39c952869e4/validator_client/src/beacon_node_fallback.rs#L432-L482) through the list of nodes, trying to run some generic `function` (closure, lambda, etc.) on each node:

- 1st pass: only try running `function` on nodes which are both synced and online.
- 2nd pass: try running `function` on all nodes that are online, but not necessarily synced.
- 3rd pass: for each offline node, try refreshing its status and then running `function` on it.
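The three passes above can be sketched in simplified form (hypothetical names, a boolean `function`, and a plain `std::sync::RwLock`; the real implementation in `beacon_node_fallback.rs` is async and generic over the return type):

```rust
use std::sync::RwLock;

// Simplified stand-ins for the real candidate-node types.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Status {
    SyncedAndOnline,
    Online,
    Offline,
}

struct Candidate {
    name: &'static str,
    status: RwLock<Status>,
}

/// Run `function` against the candidates in three passes, mirroring the
/// order described above: synced-and-online first, then merely online,
/// and finally offline nodes after an attempted status refresh.
fn first_success<F>(nodes: &[Candidate], mut function: F) -> Option<&'static str>
where
    F: FnMut(&Candidate) -> bool,
{
    // 1st pass: nodes that are both synced and online.
    for node in nodes {
        if *node.status.read().unwrap() == Status::SyncedAndOnline && function(node) {
            return Some(node.name);
        }
    }
    // 2nd pass: nodes that are online but not necessarily synced.
    for node in nodes {
        if *node.status.read().unwrap() == Status::Online && function(node) {
            return Some(node.name);
        }
    }
    // 3rd pass: refresh each offline node's status, then try it.
    for node in nodes {
        if *node.status.read().unwrap() == Status::Offline {
            // A real refresh would poll the node; assume it recovered here.
            *node.status.write().unwrap() = Status::Online;
            if function(node) {
                return Some(node.name);
            }
        }
    }
    None
}

fn main() {
    let nodes = [
        Candidate { name: "offline", status: RwLock::new(Status::Offline) },
        Candidate { name: "synced", status: RwLock::new(Status::SyncedAndOnline) },
    ];
    // The synced node wins the 1st pass even though it is listed second.
    assert_eq!(first_success(&nodes, |_| true), Some("synced"));
}
```

Note that each pass only takes the read lock on `status` for the instant it takes to inspect it, which is why a long-held write lock elsewhere is enough to stall all three passes.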
So, it turns out that if the `CandidateBeaconNode::refresh_status` function from the routine update service is hogging the write lock, the 1st pass gets blocked whilst trying to read the status of the first node. Nodes that should be left until the 3rd pass therefore block the progress of the 1st and 2nd passes, hence the behaviour described in #2926.

## Additional Info

NA
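A minimal sketch of the fix's shape, using `std::sync::RwLock` and threads rather than Lighthouse's async types, with hypothetical names: perform the slow poll with no lock held, then take the write lock only for the cheap status assignment, so readers are never stuck behind a network timeout:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical refresh that avoids hogging the lock: the (slow) poll of
/// the beacon node happens with no lock held; the write lock is acquired
/// only for the instant it takes to store the result.
fn refresh_status(status: &RwLock<bool>) {
    // Stand-in for a network poll that may run for a long timeout.
    thread::sleep(Duration::from_millis(200));
    let polled_online = true;
    // Write lock held only briefly, after the poll has finished.
    *status.write().unwrap() = polled_online;
}

fn main() {
    let status = Arc::new(RwLock::new(false));
    let refresher = {
        let status = Arc::clone(&status);
        thread::spawn(move || refresh_status(&status))
    };
    thread::sleep(Duration::from_millis(50)); // let the poll get underway
    // A "1st pass" read no longer has to wait out the poll's timeout.
    let start = Instant::now();
    let _ = *status.read().unwrap();
    assert!(start.elapsed() < Duration::from_millis(150));
    refresher.join().unwrap();
}
```

Had `refresh_status` taken the write lock before sleeping, the read in `main` would have blocked for the full duration of the poll, which is the starvation this PR removes.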
Pull request successfully merged into unstable. Build succeeded.
**bors** bot changed the title from *Avoid hogging the fallback status lock in the VC* to *[Merged by Bors] - Avoid hogging the fallback status lock in the VC* on Feb 22, 2022