Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve peer performance for NAT'd nodes #5345

Merged
merged 4 commits into from
Mar 7, 2024

Conversation

AgeManning
Copy link
Member

The best way for lighthouse to manage peers is via pruning excess peers in order to get the best possible steady-state selection of peers long-term.

This process has always been reserved for nodes that have correctly port-forwarded their lighthouse node. Nodes that are behind NAT's never get excess peers and suffer from not having this peer selection logic.

Often these nodes will see InsufficientPeers warning occasionally because their peer set has not been optimized.

This PR adds logic to allow discovery to search for more peers than necessary if the inbound peer count is small (which indicates the lighthouse node is unreachable globally). By searching for and subsequently dialing excess peers, the pruning logic will now also apply for these lighthouse nodes giving them a better set of peers.

@AgeManning AgeManning added the ready-for-review The code is ready for review label Mar 4, 2024
@AgeManning AgeManning requested review from jxs and pawanjay176 March 4, 2024 06:28
@AgeManning
Copy link
Member Author

The downside is that after pruning, this logic won't be triggered again until a peer disconnects or the peer count falls below the target.
This will help NAT'd peers, as I notice disconnects happening all the time. The best solution is to port-forward correctly.

Open to suggestions if people want to invoke discovery more often, but i'm conscious of having it over-used, it's very noisy.

Copy link
Member

@pawanjay176 pawanjay176 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the root cause of our issues, but should help non-port forwarded peers nevertheless.

@@ -61,6 +61,8 @@ pub const MIN_OUTBOUND_ONLY_FACTOR: f32 = 0.2;
/// limit is 55, and we are at 55 peers, the following parameter provisions a few more slots of
/// dialing priority peers we need for validator duties.
pub const PRIORITY_PEER_EXCESS: f32 = 0.2;
/// The number of inbound peers that are connected that indicate this node is not behind a NAT.
pub const INBOUND_PEERS_NAT: usize = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 5? How's the node getting any inbound peers at all if it does not have a reachable address?!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the node may still be reached by peers that are public, i.e. the node dials a public peer so the router adds that public address to it's table and therefore it may be dialed from them in the future right?

Copy link
Member

@jxs jxs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the logic seems sound, just have the same question as Pawan

@@ -61,6 +61,8 @@ pub const MIN_OUTBOUND_ONLY_FACTOR: f32 = 0.2;
/// limit is 55, and we are at 55 peers, the following parameter provisions a few more slots of
/// dialing priority peers we need for validator duties.
pub const PRIORITY_PEER_EXCESS: f32 = 0.2;
/// The number of inbound peers that are connected that indicate this node is not behind a NAT.
pub const INBOUND_PEERS_NAT: usize = 5;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the node may still be reached by peers that are public, i.e. the node dials a public peer so the router adds that public address to it's table and therefore it may be dialed from them in the future right?

@AgeManning
Copy link
Member Author

Yeah I came here to change it. The original number was just something I made up. It was some threshold.
I was worried about the case that perhaps the NAT was open at one point in time and later closed, maybe UPnP or something. Previous peers would slowly disconnect, but some may stay long-term.

I think the better logic now is to just always discover up to max_peers, regardless of inbound peers. We have been seeing cases of non-NAT'd peers not reaching the excess, so might as well always dial up to the max and let the peer manager prune down.

@AgeManning
Copy link
Member Author

Ok, diff is really small now, but impact could be substantial

@AgeManning AgeManning added ready-for-merge This PR is ready to merge. and removed ready-for-review The code is ready for review labels Mar 7, 2024
@AgeManning
Copy link
Member Author

@Mergifyio queue

Copy link

mergify bot commented Mar 7, 2024

queue

🛑 Command queue cancelled because of a new queue command with different arguments

Copy link

mergify bot commented Mar 7, 2024

queue

🟠 The pull request is the 2nd in the queue to be merged

#5345 is queued for merge.

Required conditions of queue default for merge:

  • check-success=local-testnet-success
  • check-success=test-suite-success

Required conditions to stay in the queue:

  • -draft [📌 queue requirement]
  • any of: [📌 queue -> configuration change requirements]
    • -mergify-configuration-changed
    • check-success=Configuration changed
  • any of: [🔀 queue conditions]
    • all of: [📌 queue conditions of queue default]
      • #approved-reviews-by >= 1
      • check-success=license/cla
      • check-success=target-branch-check

Visit the Mergify Dashboard to check the state of the queue default.

@pawanjay176
Copy link
Member

@Mergifyio dequeue

Copy link

mergify bot commented Mar 7, 2024

dequeue

✅ The pull request has been removed from the queue default

@pawanjay176
Copy link
Member

@Mergifyio requeue

Copy link

mergify bot commented Mar 7, 2024

requeue

✅ This pull request will be re-embarked automatically

The followup queue command will be automatically executed to re-embark the pull request

Copy link

mergify bot commented Mar 7, 2024

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at de91c77

@mergify mergify bot merged commit de91c77 into sigp:unstable Mar 7, 2024
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-for-merge This PR is ready to merge.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants