Skip to content

[FIX] Fix retry config option #827

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 3, 2020
Merged

Conversation

KonradStaniec
Copy link
Contributor

Description

Our PeerActor uses this option to retry outgoing connection when initial try fails. Setting it to 1 minute means that this actor sits and do nothing for 1 minute, while taking PeerManagerActor outgoing connection slot. It can slow down acquiring new connections. In our other project we also tweaked that option.

In longer perspective we should probably delete this mechanism, and make PeerManagerActor one thing responsible for initiating connections and re-connection.

Copy link
Contributor

@aakoshh aakoshh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5 seconds seems very low for anything to change in circustances. I see the other project modified it to 30 seconds and that might work better with geth which limits incoming connection attempts from an IP to 30 seconds as well.

Another option would be to set this to 30 seconds and connect-max-retries to 0, so by default it won't waste time since we know the most probable reason for not connecting is TooManyPeers and just want to move on quickly to another node.

@aakoshh
Copy link
Contributor

aakoshh commented Dec 3, 2020

Okay I see that if the remote peer actually disconnected then this retry doesn't kick in. So it only applies if we failed to reach the other node completely?

Copy link
Contributor

@aakoshh aakoshh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on RLPxConnectionHandler it seems like the retry logic only kicks in if connection or sending a message fails, not when a disconnect message is received.

I think 5 seconds may be too low; perhaps it should be differentiated between initial connection failing vs a write failing to an already established connection, so you can quickly dismiss false leads and try to give more time to recover from temporary glitches with nodes that worked before but are perhaps restarting. But if this has a positive impact then I don't see much harm in it.

@KonradStaniec
Copy link
Contributor Author

This reconnect kicks in when rlpx connection failed either:

  • during trying to establish it or
  • if we had conneciton established and it died.

If we received disconnect then we then we just finish up.

One case it maybe useful to have the reconnect set to 1 is when we had connection established and it died (lets say remote peer restarted), then instead of going to PeerManagerActor we just re-connect from PeerActor level, although I am not sure that this case is justification for slowing down the connection establishment.

Thats why i thought 5s is good middle ground i.e

  • in case of lost connection we will try to quickly re-establish it
  • during initial establishment we will go quickly over failure

Overall as i mention whole mechanism should be improved a little, but taking into account upcoming release this change semed as low hanging fruit to speed up peer connection establishment.

@KonradStaniec KonradStaniec merged commit 23cad31 into develop Dec 3, 2020
@KonradStaniec KonradStaniec deleted the fix/fix-retry-config-option branch December 3, 2020 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants