-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(p2p): reconnect to peer after disconnection #695
Conversation
Use cases that should work. For all of the A connects to B in advance.
result: A should try to reconnect to B until B is back. currently: |
@offerm In regards to |
@moshe, Not sure why not to fix all related issues together. For 2,3 the fact that you try once does not give any value since the peer is already down. The impact is as if we are doing nothing. IMHO, just fix it.
I also don't understand the definition of |
Added the connection retries fix here, so you can check it on this branch.
The debate here is whether to reconnect after we actively closed the socket, or just to relax the stalling penalty, and keep the socket to allow recovery. If we chose the former (disconnect and then reconnect), we just need to remove |
@offerm you are consistently mentioning the wrong Moshe (: |
I don't understand this at all:
Are we closing the connection beside when we shutdown? what is the most common scenario?
I see the following two problem:
|
we close the connection if the peer is stalling response to
If B had a shutdown, how come he's still running? And if A won't notice that because there's no connection between them, A should disconnect from B because B would be stalling
The new connection from the retries will be closed if a connection with the public key already exists. Is that a problem? We can add some logic to cut off the retry timeout before attempting a connection, but I don't think it's a big deal. |
I have no idea why B is still running after If A should connect to B, A should close current socket and reconnect B after any of the following:
A should also attempt to connect in a loop (1,5,10,20,40,80 ,etc) if the initial connection to B failed. according to my current understanding, the only reason not to do that is if B is banned or if B is not working according to the XUD protocol. Communication failure as part of the handshake should also trigger reconnect. Please let me know if you don't agree to the above so we can discuss it. |
@moshababo here is the log section showing the problem
remote peer initiate the connection why local is waiting 40 seconds
No local peer wake-up and initiate connection to remote peer
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving because the changes look good to me as well as the separate fix for reconnections. I'm going to look into the shutdown issue now. If there's another issue based on what offer is saying we can fix that in a followup, but I don't think it should stop this from being merged.
We can add Shutdown
as a disconnection reason to automatically attempt to reconnect, but I'd be on the fence. A node that is shutdown deliberately and gracefully will connect to us when it comes back online. Maybe we should only attempt reconnections for the Shutdown
reason if we ourselves are not listening for incoming connections?
It would be nice to squash this into two commits before merging for git history and changelog purposes, a feat
for the new reconnection logic and a fix
for the retry logic. You should be able to git rebase -i origin/master
and rearrange the commits and use fixup
for the review feedback commits.
If we deploy like its now, the test servers will not connect each other.
I don't think we should merge this before it is working well.
A node that is shutdown will connect to us only if it can do so. If we are
behind Firewall, it will not be able to. It should also need to know how.
This is elementary P2P, why not doing it right?
Take example from lnd please.
…On Fri, 23 Nov 2018 at 7:42 Daniel McNally ***@***.***> wrote:
***@***.**** approved this pull request.
Approving because the changes look good to me as well as the separate fix
for reconnections. I'm going to look into the shutdown issue now. If
there's another issue based on what offer is saying we can fix that in a
followup, but I don't think it should stop this from being merged.
We can add Shutdown as a disconnection reason to automatically attempt to
reconnect, but I'd be on the fence. A node that is shutdown deliberately
and gracefully will connect to us when it comes back online. Maybe we
should only attempt reconnections for the Shutdown reason if we ourselves
are not listening for incoming connections?
It would be nice to squash this into two commits before merging for git
history and changelog purposes, a feat for the new reconnection logic and
a fix for the retry logic. You should be able to git rebase -i
origin/master and rearrange the commits and use fixup for the review
feedback commits.
------------------------------
In lib/p2p/Pool.ts
<#695 (comment)>:
> if (peer.nodePubKey) {
this.pendingOutgoingConnections.delete(peer.nodePubKey);
this.peers.remove(peer.nodePubKey);
}
this.emit('peer.close', peer);
+
+ // if handshake passed and peer disconnected from us for stalling or without specifying any reason -
+ // reconnect, for that might have been due to a temporary loss in connectivity
+ const legitReconnect =
I see what you're saying. It's more like whether this disconnection
happened deliberately, or due to temporary network issues. Maybe
unintentionalDisconnect or unintentionalClose might be appropriate. But
it's clear enough given the comments so I'm fine with whatever you prefer.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#695 (review)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AJQ0cr87_IgruyzfS7Ia35ZsV7WqqTwwks5ux4ragaJpZM4YrBbU>
.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See discussion.
@offerm i'm not following the fixes you're requesting. From your logs describing the problem, what I can see is that while node A was doing connection retries, node B connected to A. So when A also connected to B, this connection was dropped, because it was duplicated. It's what I described here:
If i'm missing something, let me know. But anyway this is not related to reconnecting after disconnection, but to the connection retries. So the only problem that I can see now is that if both nodes will experience stalling from their perspective (due to loss of connectivity), they will both close the socket and neither will reconnect. We can fix this easily, but I still not sure that this is the problem you described earlier. |
The problem is that the end result is that there is no connection between A
and B. When A try to reconnect with already connected B it will close the
connection and A and B will not be connected.
…On Fri, 23 Nov 2018 at 9:56 Moshe Shababo ***@***.***> wrote:
@offerm <https://github.com/offerm> i'm not following the fixes you're
requesting.
From your logs describing the problem, what I can see is that while node A
was doing connection retries, node B connected to A. So when A also
connected to B, this connection was dropped, because it was duplicated.
It's what I described here:
The new connection from the retries will be closed if a connection with
the public key already exists. Is that a problem? We can add some logic to
cut off the retry timeout before attempting a connection, but I don't think
it's a big deal.
If i'm missing something, let me know. But anyway this is not related to
reconnecting after disconnection, but to the connection retries.
So the only problem that I can see now is that if both nodes will
experience stalling from their perspective (due to loss of connectivity),
they will both close the socket and neither will reconnect. We can fix this
easily, but I still not sure that this is the problem you described earlier.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#695 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AJQ0cjKGtC6IJf2G_Bq3wPswDPnlAjGyks5ux6ongaJpZM4YrBbU>
.
|
I'm not following, is that due to something introduced in this PR? If not it shouldn't prevent this from being merged.
That's why I suggested to reconnect after
This sounds like a bug if so. Maybe we need to be checking if we're already connected to a peer before we retry a connection. I can open a PR for this for the existing logic. |
I actually can't find any mechanism by which reconnecting will close an existing connection - and it's not so simple to check before retrying a connection if we already have a |
Yes, problem created by this PR.
Consider A and B. Both try to connect on startup. This is the situation
with the test servers. Test1 attempts to connect to test2, fail, goes to
sleep until next retry. While sleeping test2 starts and connects to test1.
When test1 wakes up and try to connect test2 rejects and close the
connection. Now they are not connected and no retry takes place.
…On Fri, 23 Nov 2018 at 17:15 Daniel McNally ***@***.***> wrote:
I actually can't find any mechanism by which reconnecting will close an
existing connection - and it's not so simple to check before retrying a
connection if we already have a Peer with a duplicate pubkey because each
peer does not know about other peers, but we do check for duplicate pubkeys
in handleOpen in Pool.ts. So this may require further investigation.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#695 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AJQ0cmOErgSa_Gm5eWAe_K4Fawgf0Nwlks5uyB8zgaJpZM4YrBbU>
.
|
@offerm this was an old bug with the connection retries. I missed it because I checked that the connection exists, which it did, but the peer was removed from the peer list. I just pushed a fix for it. Let me know if there's something missing for the reconnecting after disconnection feature. |
@moshababo just checked this branch again. My XUD doesn't try to reconnect after I turned off wifi and back on after a minute. Is this working for you? Here is the log:
|
looks like the problem is due to pool.ts:665. please check also, I see that you only reconnect if there is no disconnect reason or reason is ResponseStalling. Why not reconnect for What I suggested in the past is that we will reset the reconnect timer only after successful reconnect. This way, if we got Banned we will try to reconnect with a growing gap between each attempt. Still pending the case when A should connect to B and B should connect to A and B is killed and restarted. |
Please update us on the status here @moshababo . We ran into the same issue while testing here on our unstable Chinese connections and it makes testing real difficult. |
I didn't fix this yet on purpose so we can isolate this for the other problems. See my comment here. @offerm you say you want to reconnect on all cases, but I don't understand why we'll want to have a connect-disconnect loop, which will happen if we're banned or with incompatible version. I removed the line which disables reconnecting from the side which closed the socket. Please try again. |
Better now. Reconnect after wifi error is working [again] Also better on server side. Still not perfect:
The reason why I think we should always reconnect is to avoid problems like the one above. If we keep it simple it will work better. We should reset the counter only when we made a good connection. So if there is a peer, for which we initiate the connection, and is not properly setup, we will try to connect every 5 minutes. Don't see a problem with this. I also think we should reconnect only for outgoing peers. No need to reconnect incoming peers. |
Ok, this is taking too long. Offer made some good points, I made an "improvement" offshoot: #721 But since this improves the situation quite a bit and we need it urgently for our testing, I'd prefer to get it merged soon. |
@offerm let's finalize this for this milestone. I returned the check on |
@moshababo which tests scenarios are covered by this PR? wifi disconnect? |
Saw the problem now. I'll fix it for #721 because it could depend on other things that we might want to do there. @sangaman I made some changes to |
We can't merge it like this as it will cause the test servers not to be connected and maybe some other effects. |
I have checked the simnet environment and I'm OK with merge. |
@offerm Thanks for checking. I added the fix for the peer kill scenario that you described, so we won't have this leftover. From what I understood, it was just because we didn't support connection retries when reconnecting to an inbound peer, because we didn't had his For the future, let's distinct between |
@moshababo I'm not sure I follow your fix. Here is what should work: What should not happen: Try to describe the change in this language so it will be easier to understand. Let us also know when tests have you done. |
You said this should not happen, but at the moment B will reconnect to A if A advertised addresses. We didn't limit reconnecting to outbound connections yet.
This works, but mind that after step 4, B might be the one that will connect to A (and not vise versa), in case he had A on his nodes list. So you'll end up with A having B as an inbound. |
Closes #616, Closes #698.
If peer got disconnected from us post-handshake due to stalling or without specifying any reason, reconnect to him in case we have a listening address of him.