
feat(p2p): reconnect to peer after disconnection #695

Merged: 15 commits merged into master on Dec 4, 2018

Conversation

moshababo
Collaborator

@moshababo moshababo commented Nov 20, 2018

Closes #616, Closes #698.

If a peer got disconnected from us post-handshake due to stalling, or without specifying any reason, we reconnect to it if we have a listening address for it.
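Roughly, the idea is that the pool reacts to a post-handshake socket close by re-dialing the peer's known listening address. A minimal sketch of that flow (illustrative only; the enum values, `PeerInfo` shape, and event names are assumptions, not the actual Pool.ts/Peer.ts code):

```ts
import { EventEmitter } from 'events';

// Illustrative reason codes; the real xud enum may differ.
enum DisconnectionReason { ResponseStalling = 1, Shutdown = 2 }

type PeerInfo = {
  nodePubKey: string;
  lastAddress?: string; // listening address we know for this peer, if any
  sentDisconnectionReason?: DisconnectionReason; // reason we gave when we closed the socket
};

class PoolSketch extends EventEmitter {
  // Called whenever a peer's socket closes after the handshake completed.
  public handlePeerClose = (peer: PeerInfo) => {
    const reason = peer.sentDisconnectionReason;
    const reconnectable = reason === undefined || reason === DisconnectionReason.ResponseStalling;
    if (reconnectable && peer.lastAddress) {
      // Hand off to the (retrying) connect logic; see the retry sketch further down.
      this.emit('reconnect', peer.nodePubKey, peer.lastAddress);
    }
  };
}
```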

@moshababo moshababo requested review from a user, sangaman and offerm November 20, 2018 12:40
@ghost ghost assigned moshababo Nov 20, 2018
@ghost ghost added the in progress label Nov 20, 2018
lib/p2p/Peer.ts, lib/p2p/Pool.ts: resolved review comments (outdated)
@offerm
Contributor

offerm commented Nov 21, 2018

Use cases that should work. In all of them, A connects to B in advance.

  1. Disconnect the wifi/network cable between A and B
  2. B is shut down
  3. B is killed (kill -9)

Expected result: A should try to reconnect to B until B is back.

Currently:
1 - no reconnect at all
2, 3 - a one-time reconnect attempt.

@moshababo
Collaborator Author

@offerm 2 and 3 are due to a previous bug in the connection retry logic, which was introduced with pendingOutgoingConnections. It's not related to this task, so I opened a separate issue (#698).

Regarding 1: if a node goes offline, it will perceive the other node as stalling and will disconnect from it. The other node should reconnect when it hears about that disconnection.
But if the other node is offline as well, or if it's online but perceives the offline node as stalling before it hears about the disconnection, then neither will try to reconnect.
Looks like this is an edge case of what's defined in #616. We might need to try to reconnect from the disconnecting side as well. @kilrau

@offerm
Contributor

offerm commented Nov 21, 2018

@moshe, Not sure why not to fix all related issues together. For 2 and 3, the fact that you try once doesn't give any value since the peer is already down; the impact is as if we were doing nothing. IMHO, just fix it.

1 is not an edge case. I'm not sure how you define "offline": if A and B are connected and A is "offline", then B is also "offline" with regard to A.
What I fail to understand is why the handling is different if you hear about a disconnection versus just getting disconnected. In both cases, if you need to be connected, reconnect.

I also don't understand the definition of the disconnecting side. There are incoming and outgoing connections, and if an outgoing connection dropped we need to reconnect.

@moshababo
Collaborator Author

moshababo commented Nov 21, 2018

Added the connection retries fix here, so you can check it on this branch.

The disconnecting side is the side which initiated the disconnection by actively closing the socket. You're right that from each node's perspective, if one is offline, the other is as well, so both will perceive message stalling.

The debate here is whether to reconnect after we actively closed the socket, or just to relax the stalling penalty and keep the socket to allow recovery. If we choose the former (disconnect and then reconnect), we just need to remove Pool.ts line 657.

@moshe

moshe commented Nov 21, 2018

@offerm you are consistently mentioning the wrong Moshe (:
You should mention moshababo instead of me.

@offerm
Contributor

offerm commented Nov 21, 2018

I don't understand this at all:

The debate here is whether to reconnect after we actively closed the socket, or just to relax the stalling penalty and keep the socket to allow recovery. If we choose the former (disconnect and then reconnect), we just need to remove Pool.ts line 657.

Are we closing the connection other than when we shut down? What is the most common scenario?

1, 2, 3 look good now (from my limited testing).

I see the following two problems:

  • If I stop B with xucli shutdown, A won't notice that at all and will not try to reconnect (B is still running)
  • If A is trying to reconnect to B after 5, 10, 20, 40, 80 seconds, and during the wait time B connects to A, the connection will be closed.

@moshababo

@moshababo
Collaborator Author

Are we closing the connection other than when we shut down? What is the most common scenario?

We close the connection if the peer is stalling its response to a ping.
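For readers following along, stall detection of this kind is essentially a response timer around each ping; a simplified sketch (the timeout constant and method names are assumptions, not the actual Peer.ts code):

```ts
import { Socket } from 'net';

const STALL_TIMEOUT_MS = 30_000; // assumed value, for illustration only

class PeerStallWatcher {
  private stallTimer?: NodeJS.Timeout;

  constructor(private socket: Socket) {}

  // Arm (or re-arm) the timer when we send a ping and expect a response.
  public expectResponse = () => {
    if (this.stallTimer) clearTimeout(this.stallTimer);
    this.stallTimer = setTimeout(() => {
      // No response within the window: treat the peer as stalling and close the socket.
      this.socket.destroy(new Error('peer is stalling'));
    }, STALL_TIMEOUT_MS);
  };

  // Disarm the timer when the expected response (e.g. a pong) arrives.
  public handleResponse = () => {
    if (this.stallTimer) {
      clearTimeout(this.stallTimer);
      this.stallTimer = undefined;
    }
  };
}
```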

If I stop B with xucli shutdown, A won't notice that at all and will not try to reconnect (B is still running)

If B was shut down, how come it's still running? And if A doesn't notice because there's no connection between them, A should disconnect from B, because B would be stalling its ping response.

If A is trying to reconnect to B after 5, 10, 20, 40, 80 seconds, and during the wait time B connects to A, the connection will be closed.

The new connection from the retries will be closed if a connection with that public key already exists. Is that a problem? We can add some logic to cancel the retry timeout before attempting a connection, but I don't think it's a big deal.

@offerm
Contributor

offerm commented Nov 22, 2018

@moshababo:

I have no idea why B is still running after xucli shutdown. I opened an issue for this: #699. For some reason the process is still running and the connection to A is still working normally, so A doesn't see B as stalling.

If A should connect to B, A should close the current socket and reconnect to B after any of the following:

  • B stopped with xucli shutdown
  • B stopped with Kill
  • B stopped with Kill -9
  • B stopped for any reason
  • B is not responding to ping
  • A has a connectivity problem
  • The IP route between A and B is not working for any reason.

A should also attempt to connect in a loop (1, 5, 10, 20, 40, 80, etc.) if the initial connection to B failed.
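A retry loop with delays like that could be sketched as follows (a hypothetical helper, not code from this PR; `connect` stands in for whatever dial function the pool uses, and the 5-minute cap reflects the suggestion later in this thread):

```ts
// Keep retrying an outgoing connection with a doubling delay (e.g. 5s, 10s, 20s, 40s, 80s, ...).
const connectWithRetries = async (
  connect: () => Promise<void>,
  initialDelayMs = 5_000,
  maxDelayMs = 5 * 60_000, // cap the gap between attempts at 5 minutes
): Promise<void> => {
  let delay = initialDelayMs;
  while (true) {
    try {
      await connect();
      return; // success: the caller can reset its state; retries start fresh next time
    } catch (err) {
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * 2, maxDelayMs);
    }
  }
};
```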

According to my current understanding, the only reason not to do that is if B is banned or if B is not working according to the XUD protocol. A communication failure as part of the handshake should also trigger a reconnect.

Please let me know if you don't agree with the above so we can discuss it.
(I will update about the second problem).

@offerm
Contributor

offerm commented Nov 22, 2018

@moshababo here is the log section showing the problem
At the end, there is no connection between the peers.

11/22/2018, 10:16:51 AM [P2P] debug: Connection attempt #3 to peer (xud2.test.exchangeunion.com:8885) failed: connect ECONNREFUSED 35.231.171.148:8885. retrying in 20 sec...
11/22/2018, 10:17:11 AM [P2P] debug: Connection attempt #4 to peer (xud2.test.exchangeunion.com:8885) failed: connect ECONNREFUSED 35.231.171.148:8885. retrying in 40 sec...

The remote peer initiates the connection while the local node is waiting 40 seconds:

11/22/2018, 10:17:19 AM [P2P] debug: Connected pre-handshake to peer 35.231.171.148:48390
11/22/2018, 10:17:19 AM [P2P] verbose: received hello packet from 35.231.171.148:48390: {"version":"1.0.0-alpha.4","pairs":["LTC/BTC"],"nodePubKey":"028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a","lndbtcPubKey":"03d6747c9ff24aa8b025027e30556897543642910cbdec066d54f86841ab80966b","lndltcPubKey":"02ee4ff018591e42bd4a01dcc0e0a18c41f95640db2005d796492e08fb47b40bb5","addresses":[]}
11/22/2018, 10:17:19 AM [P2P] verbose: opened connection to 028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a at 35.231.171.148:48390
11/22/2018, 10:17:19 AM [P2P] verbose: received 0 orders from 028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a
11/22/2018, 10:17:19 AM [P2P] verbose: received 0 nodes (0 new) from 028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a

Now the local peer wakes up and initiates a connection to the remote peer:

11/22/2018, 10:17:51 AM [P2P] debug: Connected pre-handshake to peer xud2.test.exchangeunion.com:8885
11/22/2018, 10:17:51 AM [P2P] verbose: received hello packet from xud2.test.exchangeunion.com:8885: {"version":"1.0.0-alpha.4","pairs":["LTC/BTC"],"nodePubKey":"028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a","lndbtcPubKey":"03d6747c9ff24aa8b025027e30556897543642910cbdec066d54f86841ab80966b","lndltcPubKey":"02ee4ff018591e42bd4a01dcc0e0a18c41f95640db2005d796492e08fb47b40bb5","addresses":[]}
11/22/2018, 10:17:51 AM [ORDERBOOK] debug: removed all orders for peer 028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a
11/22/2018, 10:17:51 AM [P2P] info: Completed start-up connections to known peers.
11/22/2018, 10:17:51 AM [P2P] info: Peer 028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a socket closed

Collaborator

@sangaman sangaman left a comment


Approving because the changes look good to me, as does the separate fix for reconnections. I'm going to look into the shutdown issue now. If there's another issue based on what offerm is saying, we can fix that in a follow-up, but I don't think it should stop this from being merged.

We can add Shutdown as a disconnection reason to automatically attempt to reconnect, but I'd be on the fence. A node that is shutdown deliberately and gracefully will connect to us when it comes back online. Maybe we should only attempt reconnections for the Shutdown reason if we ourselves are not listening for incoming connections?
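Expressed as code, that suggestion would be a condition roughly like the following (a sketch of the idea only; the parameter names and reason values are assumptions):

```ts
enum DisconnectionReason { ResponseStalling = 1, Shutdown = 2 }

// Should we reconnect after receiving this disconnection reason from a peer?
const shouldReconnectOnRecvReason = (
  recvReason: DisconnectionReason | undefined,
  weAreListening: boolean,
): boolean => {
  if (recvReason === undefined || recvReason === DisconnectionReason.ResponseStalling) return true;
  // A gracefully shut-down peer will reconnect to us when it comes back online,
  // but only if we are reachable; otherwise we take responsibility for reconnecting.
  if (recvReason === DisconnectionReason.Shutdown) return !weAreListening;
  return false;
};
```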

It would be nice to squash this into two commits before merging, for git history and changelog purposes: a feat commit for the new reconnection logic and a fix commit for the retry logic. You should be able to run git rebase -i origin/master, rearrange the commits, and use fixup for the review-feedback commits.

lib/p2p/Pool.ts: resolved review comment (outdated)
@offerm
Contributor

offerm commented Nov 23, 2018 via email

Contributor

@offerm offerm left a comment


See discussion.

@sangaman sangaman added the p2p Peer to peer networking label Nov 23, 2018
@moshababo
Collaborator Author

@offerm I'm not following the fixes you're requesting.

From your logs describing the problem, what I can see is that while node A was doing connection retries, node B connected to A. So when A also connected to B, that connection was dropped because it was a duplicate.

It's what I described here:

The new connection from the retries will be closed if a connection with that public key already exists. Is that a problem? We can add some logic to cancel the retry timeout before attempting a connection, but I don't think it's a big deal.

If I'm missing something, let me know. But in any case this is not related to reconnecting after a disconnection, but to the connection retries.

So the only problem I can see now is that if both nodes experience stalling from their perspective (due to loss of connectivity), they will both close the socket and neither will reconnect. We can fix this easily, but I'm still not sure that this is the problem you described earlier.

@offerm
Contributor

offerm commented Nov 23, 2018 via email

@sangaman
Collaborator

sangaman commented Nov 23, 2018

If we deploy it like it is now, the test servers will not connect to each other.

I'm not following; is that due to something introduced in this PR? If not, it shouldn't prevent this from being merged.

If we are behind a firewall, it will not be able to.

That's why I suggested reconnecting after Shutdown if we are not listening for incoming connections. If we are listening for incoming connections, we should be reachable (otherwise we have other problems). But I'd be OK with always reconnecting after Shutdown too if that's what you prefer.

The problem is that the end result is that there is no connection between A and B. When A tries to reconnect to an already-connected B, it will close the connection, and A and B will not be connected.

If so, this sounds like a bug. Maybe we need to check whether we're already connected to a peer before we retry a connection. I can open a PR for this against the existing logic.

@sangaman
Collaborator

I actually can't find any mechanism by which reconnecting would close an existing connection - and it's not so simple to check, before retrying a connection, whether we already have a Peer with a duplicate pubkey, because each peer does not know about the other peers. But we do check for duplicate pubkeys in handleOpen in Pool.ts. So this may require further investigation.
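For context, the duplicate-pubkey check being referred to is conceptually along these lines (a simplified sketch, not the actual handleOpen code; the map and cleanup logic are assumptions):

```ts
import { Socket } from 'net';

type ConnectedPeer = { nodePubKey: string; socket: Socket };

class PeerRegistry {
  private peers = new Map<string, ConnectedPeer>(); // keyed by node pubkey

  // Called once the handshake completes and the peer's pubkey is known.
  public handleOpenSketch = (peer: ConnectedPeer) => {
    if (this.peers.has(peer.nodePubKey)) {
      // Already connected to this node: drop the new, duplicate socket
      // rather than the established connection.
      peer.socket.destroy();
      return;
    }
    this.peers.set(peer.nodePubKey, peer);
    peer.socket.once('close', () => this.peers.delete(peer.nodePubKey));
  };
}
```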

@offerm
Contributor

offerm commented Nov 23, 2018 via email

@moshababo
Collaborator Author

@offerm this was an old bug with the connection retries. I missed it because I checked that the connection existed, which it did, but the peer had been removed from the peer list.

I just pushed a fix for it.

Let me know if there's something missing for the reconnecting after disconnection feature.

@offerm
Contributor

offerm commented Nov 28, 2018

@moshababo just checked this branch again.

My XUD doesn't try to reconnect after I turned the wifi off and back on after a minute.

Is this working for you?

Here is the log:

11/28/2018, 3:47:14 PM [ORDERBOOK] debug: order added: {"pairId":"LTC/BTC","price":1.13,"quantity":0.0025,"isBuy":false,"hold":0,"id":"dc996580-f2d0-11e8-aa46-f92f730b7e83","initialQuantity":0.0025,"peerPubKey":"02b66438730d1fcdf4a4ae5d3d73e847a272f160fee2938e132b52cab0a0d9cfc6","createdAt":1543412834946}
11/28/2018, 3:52:54 PM [P2P] error: peer error (028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a): Peer (028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a) is stalling (e1cb9070-f314-11e8-9db3-d974557a541f)
11/28/2018, 3:52:54 PM [ORDERBOOK] debug: removed all orders for peer 028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a
11/28/2018, 3:52:54 PM [P2P] debug: closing socket with peer 028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a. reason: ResponseStalling
11/28/2018, 3:52:54 PM [P2P] error: peer error (03fd337659e99e628d0487e4f87acf93e353db06f754dccc402f2de1b857a319d0): Peer (03fd337659e99e628d0487e4f87acf93e353db06f754dccc402f2de1b857a319d0) is stalling (e1cb9071-f314-11e8-9db3-d974557a541f)
11/28/2018, 3:52:54 PM [P2P] debug: closing socket with peer 03fd337659e99e628d0487e4f87acf93e353db06f754dccc402f2de1b857a319d0. reason: ResponseStalling
11/28/2018, 3:52:54 PM [P2P] error: peer error (02b66438730d1fcdf4a4ae5d3d73e847a272f160fee2938e132b52cab0a0d9cfc6): Peer (02b66438730d1fcdf4a4ae5d3d73e847a272f160fee2938e132b52cab0a0d9cfc6) is stalling (e1cbde90-f314-11e8-9db3-d974557a541f)
11/28/2018, 3:52:54 PM [P2P] debug: closing socket with peer 02b66438730d1fcdf4a4ae5d3d73e847a272f160fee2938e132b52cab0a0d9cfc6. reason: ResponseStalling
11/28/2018, 3:52:54 PM [ORDERBOOK] debug: removed all orders for peer 03fd337659e99e628d0487e4f87acf93e353db06f754dccc402f2de1b857a319d0
11/28/2018, 3:52:54 PM [ORDERBOOK] debug: removed all orders for peer 02b66438730d1fcdf4a4ae5d3d73e847a272f160fee2938e132b52cab0a0d9cfc6
11/28/2018, 3:52:54 PM [P2P] info: Peer 02b66438730d1fcdf4a4ae5d3d73e847a272f160fee2938e132b52cab0a0d9cfc6 socket closed
11/28/2018, 3:52:54 PM [P2P] info: Peer 03fd337659e99e628d0487e4f87acf93e353db06f754dccc402f2de1b857a319d0 socket closed
11/28/2018, 3:52:54 PM [P2P] info: Peer 028599d05b18c0c3f8028915a17d603416f7276c822b6b2d20e71a3502bd0f9e0a socket closed

@offerm
Contributor

offerm commented Nov 28, 2018

Looks like the problem is due to pool.ts:665. Please check.

Also, I see that you only reconnect if there is no disconnect reason or the reason is ResponseStalling. Why not reconnect for NotAcceptingConnections and Shutdown?
IMHO we should also do it for ALL the other reasons. For example, if A should connect to B and B disconnects with IncompatibleProtocolVersion, B should probably be updated; we should keep trying to connect until success. Same with Banned - at some point we will get out of this situation, right? At that point we should reconnect.

What I suggested in the past is that we reset the reconnect timer only after a successful reconnect. This way, if we got Banned we will keep trying to reconnect with a growing gap between each attempt.

Still pending: the case where A should connect to B, B should connect to A, and B is killed and restarted.

@kilrau
Contributor

kilrau commented Nov 29, 2018

Please update us on the status here @moshababo. We ran into the same issue while testing on our unstable Chinese connections, and it makes testing really difficult.

@moshababo
Collaborator Author

I didn't fix this yet on purpose, so we can isolate it from the other problems. See my comment here.

@offerm you say you want to reconnect in all cases, but I don't understand why we'd want a connect-disconnect loop, which is what will happen if we're banned or have an incompatible version.
We don't have a timer for the reconnecting; there is one only for the connection retries, when the target is offline. If you want to add support for that, we can open a separate issue.

I removed the line which disables reconnecting from the side which closed the socket. Please try again.

@offerm
Contributor

offerm commented Nov 29, 2018

Better now. Reconnecting after a wifi error is working [again].

Also better on the server side.

Still not perfect:

  • A will not reconnect to B when B goes down via xucli shutdown
  • If A and B are connected (both can initiate the connection) and you take B down (kill), start B, wait a minute, then kill it again ----> A will not attempt to reconnect (the reason is not cleared).
  • The reconnection process will stop after 168 hours for some reason.
  • The max time between reconnection attempts should be (IMHO) 5 minutes.

The reason I think we should always reconnect is to avoid problems like the ones above. If we keep it simple it will work better. We should reset the counter only when we have made a good connection. So if there is a peer for which we initiate the connection and which is not properly set up, we will try to connect every 5 minutes. I don't see a problem with this.

I also think we should reconnect only to outgoing peers. No need to reconnect to incoming peers.

@kilrau
Contributor

kilrau commented Dec 3, 2018

Ok, this is taking too long. Offer made some good points, so I made an "improvement" offshoot: #721

But since this improves the situation quite a bit and we need it urgently for our testing, I'd prefer to get it merged soon.

@moshababo
Collaborator Author

@offerm let's finalize this for this milestone. I restored the check on sentDisconnectionReason because it doesn't make sense to reconnect in every case (and it was also failing our tests). I couldn't think of any reasonable case besides ResponseStalling, but we can discuss this later. This is not related to recvDisconnectionReason, which is a different story.
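So, as merged, the decision on the disconnecting side roughly reduces to the check below (a sketch only; the field names follow the discussion rather than the final code):

```ts
enum DisconnectionReason { ResponseStalling = 1 }

// Reconnect only if we closed the socket without giving a reason, or because the peer
// was stalling; any other reason we sent (e.g. a ban) suppresses reconnection.
const shouldReconnectAfterWeDisconnected = (
  sentDisconnectionReason: DisconnectionReason | undefined,
  hasKnownAddress: boolean,
): boolean =>
  hasKnownAddress &&
  (sentDisconnectionReason === undefined ||
    sentDisconnectionReason === DisconnectionReason.ResponseStalling);
```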

@offerm
Contributor

offerm commented Dec 3, 2018

@moshababo which test scenarios are covered by this PR?

wifi disconnect?
peer killed?
peer shutdown?

@moshababo
Collaborator Author

If A and B are connected (both can initiate the connection) and you take B down (kill), start B, wait a minute, then kill it again ----> A will not attempt to reconnect (the reason is not cleared).

I saw the problem now. I'll fix it as part of #721, because it could depend on other things that we might want to do there.

@sangaman I made some changes to Pool.spec.ts.
The first was to remove async from the main test wrapper; it was causing some problems with logging the results.
The second was commenting out the last test. It was causing the process to hang because of the reconnecting. Specifying the disconnection reason wasn't effective, because there was no real connection there. There are a few options for how to fix this, but I wanted to let you review it before I change anything.

@offerm
Contributor

offerm commented Dec 3, 2018

We can't merge it like this, as it will cause the test servers not to connect to each other, and maybe have some other effects.

@offerm
Contributor

offerm commented Dec 3, 2018

I have checked the simnet environment and I'm OK with the merge.
We will have problems once xucli shutdown works, but we can fix that as part of #721.

@moshababo
Collaborator Author

@offerm Thanks for checking.

I added the fix for the peer-kill scenario that you described, so we won't have this leftover. From what I understood, it happened because we didn't support connection retries when reconnecting to an inbound peer, since we didn't have its lastAddress, i.e. an address we had connected to. So what I did was to do the retries on the first address from its advertised list.
If this is not the problem you described, let me know.
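A sketch of that address selection (illustrative only; `lastAddress` and the advertised-address list mirror the terms used above, the rest is assumed):

```ts
type Address = { host: string; port: number };

// Pick the address to retry against: the address we previously dialed if we have one
// (outbound peers), otherwise the first address the peer advertised (inbound peers).
const pickReconnectAddress = (
  lastAddress: Address | undefined,
  advertisedAddresses: Address[],
): Address | undefined => lastAddress ?? advertisedAddresses[0];
```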

For the future, let's distinguish between reconnecting and connection retries, so we won't get confused about which feature/problem we're talking about.

@offerm
Contributor

offerm commented Dec 3, 2018

@moshababo I'm not sure I follow your fix.

Here is what should work:
A connects to B
B is killed
A tries to reconnect to B
B starts
A connects

What should not happen:
A connects to B
A is killed
B tries to reconnect to A

Try to describe the change in this language so it will be easier to understand.

Let us also know which tests you have done.

@moshababo
Collaborator Author

A connects to B
A is killed
B tries to reconnect to A

You said this should not happen, but at the moment B will reconnect to A if A advertised addresses. We haven't limited reconnecting to outbound connections yet.
My fix was for this scenario - if A advertised addresses and B reconnects, then it will also do connection retries on A's first advertised address. I'm not sure this was the problem you described, but I didn't find any other issue.

1/ A connects to B
2/ B is killed
3/ A tries to reconnect to B
4/ B starts
5/ A connects

This works, but mind that after step 4, B might be the one that connects to A (and not vice versa), in case it had A on its node list. So you'll end up with A having B as an inbound peer.

@sangaman sangaman merged commit 56fa2b4 into master Dec 4, 2018
@ghost ghost removed the in progress label Dec 4, 2018
@sangaman sangaman deleted the peer_reconnecting branch December 18, 2018 14:51