Fix error case in node connection procedure. #246


Merged: 1 commit merged into master from fix-node-monitoring on Jul 6, 2015

Conversation

qnikst
Contributor

@qnikst qnikst commented Jul 3, 2015

If creating a new connection to the remote EndPoint fails, then all
reliable connections (those used in d-p) have failed as well. As a
result we need to emit a `Died (node) Disconnect` event. Previously we
emitted a `Died typeOfReceiver Disconnect` event.
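The change described above can be sketched as a toy model (this is illustrative code, not the actual distributed-process internals; the `Identifier`, `Event`, and function names are simplified assumptions):

```haskell
-- Toy model of the fix: when creating a new connection to a remote
-- EndPoint fails, every reliable connection to that endpoint is gone,
-- so the disconnect event should name the whole node, not just the
-- intended receiver.

data Identifier
  = NodeId String       -- the remote node
  | ProcessId String    -- a process on that node (the typeOfReceiver case)
  deriving (Show, Eq)

data Reason = Disconnect deriving (Show, Eq)
data Event  = Died Identifier Reason deriving (Show, Eq)

-- Before the patch: blame only the intended receiver of the message.
onConnectFailureOld :: Identifier -> Identifier -> Event
onConnectFailureOld _node receiver = Died receiver Disconnect

-- After the patch: a failed connect means the whole bundle to the
-- remote endpoint is down, so the node itself is reported dead.
onConnectFailureNew :: Identifier -> Identifier -> Event
onConnectFailureNew node _receiver = Died node Disconnect

main :: IO ()
main = do
  let node = NodeId "nid://host:port:0"
      recv = ProcessId "pid://host:port:0:7"
  print (onConnectFailureOld node recv)  -- Died (ProcessId "pid://host:port:0:7") Disconnect
  print (onConnectFailureNew node recv)  -- Died (NodeId "nid://host:port:0") Disconnect
```

The difference is only in which identifier the `Died` event carries; with the patch, node-level monitors see the disconnect too.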
@mboes
Contributor

mboes commented Jul 3, 2015

This should come with a regression test.

@qnikst
Contributor Author

qnikst commented Jul 3, 2015

This patch fixes the test suite: it no longer randomly hangs (on TCP) and stops failing on in-memory. Does that count as a regression test, or do we need an explicit one?

@mboes
Contributor

mboes commented Jul 3, 2015

Which test(s) in the test suite? It ought to be possible to write a test that consistently hangs now and consistently does not hang with this patch.
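One possible shape for such a regression test, using the public distributed-process API (a sketch under assumptions: `deadNode` is a node made unreachable by the harness, the transport setup is omitted, and forcing the failed connect via `nsendRemote` is a guess at the simplest trigger):

```haskell
import Control.Distributed.Process

-- Monitor an unreachable node, force a connection attempt, and require
-- that a NodeMonitorNotification arrives within a timeout. Without the
-- patch the notification targets the wrong identifier and this blocks;
-- with the patch it should arrive as Died (node) Disconnect.
testNodeDiedOnConnectFailure :: NodeId -> Process ()
testNodeDiedOnConnectFailure deadNode = do
  ref <- monitorNode deadNode
  -- any send to the unreachable node forces a connect, which fails
  nsendRemote deadNode "nonexistent" ()
  mn <- receiveTimeout 5000000  -- 5 seconds, in microseconds
          [ match (\(NodeMonitorNotification ref' nid reason) ->
                     return (ref' == ref, nid, reason)) ]
  case mn of
    Nothing -> error "hung: no NodeMonitorNotification within 5s"
    Just _  -> return ()
```

The hard part, as the discussion below suggests, is arranging a transport whose connect fails deterministically rather than racing with `EventConnectionLost`.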

@qnikst
Contributor Author

qnikst commented Jul 3, 2015

I'll think about how to write such a test; I don't see a simple way right now.

@facundominguez
Contributor

LGTM

qnikst added a commit that referenced this pull request Jul 6, 2015
Fix error case in node connection procedure.
@qnikst qnikst merged commit 14227fd into master Jul 6, 2015
@qnikst qnikst deleted the fix-node-monitoring branch July 6, 2015 08:55
@mboes
Contributor

mboes commented Jul 17, 2015

@facundominguez So is this a workaround that should be backed out once the tests are fixed, or did we merge a semantic fix here?

@qnikst
Contributor Author

qnikst commented Jul 17, 2015

This is a workaround for the case where the EventConnectionLost message is not guaranteed by the n-t semantics to arrive on connection/send failure.

@mboes
Contributor

mboes commented Jul 17, 2015

I don't understand the above comment. You mean, we decree that the node disconnected in the case where we know all connections to some given endpoint died all at once? Just in case we never do get the EventConnectionLost event?

@qnikst
Contributor Author

qnikst commented Jul 17, 2015

No. I mean that if we try to connect to another node using a reliable ordered connection (the CH case) and the connect fails, this means that connectivity with the remote endpoint has failed. The connection to that node should be marked as broken, and monitors of the node and of processes on that node should receive notifications.
This may happen in two ways:

  • EventConnectionLost for the remote endpoint is guaranteed by n-t semantics to arrive after the failed connect; we have relevant test cases, and all major n-t implementations pass them.
  • We rely on the workaround from this patch.

Am I clear now?
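The notification fan-out qnikst describes can be sketched as a small pure model (illustrative names and types, not the real node-controller state):

```haskell
import qualified Data.Map.Strict as Map

type NodeId    = String
type ProcessId = String

data Notification = NodeDown NodeId | ProcessDown ProcessId
  deriving (Show, Eq)

-- `monitored` maps a remote node to the processes on it that local
-- processes are monitoring. When a reliable connect to that node
-- fails, mark the node broken and notify both the node monitors and
-- the monitors of every process hosted on that node.
handleConnectFailure :: NodeId
                     -> Map.Map NodeId [ProcessId]
                     -> [Notification]
handleConnectFailure nid monitored =
  NodeDown nid : map ProcessDown (Map.findWithDefault [] nid monitored)
```

The point of the patch is exactly this fan-out: a failed connect must produce the node-level notification, not only a notification scoped to one receiver.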

@mboes
Contributor

mboes commented Jul 17, 2015

Yes, that is a bit clearer. I think the workaround is wrong.

  1. didSend may be false not just because connect failed; it could also be because NT.send failed. And that can fail even though the other connections on the endpoint are still up and working fine.
  2. If the heavyweight connection really did go down, we should be waiting for an EventConnectionLost, not anticipating or papering over it.

@mboes
Contributor

mboes commented Jul 17, 2015

Hm, forget point (1): there is the "bundle" story, of course, which was up for discussion previously.

@mboes
Contributor

mboes commented Jul 17, 2015

Ok, yes, point (1) is valid, I think: the send didn't necessarily fail because of a connection failure. Signing out now; second-guessing after dinner the messages I wrote before dinner is not working too well.

@qnikst
Contributor Author

qnikst commented Jul 18, 2015

  1. I'd agree with the points if we had enough information to distinguish bundles, but currently we have an all-or-nothing approach. Quoting from the n-t API:

> Although Network.Transport provides multiple independent lightweight connections between endpoints, those connections cannot fail independently: once one connection has failed, all connections, in both directions, must now be considered to have failed; they fail as a "bundle" of connections, with only a single "bundle" of connections per endpoint at any point in time.
>
> That is, suppose there are multiple connections in either direction between endpoints A and B, and A receives a notification that it has lost contact with B. Then A must not be able to send any further messages to B on existing connections.

An error on send means a connection failure, which means the whole "bundle" goes down; and since we have no way to distinguish between bundles, all of them should be considered down. That chapter seems to state guarantees very similar to what I wanted, so this fix would then be redundant. On the other hand, it seems this fix will not play well with unreliable sends and monitoring. But I think we need to introduce the desired semantics, and tests for it, as a separate task.
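The quoted bundle guarantee can be modelled in a few lines (a toy model with illustrative names, not the network-transport API):

```haskell
import qualified Data.Map.Strict as Map

type EndPoint = String
type ConnId   = Int

-- All lightweight connections between a pair of endpoints, in both
-- directions, form one bundle.
type Connections = Map.Map (EndPoint, EndPoint) [ConnId]

-- A failure on any single connection between a and b fails the whole
-- bundle: every lightweight connection in both directions is dropped
-- at once, matching the "cannot fail independently" guarantee above.
failBundle :: EndPoint -> EndPoint -> Connections -> Connections
failBundle a b = Map.delete (a, b) . Map.delete (b, a)
```

Under this reading, a send error collapses the whole map entry for the endpoint pair, which is why the patch's node-level disconnect would be consistent with the documented semantics.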

3 participants