Fix error case in node connection procedure. #246


Merged: 1 commit merged into master from fix-node-monitoring on Jul 6, 2015

Conversation

qnikst
Contributor

@qnikst qnikst commented Jul 3, 2015

If creating a new connection to the remote EndPoint fails, then all
reliable connections (those used in d-p) have failed as well. As a
result we need to emit a `Died (node) Disconnect` event. Previously we
emitted a `Died typeOfReceiver Disconnect` event.
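The change described above can be sketched as a toy model (this is illustrative code, not the actual distributed-process internals; the `Identifier`, `Event`, and function names are simplified assumptions):

```haskell
-- Toy model of the fix: when creating a new connection to a remote
-- EndPoint fails, every reliable connection to that endpoint is gone,
-- so the disconnect event should name the whole node, not just the
-- intended receiver.

data Identifier
  = NodeId String       -- the remote node
  | ProcessId String    -- a process on that node (the typeOfReceiver case)
  deriving (Show, Eq)

data Reason = Disconnect deriving (Show, Eq)
data Event  = Died Identifier Reason deriving (Show, Eq)

-- Before the patch: blame only the intended receiver of the message.
onConnectFailureOld :: Identifier -> Identifier -> Event
onConnectFailureOld _node receiver = Died receiver Disconnect

-- After the patch: a failed connect means the whole bundle to the
-- remote endpoint is down, so the node itself is reported dead.
onConnectFailureNew :: Identifier -> Identifier -> Event
onConnectFailureNew node _receiver = Died node Disconnect

main :: IO ()
main = do
  let node = NodeId "nid://host:port:0"
      recv = ProcessId "pid://host:port:0:7"
  print (onConnectFailureOld node recv)  -- Died (ProcessId "pid://host:port:0:7") Disconnect
  print (onConnectFailureNew node recv)  -- Died (NodeId "nid://host:port:0") Disconnect
```

The difference is only in which identifier the `Died` event carries; with the patch, node-level monitors see the disconnect too.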
@mboes
Contributor

mboes commented Jul 3, 2015

This should come with a regression test.

@qnikst
Contributor Author

qnikst commented Jul 3, 2015

This patch fixes the test suite: it no longer randomly hangs (on TCP) and stops failing on in-memory. Does that count as a regression test, or do we need an explicit one?

@mboes
Contributor

mboes commented Jul 3, 2015

Which test(s) in the test suite? It ought to be possible to write a test that consistently hangs now and consistently does not hang with this patch.
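One possible shape for such a regression test, using the public distributed-process API (a sketch under assumptions: `deadNode` is a node made unreachable by the harness, the transport setup is omitted, and forcing the failed connect via `nsendRemote` is a guess at the simplest trigger):

```haskell
import Control.Distributed.Process

-- Monitor an unreachable node, force a connection attempt, and require
-- that a NodeMonitorNotification arrives within a timeout. Without the
-- patch the notification targets the wrong identifier and this blocks;
-- with the patch it should arrive as Died (node) Disconnect.
testNodeDiedOnConnectFailure :: NodeId -> Process ()
testNodeDiedOnConnectFailure deadNode = do
  ref <- monitorNode deadNode
  -- any send to the unreachable node forces a connect, which fails
  nsendRemote deadNode "nonexistent" ()
  mn <- receiveTimeout 5000000  -- 5 seconds, in microseconds
          [ match (\(NodeMonitorNotification ref' nid reason) ->
                     return (ref' == ref, nid, reason)) ]
  case mn of
    Nothing -> error "hung: no NodeMonitorNotification within 5s"
    Just _  -> return ()
```

The hard part, as the discussion below suggests, is arranging a transport whose connect fails deterministically rather than racing with `EventConnectionLost`.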

@qnikst
Contributor Author

qnikst commented Jul 3, 2015

I'll think about how to write such a test; I don't see a simple way right now.

@facundominguez
Contributor

LGTM

qnikst added a commit that referenced this pull request Jul 6, 2015
Fix error case in node connection procedure.
@qnikst qnikst merged commit 14227fd into master Jul 6, 2015
@qnikst qnikst deleted the fix-node-monitoring branch July 6, 2015 08:55
@mboes
Contributor

mboes commented Jul 17, 2015

@facundominguez So is this a workaround that should be backed out once the tests are fixed, or did we merge a semantic fix here?

@qnikst
Contributor Author

qnikst commented Jul 17, 2015

This is a workaround for the case where the EventConnectionLost message is not guaranteed by the n-t semantics to arrive on connection/send failure.

@mboes
Contributor

mboes commented Jul 17, 2015

I don't understand the above comment. You mean, we decree that the node disconnected in the case where we know all connections to some given endpoint died all at once? Just in case we never do get the EventConnectionLost event?

@qnikst
Contributor Author

qnikst commented Jul 17, 2015

No. I mean that if we try to connect to another node using a reliable ordered connection (the CH case) and the connect fails, this means that connectivity with the remote endpoint has failed. The connection to that node should be marked as broken, and monitors of the node and of processes on that node should receive notifications.
This may happen in two ways:

  • EventConnectionLost for the remote endpoint is guaranteed by n-t semantics to arrive after the failed connect; we have relevant test cases, and all major n-t implementations pass them.
  • We rely on the workaround from this patch.

Am I clear now?
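The notification fan-out qnikst describes can be sketched as a small pure model (illustrative names and types, not the real node-controller state):

```haskell
import qualified Data.Map.Strict as Map

type NodeId    = String
type ProcessId = String

data Notification = NodeDown NodeId | ProcessDown ProcessId
  deriving (Show, Eq)

-- `monitored` maps a remote node to the processes on it that local
-- processes are monitoring. When a reliable connect to that node
-- fails, mark the node broken and notify both the node monitors and
-- the monitors of every process hosted on that node.
handleConnectFailure :: NodeId
                     -> Map.Map NodeId [ProcessId]
                     -> [Notification]
handleConnectFailure nid monitored =
  NodeDown nid : map ProcessDown (Map.findWithDefault [] nid monitored)
```

The point of the patch is exactly this fan-out: a failed connect must produce the node-level notification, not only a notification scoped to one receiver.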

@mboes
Contributor

mboes commented Jul 17, 2015

Yes, that is a bit clearer. I think the workaround is wrong.

  1. didSend may be false not just because connect failed; it could also be because NT.send failed. And that can fail even though the other connections on the endpoint are still up and working fine.
  2. If the heavyweight connection really did go down, we should be waiting for an EventConnectionLost, not anticipating or papering over it.

@mboes
Contributor

mboes commented Jul 17, 2015

Hm, forget point (1): there is the "bundle" story, of course, which was up for discussion previously.

@mboes
Contributor

mboes commented Jul 17, 2015

Ok, yes, point (1) is valid, I think: the send didn't necessarily fail because of a connection failure. Signing out now; second-guessing after dinner the messages I wrote before dinner is not working too well.

@qnikst
Contributor Author

qnikst commented Jul 18, 2015

  1. I'd agree with the points if we had enough information to distinguish bundles, but currently we have an all-or-nothing approach. Quoting from the n-t API:

> Although Network.Transport provides multiple independent lightweight connections between endpoints, those connections cannot fail independently: once one connection has failed, all connections, in both directions, must now be considered to have failed; they fail as a "bundle" of connections, with only a single "bundle" of connections per endpoint at any point in time.
>
> That is, suppose there are multiple connections in either direction between endpoints A and B, and A receives a notification that it has lost contact with B. Then A must not be able to send any further messages to B on existing connections.

An error on send means a connection failure, which means the whole "bundle" goes down; and since we have no way to distinguish between bundles, all of them should be considered down. That chapter seems to state guarantees very similar to what I wanted, so this fix would then be redundant. On the other hand, it seems this fix will not play well with unreliable sends and monitoring. But I think we need to introduce the desired semantics, and tests for it, as a separate task.
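The quoted bundle guarantee can be modelled in a few lines (a toy model with illustrative names, not the network-transport API):

```haskell
import qualified Data.Map.Strict as Map

type EndPoint = String
type ConnId   = Int

-- All lightweight connections between a pair of endpoints, in both
-- directions, form one bundle.
type Connections = Map.Map (EndPoint, EndPoint) [ConnId]

-- A failure on any single connection between a and b fails the whole
-- bundle: every lightweight connection in both directions is dropped
-- at once, matching the "cannot fail independently" guarantee above.
failBundle :: EndPoint -> EndPoint -> Connections -> Connections
failBundle a b = Map.delete (a, b) . Map.delete (b, a)
```

Under this reading, a send error collapses the whole map entry for the endpoint pair, which is why the patch's node-level disconnect would be consistent with the documented semantics.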

3 participants