NETWORKING: MockTransportService Wait for Close #35038


Merged: 8 commits into elastic:master on Oct 31, 2018

Conversation

@original-brownbear (Contributor) commented Oct 29, 2018

* Make `MockTransportService` wait `30s` for close listeners to run before failing the assertion
* Closes elastic#34990
@original-brownbear added the >test (Issues or PRs that are addressing/adding tests), :Distributed Coordination/Network (Http and internode communication implementations), v7.0.0 and v6.6.0 labels on Oct 29, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@Tim-Brooks (Contributor) left a comment

Can we just do:

            try {
                ESTestCase.assertBusy(() -> {
                    assert openConnections.size() == 0 : "still open connections: " + openConnections;
                });
            } catch (AssertionError e) {
                throw e; // let the assertion failure propagate untouched
            } catch (Exception e) {
                throw new IOException(e); // assertBusy declares a checked Exception
            }

The wait / notify semantics are kind of tricky to handle (for example, I think the notify call should be inside a synchronized block).

You could also add an assertBusyUnchecked that takes a Runnable as opposed to a CheckedRunnable if you want to avoid the awkward exception handling.
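
For illustration, a minimal sketch of what such a helper could look like (assertBusyUnchecked is only the name suggested here, not an existing ESTestCase method):

    import org.elasticsearch.test.ESTestCase;

    final class BusyAssertions {
        // Hypothetical helper along the lines suggested above; it wraps the existing
        // assertBusy(CheckedRunnable) so callers can pass a plain Runnable without
        // handling the checked exception at every call site.
        static void assertBusyUnchecked(Runnable assertion) {
            try {
                ESTestCase.assertBusy(assertion::run); // retries until no AssertionError or the timeout elapses
            } catch (Exception e) {
                throw new RuntimeException(e); // assertBusy declares a checked Exception
            }
        }
    }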

@original-brownbear (Contributor, Author)

@tbrooks8 the notifyAll is inside the synchronized block :)

We would also need to do:

                ESTestCase.assertBusy(() -> {
                    synchronized(openConnections) {
                        assert openConnections.size() == 0 : "still open connections: " + openConnections;
                    }
                });

here to release the lock on openConnections between asserts.
=> it seems a lot more intuitive not to needlessly busy-poll here (and add more locking complication like this) when we can just wait without polling, in a completely deterministic fashion?
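
A simplified sketch of that wait/notify shape (field names and types are illustrative, not the actual MockTransportService code): the close listener removes the connection and notifies under the monitor, while doClose() waits on the same monitor until the map empties or 30s elapse.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.TimeUnit;

    final class OpenConnectionsTracker {
        private final Set<Object> openConnections = new HashSet<>();

        void onConnectionOpened(Object connection) {
            synchronized (openConnections) {
                openConnections.add(connection);
            }
        }

        // Called from each connection's close listener.
        void onConnectionClosed(Object connection) {
            synchronized (openConnections) {
                openConnections.remove(connection);
                if (openConnections.isEmpty()) {
                    openConnections.notifyAll(); // wake the closing thread once the last one is gone
                }
            }
        }

        // Called from doClose(): block deterministically until all close listeners have run.
        void awaitAllClosed() throws InterruptedException {
            synchronized (openConnections) {
                long deadline = System.currentTimeMillis() + TimeUnit.SECONDS.toMillis(30);
                while (openConnections.isEmpty() == false) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) {
                        throw new AssertionError("still open connections: " + openConnections);
                    }
                    openConnections.wait(remaining); // releases the monitor while parked
                }
            }
        }
    }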

@DaveCTurner (Contributor)

I am not 100% convinced that this is the right fix. #34990 seems to affect freshly-opened connections, which to me makes it feel like a race condition between opening connections and closing the transport down, and waiting for some arbitrary timeout makes me a little uneasy.

The synchronisation in MockTransportService#openConnection is quite weak and, in particular, looks like it allows for the connection to be closed (and the size of openConnections to be checked) before addCloseListener is called, and I think this explains the problem.

Would it be sufficient to call connection.addCloseListener under the same mutex as the one that protects the call to openConnections.computeIfAbsent?
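
A plain-Java illustration of that ordering (simplified stand-ins, not the Elasticsearch types): the close listener is registered under the same monitor that records the connection, so a concurrent shutdown either sees the connection with its listener already attached or does not see it at all.

    import java.util.HashSet;
    import java.util.Set;

    final class TrackedConnections {

        interface Connection {
            // Assumed semantics, matching the discussion: if the connection is already
            // closed, the listener runs immediately on the registering thread.
            void addCloseListener(Runnable listener);
        }

        private final Set<Connection> openConnections = new HashSet<>();

        void track(Connection connection) {
            synchronized (openConnections) {
                openConnections.add(connection);
                // Registered before the monitor is released; if the listener fires right
                // here, the reentrant monitor makes the nested synchronized block safe.
                connection.addCloseListener(() -> {
                    synchronized (openConnections) {
                        openConnections.remove(connection);
                        openConnections.notifyAll(); // wake a waiter, as in the earlier sketch
                    }
                });
            }
        }
    }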

@original-brownbear (Contributor, Author)

@DaveCTurner I think you're 100% right, this is definitely another possible race. Your suggestion sounds just fine too => added it in 7f5cd24

@DaveCTurner (Contributor)

With the extra synchronisation, do we need the wait/notify as well?

@original-brownbear (Contributor, Author)

@DaveCTurner yes, because the close callback may be executed on a different thread and hence could run after the assertion if we don't wait for the map to empty.

@DaveCTurner (Contributor)

Do you mean that the call to super.doClose(); in MockTransportService#doClose() can return before all the connections have been fully closed? That sounds bad. I see this comment:

// we are holding a write lock so nobody modifies the connectedNodes / openConnections map - it's safe to first close
// all instances and then clear them maps

This doesn't look true - I can see a couple of places where we modify connectedNodes without protection from closeLock:

Transport.Connection nodeChannels = connectedNodes.remove(node);

@original-brownbear (Contributor, Author)

@DaveCTurner no, this is my bad (sorry for the confusion). I mixed up the connection close and the channel close. The connection close callback is invoked in the same thread that invokes doClose() => no need to wait for the channel list to empty, and just synchronizing the connect itself should be fine.

Simplified the PR in fb843f5.
Also, will look into the issue you point out above (that shouldn't be relevant here though, since the connections are just a field managed by MockTransportService so as long as the callback and the close run in the same thread we're good I think).

@DaveCTurner (Contributor) left a comment

LGTM but I'd like @tbrooks8's opinion too.

@original-brownbear (Contributor, Author)

@DaveCTurner I think the two concurrency spots in ConnectionManager are ok the way they are:

closeLock.readLock().lock(); // ensure we don't open connections while we are closing

gives us the close lock for

And for:

Transport.Connection nodeChannels = connectedNodes.remove(node);

we should be fine too. The iterator from a CHM should be consistent with concurrent removal from the CHM (remove on the iterator should simply be a no-op if the element was concurrently removed?).
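
A quick standalone check of that claim against the JDK's ConcurrentHashMap (nothing Elasticsearch-specific):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class ChmRemovalDemo {
        public static void main(String[] args) {
            Map<String, String> map = new ConcurrentHashMap<>();
            map.put("node-1", "connection-1");
            map.put("node-2", "connection-2");

            Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
            Map.Entry<String, String> entry = it.next();   // start iterating
            map.remove(entry.getKey());                    // concurrent-style removal via the map
            it.remove();                                   // no ConcurrentModificationException; a no-op here

            System.out.println(map);                       // the other entry is still present
        }
    }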

@DaveCTurner (Contributor)

I think the issue might just be that the comment is misleading, as comments often are. Both of the calls to connectedNodes.remove() look OK; there'd only be a problem if we were to concurrently add a connected node if I'm reading this right. Could you fix the comment so it only talks about adding to connectedNodes?

This line isn't under a lock by the way, it's in a handler that may not run until the lock is released:

@original-brownbear (Contributor, Author)

@DaveCTurner

This line isn't under a lock by the way, it's in a handler that may not run until the lock is released:

Sorry, you're right my eyes aren't so great today apparently :D

Could you fix the comment so it only talks about adding to connectedNodes?

Fixed :)

@Tim-Brooks (Contributor)

The synchronisation in MockTransportService#openConnection is quite weak and, in particular, looks like it allows for the connection to be closed (and the size of openConnections to be checked) before addCloseListener is called, and I think this explains the problem.

Can someone explain this hypothesis before I review this?

I don't understand how this can lead to this issue. Even if the channel is closed before the listener is added, the listener is still called. The listener is just executed by the thread that is adding the listener.

So I guess my point is that the additional synchronization does not seem to add much here. I'm not opposed to it as this is just test code, I just want to make sure I understand the proposed issues at play here.

Do you mean that the call to super.doClose(); in MockTransportService#doClose() can return before all the connections have been fully closed?

YESish. The assertions introduced in the MockTransportService only apply to unmanaged connections. These connections are not managed by the ConnectionManager. Now, at a low level, killing the transport service kills the event loops, which kills all TCP connections. And there are listeners that then close the node connection (which is a collection of TCP connections). But this is a race with other places that could be closing the node connection, and whatever closes the node connection first wins and executes the listener. This is why some type of wait or assertBusy is necessary, as someone outside of the networking code might concurrently be closing this node connection. It is also why we cannot make assumptions about which threads listeners are executed on.

As a separate note, I want to point out that discussions about the ConnectionManager are relatively orthogonal to the issues at play here as these assertions apply to unmanaged connections.

@original-brownbear (Contributor, Author)

@tbrooks8

The listener is just executed by the thread that is adding the listener.

At least I did not know that, sorry about the confusion here.

So I guess my point is that the additional synchronization does not seem to add much here. I'm not opposed to it as this is just test code, I just want to make sure I understand the proposed issues at play here.

Yea, if you're ok with it I'd vote to keep it. It makes the whole thing a lot easier to think through (IMO).

But this is a race with other places that could be closing the node connection. And whatever closes the node connection first wins and executes the listener. This is why some type of wait or assert busy is necessary as someone outside of the networking code might concurrently be closing this node connection. Also why we cannot make assumptions about which threads listeners are executed on.

Ah, that makes perfect sense :) Thanks! (and sorry for the annoying back and forth here, but now it's all much clearer to me :))

=> I added the wait back now in ce879da

@DaveCTurner (Contributor)

Can someone explain this hypothesis before I review this?

Sure...

I don't understand how this can lead to this issue. Even if the channel is closed before the listener added, the listener is still called. The listener is just executed by the thread that is adding the listener.

The issue seems to be that we can:

  1. start to execute openConnection and get as far as openConnections.computeIfAbsent() but not as far as connection.addCloseListener().
  2. close the transport service, which closes all the open connections.
  3. only then call connection.addCloseListener(), which executes the listener immediately and removes the connection from openConnections.

The failing assertion openConnections.size() == 0 is checked in step 2, before we've tried to add the close listener, so there is nothing to remove the connection from openConnections and the assertion fails.
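
Replaying that interleaving sequentially makes the failure concrete (illustration only, with simplified stand-ins for the real objects):

    import java.util.HashSet;
    import java.util.Set;

    public final class RaceReplay {
        public static void main(String[] args) {
            Set<Object> openConnections = new HashSet<>();

            // Step 1: openConnection() records the connection but has not yet registered
            // the close listener.
            Object connection = new Object();
            openConnections.add(connection);

            // Step 2: the transport service shuts down and checks the map now - the listener
            // that would remove the connection was never registered, so the check fails.
            System.out.println("still open connections: " + openConnections); // non-empty

            // Step 3: only now would addCloseListener() run the listener (immediately, since
            // the connection is already closed) and remove the entry - too late for step 2.
            openConnections.remove(connection);
        }
    }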

As a separate note, I want to point out that discussions about the ConnectionManager are relatively orthogonal to the issues at play here as these assertions apply to unmanaged connections.

Indeed - I wasn't tracing things very accurately.

@Tim-Brooks (Contributor) left a comment

Okay. I think we are all in agreement about the potential problems here and how the PR addresses them.

LGTM

@original-brownbear merged commit e6f9f06 into elastic:master on Oct 31, 2018
@original-brownbear deleted the 34990 branch on October 31, 2018 at 20:33
@original-brownbear (Contributor, Author)

@tbrooks8 thanks!

Labels
:Distributed Coordination/Network (Http and internode communication implementations), >test (Issues or PRs that are addressing/adding tests), v5.6.14, v6.6.0, v7.0.0-beta1