NETWORKING: MockTransportService Wait for Close #35038


Merged: 8 commits into elastic:master on Oct 31, 2018

Conversation

@original-brownbear (Contributor) commented Oct 29, 2018

* Make `MockTransportService` wait `30s` for close listeners to run before failing the assertion
* Closes elastic#34990
@original-brownbear added the >test (Issues or PRs that are addressing/adding tests), :Distributed Coordination/Network (Http and internode communication implementations), v7.0.0 and v6.6.0 labels on Oct 29, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed

@Tim-Brooks (Contributor) left a comment

Can we just do:

            try {
                ESTestCase.assertBusy(() -> {
                    assert openConnections.size() == 0 : "still open connections: " + openConnections;
                });
            } catch (AssertionError e) {
                throw e; // let the assertion failure propagate untouched
            } catch (Exception e) {
                throw new IOException(e); // assertBusy declares a checked Exception
            }

The wait / notify semantics are kind of tricky to handle (for example, I think the notify call should be inside a synchronized block).

You could also add an assertBusyUnchecked that takes a Runnable as opposed to a CheckedRunnable if you want to avoid the awkward exception handling.
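
For illustration, a minimal sketch of what such a helper could look like (assertBusyUnchecked is only the name suggested here, not an existing ESTestCase method):

    import org.elasticsearch.test.ESTestCase;

    final class BusyAssertions {
        // Hypothetical helper along the lines suggested above; it wraps the existing
        // assertBusy(CheckedRunnable) so callers can pass a plain Runnable without
        // handling the checked exception at every call site.
        static void assertBusyUnchecked(Runnable assertion) {
            try {
                ESTestCase.assertBusy(assertion::run); // retries until no AssertionError or the timeout elapses
            } catch (Exception e) {
                throw new RuntimeException(e); // assertBusy declares a checked Exception
            }
        }
    }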

@original-brownbear (Contributor, Author)

@tbrooks8 the notifyAll is inside the synchronized block :)

We would also need to do:

                ESTestCase.assertBusy(() -> {
                    synchronized(openConnections) {
                        assert openConnections.size() == 0 : "still open connections: " + openConnections;
                    }
                });

here to release the lock on openConnections between asserts.
=> it seems a lot more intuitive not to needlessly busy-poll here (and add more locking complication like this) when we can just wait without polling, in a completely deterministic fashion?
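
A simplified sketch of that wait/notify shape (field names and types are illustrative, not the actual MockTransportService code): the close listener removes the connection and notifies under the monitor, while doClose() waits on the same monitor until the map empties or 30s elapse.

    import java.util.HashSet;
    import java.util.Set;
    import java.util.concurrent.TimeUnit;

    final class OpenConnectionsTracker {
        private final Set<Object> openConnections = new HashSet<>();

        void onConnectionOpened(Object connection) {
            synchronized (openConnections) {
                openConnections.add(connection);
            }
        }

        // Called from each connection's close listener.
        void onConnectionClosed(Object connection) {
            synchronized (openConnections) {
                openConnections.remove(connection);
                if (openConnections.isEmpty()) {
                    openConnections.notifyAll(); // wake the closing thread once the last one is gone
                }
            }
        }

        // Called from doClose(): block deterministically until all close listeners have run.
        void awaitAllClosed() throws InterruptedException {
            synchronized (openConnections) {
                long deadline = System.currentTimeMillis() + TimeUnit.SECONDS.toMillis(30);
                while (openConnections.isEmpty() == false) {
                    long remaining = deadline - System.currentTimeMillis();
                    if (remaining <= 0) {
                        throw new AssertionError("still open connections: " + openConnections);
                    }
                    openConnections.wait(remaining); // releases the monitor while parked
                }
            }
        }
    }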

@DaveCTurner (Contributor)

I am not 100% convinced that this is the right fix. #34990 seems to affect freshly-opened connections, which to me makes it feel like a race condition between opening connections and closing the transport down, and waiting for some arbitrary timeout makes me a little uneasy.

The synchronisation in MockTransportService#openConnection is quite weak and, in particular, looks like it allows for the connection to be closed (and the size of openConnections to be checked) before addCloseListener is called, and I think this explains the problem.

Would it be sufficient to call connection.addCloseListener under the same mutex as the one that protects the call to openConnections.computeIfAbsent?
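
A plain-Java illustration of that ordering (simplified stand-ins, not the Elasticsearch types): the close listener is registered under the same monitor that records the connection, so a concurrent shutdown either sees the connection with its listener already attached or does not see it at all.

    import java.util.HashSet;
    import java.util.Set;

    final class TrackedConnections {

        interface Connection {
            // Assumed semantics, matching the discussion: if the connection is already
            // closed, the listener runs immediately on the registering thread.
            void addCloseListener(Runnable listener);
        }

        private final Set<Connection> openConnections = new HashSet<>();

        void track(Connection connection) {
            synchronized (openConnections) {
                openConnections.add(connection);
                // Registered before the monitor is released; if the listener fires right
                // here, the reentrant monitor makes the nested synchronized block safe.
                connection.addCloseListener(() -> {
                    synchronized (openConnections) {
                        openConnections.remove(connection);
                        openConnections.notifyAll(); // wake a waiter, as in the earlier sketch
                    }
                });
            }
        }
    }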

@original-brownbear (Contributor, Author)

@DaveCTurner I think you're 100% right, this is definitely another possible race. Your suggestion sounds just fine too => added it in 7f5cd24

@DaveCTurner (Contributor)

With the extra synchronisation, do we need the wait/notify as well?

@original-brownbear (Contributor, Author)

@DaveCTurner yes, because the close callback may be executed on a different thread and hence could run after the assertion if we don't wait for the map to empty.

@DaveCTurner (Contributor)

Do you mean that the call to super.doClose(); in MockTransportService#doClose() can return before all the connections have been fully closed? That sounds bad. I see this comment:

// we are holding a write lock so nobody modifies the connectedNodes / openConnections map - it's safe to first close
// all instances and then clear them maps

This doesn't look true - I can see a couple of places where we modify connectedNodes without protection from closeLock:

Transport.Connection nodeChannels = connectedNodes.remove(node);

@original-brownbear (Contributor, Author)

@DaveCTurner no, this is my bad (sorry for the confusion). I mixed up the connection close and the channel close. The connection close callback is invoked in the same thread that invokes doClose() => no need to wait for the channel list to empty, and just synchronizing the connect itself should be fine.

Simplified the PR in fb843f5.
Also, will look into the issue you point out above (that shouldn't be relevant here though, since the connections are just a field managed by MockTransportService so as long as the callback and the close run in the same thread we're good I think).

@DaveCTurner (Contributor) left a comment

LGTM but I'd like @tbrooks8's opinion too.

@original-brownbear (Contributor, Author)

@DaveCTurner I think the two concurrency spots in ConnectionManager are ok the way they are:

closeLock.readLock().lock(); // ensure we don't open connections while we are closing

gives us the close lock for

And for:

Transport.Connection nodeChannels = connectedNodes.remove(node);

we should be fine too. The iterator from a CHM should be consistent with concurrent removal from the CHM (remove on the iterator should simply be a no-op if the element was concurrently removed?).
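
A quick standalone check of that claim against the JDK's ConcurrentHashMap (nothing Elasticsearch-specific):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class ChmRemovalDemo {
        public static void main(String[] args) {
            Map<String, String> map = new ConcurrentHashMap<>();
            map.put("node-1", "connection-1");
            map.put("node-2", "connection-2");

            Iterator<Map.Entry<String, String>> it = map.entrySet().iterator();
            Map.Entry<String, String> entry = it.next();   // start iterating
            map.remove(entry.getKey());                    // concurrent-style removal via the map
            it.remove();                                   // no ConcurrentModificationException; a no-op here

            System.out.println(map);                       // the other entry is still present
        }
    }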

@DaveCTurner (Contributor)

I think the issue might just be that the comment is misleading, as comments often are. Both of the calls to connectedNodes.remove() look OK; there'd only be a problem if we were to concurrently add a connected node if I'm reading this right. Could you fix the comment so it only talks about adding to connectedNodes?

This line isn't under a lock by the way, it's in a handler that may not run until the lock is released:

@original-brownbear (Contributor, Author)

@DaveCTurner

This line isn't under a lock by the way, it's in a handler that may not run until the lock is released:

Sorry, you're right my eyes aren't so great today apparently :D

Could you fix the comment so it only talks about adding to connectedNodes?

Fixed :)

@Tim-Brooks (Contributor)

The synchronisation in MockTransportService#openConnection is quite weak and, in particular, looks like it allows for the connection to be closed (and the size of openConnections to be checked) before addCloseListener is called, and I think this explains the problem.

Can someone explain this hypothesis before I review this?

I don't understand how this can lead to this issue. Even if the channel is closed before the listener is added, the listener is still called. The listener is just executed by the thread that is adding the listener.

So I guess my point is that the additional synchronization does not seem to add much here. I'm not opposed to it as this is just test code, I just want to make sure I understand the proposed issues at play here.

Do you mean that the call to super.doClose(); in MockTransportService#doClose() can return before all the connections have been fully closed?

YESish. The assertions introduced in the MockTransportService only apply to unmanaged connections. These connections are not managed by the ConnectionManager. Now, at a low level, killing the transport service kills the event loops, which kills all TCP connections. And there are listeners that then close the node connection (which is a collection of TCP connections). But this is a race with other places that could be closing the node connection, and whatever closes the node connection first wins and executes the listener. This is why some type of wait or assertBusy is necessary, as someone outside of the networking code might concurrently be closing this node connection. It is also why we cannot make assumptions about which threads listeners are executed on.

As a separate note, I want to point out that discussions about the ConnectionManager are relatively orthogonal to the issues at play here as these assertions apply to unmanaged connections.

@original-brownbear (Contributor, Author)

@tbrooks8

The listener is just executed by the thread that is adding the listener.

At least I did not know that, sorry about the confusion here.

So I guess my point is that the additional synchronization does not seem to add much here. I'm not opposed to it as this is just test code, I just want to make sure I understand the proposed issues at play here.

Yea, if you're ok with it I'd vote to keep it. It makes the whole thing a lot easier to think through (IMO).

But this is a race with other places that could be closing the node connection. And whatever closes the node connection first wins and executes the listener. This is why some type of wait or assert busy is necessary as someone outside of the networking code might concurrently be closing this node connection. Also why we cannot make assumptions about which threads listeners are executed on.

Ah, that makes perfect sense :) Thanks! (and sorry for the annoying back and forth here, but now it's all much clearer to me :))

=> I added the wait back now in ce879da

@DaveCTurner (Contributor)

Can someone explain this hypothesis before I review this?

Sure...

I don't understand how this can lead to this issue. Even if the channel is closed before the listener added, the listener is still called. The listener is just executed by the thread that is adding the listener.

The issue seems to be that we can:

  1. start to execute openConnection and get as far as openConnections.computeIfAbsent() but not as far as connection.addCloseListener().
  2. close the transport service, which closes all the open connections.
  3. only then call connection.addCloseListener(), which executes the listener immediately and removes the connection from openConnections.

The failing assertion openConnections.size() == 0 is checked in step 2, before we've tried to add the close listener, so there is nothing to remove the connection from openConnections and the assertion fails.
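
Replaying that interleaving sequentially makes the failure concrete (illustration only, with simplified stand-ins for the real objects):

    import java.util.HashSet;
    import java.util.Set;

    public final class RaceReplay {
        public static void main(String[] args) {
            Set<Object> openConnections = new HashSet<>();

            // Step 1: openConnection() records the connection but has not yet registered
            // the close listener.
            Object connection = new Object();
            openConnections.add(connection);

            // Step 2: the transport service shuts down and checks the map now - the listener
            // that would remove the connection was never registered, so the check fails.
            System.out.println("still open connections: " + openConnections); // non-empty

            // Step 3: only now would addCloseListener() run the listener (immediately, since
            // the connection is already closed) and remove the entry - too late for step 2.
            openConnections.remove(connection);
        }
    }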

As a separate note, I want to point out that discussions about the ConnectionManager are relatively orthogonal to the issues at play here as these assertions apply to unmanaged connections.

Indeed - I wasn't tracing things very accurately.

@Tim-Brooks (Contributor) left a comment

Okay. I think we are all in agreement about the potential problems here and how the PR addresses them.

LGTM

@original-brownbear merged commit e6f9f06 into elastic:master on Oct 31, 2018
@original-brownbear deleted the 34990 branch on October 31, 2018 at 20:33
@original-brownbear (Contributor, Author)

@tbrooks8 thanks!

Labels
:Distributed Coordination/Network (Http and internode communication implementations), >test (Issues or PRs that are addressing/adding tests), v5.6.14, v6.6.0, v7.0.0-beta1