Possible finagle-chirper crash #106

Open · axel22 opened this issue May 2, 2019 · 14 comments
Labels: bug (Something isn't working)
axel22 (Member) commented May 2, 2019

Some users have reported seeing this crash on certain machines:

Exception in thread "Thread-938" Failure(connection timed out: localhost/127.0.0.1:51834 at remote address: localhost/127.0.0.1:51834. Remote Info: Not Available, flags=0x08) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:51834, Downstream label: :51834, Trace Id: 3b8504004e328e76.3b8504004e328e76<:3b8504004e328e76 with Service -> :51834
Caused by: com.twitter.finagle.ConnectionFailedException: connection timed out: localhost/127.0.0.1:51834 at remote address: localhost/127.0.0.1:51834. Remote Info: Not Available
	at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:99)
	at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:78)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:570)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:335)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: localhost/127.0.0.1:51834
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:568)
	... 11 more
axel22 (Member, Author) commented May 2, 2019

cc @Fithos

Fithos (Collaborator) commented May 2, 2019

I wasn't able to reproduce this error on Linux or macOS.
Could you please post more details on the configuration and OS of the machines where finagle-chirper fails?

axel22 (Member, Author) commented May 2, 2019

Unfortunately, neither was I on my machine. I am in contact with Tom Rodriguez, who reported it and sees it happening on some cluster machines (apparently the issue occurs both on HotSpot and on GraalVM).
cc @tkrodriguez

tkrodriguez (Collaborator) commented

I've really only seen this on a fairly large machine: an Oracle X5-2 with 72 cores running Oracle Linux Server release 6.8. I've also seen problems with running out of file descriptors, though that was with an older version of this benchmark. The limit is 65536, so it would have to be leaking fds pretty badly for this to occur. Anything I can try to help with debugging?
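
For what it's worth, one cheap way to watch for a descriptor leak from inside the process is to poll /proc/self/fd. A minimal sketch (assuming Linux and Scala 2.12+; the object name and polling interval are illustrative, not part of the benchmark):

    import java.io.File

    object FdWatcher {
      // Number of file descriptors currently held by this JVM process
      // (counts the entries in /proc/self/fd, so Linux only).
      def openFdCount(): Int = {
        val entries = new File("/proc/self/fd").list()
        if (entries == null) -1 else entries.length
      }

      // Prints the descriptor count periodically from a daemon thread,
      // so a slow leak shows up as a steadily growing number per iteration.
      def start(intervalMillis: Long = 5000L): Unit = {
        val t = new Thread(() => {
          while (true) {
            println(s"open fds: ${openFdCount()}")
            Thread.sleep(intervalMillis)
          }
        })
        t.setDaemon(true)
        t.start()
      }
    }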

farquet added the bug label on May 8, 2019
ceresek (Collaborator) commented May 9, 2019

The benchmark has a file descriptor leak. Example exception dump:

Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor3.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
... 172 more
Caused by: io.netty.channel.ChannelException: io.netty.channel.unix.Errors$NativeIoException: newSocketStream(..) failed: Too many open files
at io.netty.channel.unix.Socket.newSocketStream0(Socket.java:439)
at io.netty.channel.epoll.LinuxSocket.newSocketStream(LinuxSocket.java:184)
at io.netty.channel.epoll.EpollSocketChannel.<init>(EpollSocketChannel.java:45)
... 176 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: newSocketStream(..) failed: Too many open files
at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
... 179 more

ceresek (Collaborator) commented May 10, 2019

Seen more instances of this on regular machines (e.g. a measly 8-core, 64 GB RAM box). At the time of the crash, the benchmark has quite a lot of TCP sockets open. Sometimes this makes the benchmark hang instead of crash, which complicates measurements.

ceresek (Collaborator) commented May 10, 2019

... also seeing a constantly increasing number of threads (right now ~300 threads at iteration 400). If it helps, most of the thread instances are named UnboundedFuturePool-something.
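
For anyone reproducing this, a quick census of live threads grouped by name makes the growth easy to attribute. A rough sketch (the digit-stripping regex is just a heuristic to collapse numbered pool threads into one bucket; JavaConverters assumed for Scala 2.11/2.12 compatibility):

    import scala.collection.JavaConverters._

    object ThreadCensus {
      // Groups live threads by their name with the trailing counter removed,
      // so "UnboundedFuturePool-17" and "UnboundedFuturePool-42" land in the
      // same bucket.
      def byPool(): Seq[(String, Int)] =
        Thread.getAllStackTraces.keySet.asScala.toSeq
          .groupBy(_.getName.replaceAll("[-/]?\\d+$", ""))
          .map { case (name, ts) => name -> ts.size }
          .toSeq
          .sortBy(-_._2)

      def dump(): Unit =
        byPool().foreach { case (name, n) => println(f"$n%5d  $name") }
    }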

axel22 (Member, Author) commented May 10, 2019

@Fithos Can you confirm the file descriptor issue and the thread growth?

ceresek (Collaborator) commented May 10, 2019

[attached plot: finagle-chirper thread count per iteration]

axel22 (Member, Author) commented May 10, 2019

It's somewhat surprising that the thread count seems to be converging.

It also drops at the very end; tearDownAfterAll is the likely reason, I think. Perhaps we should change the benchmark so that it shuts the server instance down completely between iterations.
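
A minimal sketch of that per-iteration lifecycle using the plain Finagle HTTP API (makeService, the port, and the workload hook are placeholders, not the benchmark's actual code):

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    object IterationLifecycle {
      // Placeholder for whatever service the benchmark actually serves.
      def makeService(): Service[Request, Response] =
        Service.mk[Request, Response](_ => Future.value(Response()))

      def runIteration(workload: () => Unit): Unit = {
        // Bind a fresh server for this iteration only...
        val server = Http.serve(":8080", makeService())
        try {
          workload()
        } finally {
          // ...and tear it down completely afterwards, so sockets and
          // worker threads cannot accumulate across iterations.
          Await.ready(server.close())
        }
      }
    }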

vkostyukov (Collaborator) commented

Maybe it's because master isn't being properly closed here? https://github.com/renaissance-benchmarks/renaissance/blob/master/benchmarks/twitter-finagle/src/main/scala/org/renaissance/twitter/finagle/FinagleChirper.scala#L430-L433

I'm also not seeing where the other master is closed.

Clients in Finagle are designed to be long-lived and potentially shared across your application. They work better the longer you keep them around (connections are materialized, past performance data about errors/latencies is recorded, etc.).
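
For context, a client created with Http.newService is itself a Service that owns connections and pooled resources until it is closed. A minimal sketch of the fix being suggested (the master name mirrors the field in FinagleChirper.scala; the address and object name are illustrative):

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.Await

    object MasterClient {
      // The client holds sockets, connection pools and timers for as long
      // as it lives.
      val master: Service[Request, Response] =
        Http.newService("localhost:8080")

      def shutdown(): Unit = {
        // Closing the client releases those resources; if a new client is
        // built every iteration without this, descriptors and threads leak.
        Await.result(master.close())
      }
    }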

axel22 (Member, Author) commented May 14, 2019

Hi Vladimir! Thanks for the pointer!
So, essentially, this could be fixed by calling close on the Service object, correct?

vkostyukov (Collaborator) commented

That's right, @axel22. Just call .close() if you don't need the client anymore. Creating a single client instance and sharing it across your benchmark would be an even better workload, as that's essentially what traditional Finagle applications do.
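
A sketch of that pattern in a hypothetical harness with before/after hooks (tearDownAfterAll comes from the discussion above; setUpBeforeAll, fetch and the port are placeholders): create the client once, reuse it for every request, close it once at the end.

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    object SharedClientBenchmark {
      // Created once and shared by all iterations; a Finagle client is safe
      // to share and performs better the longer it stays warm.
      private var client: Service[Request, Response] = _

      def setUpBeforeAll(port: Int): Unit =
        client = Http.newService(s"localhost:$port")

      // Every request in every iteration goes through the same client.
      def fetch(path: String): Future[Response] =
        client(Request(path))

      def tearDownAfterAll(): Unit =
        Await.result(client.close())
    }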

albertnetymk commented

Quoting from #168, "However, I hit #106 a lot locally on my MacBook. So this issue was apparently not fully resolved (cc @Fithos )."

So the issue is still there; should it be reopened?

farquet reopened this on Jul 6, 2019