Possible finagle-chirper crash #106

Open · axel22 opened this issue May 2, 2019 · 14 comments
Labels: bug (Something isn't working)
axel22 (Member) commented May 2, 2019

Some users have reported seeing this crash on certain machines:

Exception in thread "Thread-938" Failure(connection timed out: localhost/127.0.0.1:51834 at remote address: localhost/127.0.0.1:51834. Remote Info: Not Available, flags=0x08) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: localhost/127.0.0.1:51834, Downstream label: :51834, Trace Id: 3b8504004e328e76.3b8504004e328e76<:3b8504004e328e76 with Service -> :51834
Caused by: com.twitter.finagle.ConnectionFailedException: connection timed out: localhost/127.0.0.1:51834 at remote address: localhost/127.0.0.1:51834. Remote Info: Not Available
	at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:99)
	at com.twitter.finagle.netty4.ConnectionBuilder$$anon$1.operationComplete(ConnectionBuilder.scala:78)
	at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504)
	at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483)
	at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424)
	at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:121)
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:570)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:335)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: localhost/127.0.0.1:51834
	at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$2.run(AbstractEpollChannel.java:568)
	... 11 more
axel22 (Member, Author) commented May 2, 2019

cc @Fithos

Fithos (Collaborator) commented May 2, 2019

I wasn't able to reproduce this error on Linux or macOS.
Could you please post more details on the configuration and OS of the machines where finagle-chirper fails?

axel22 (Member, Author) commented May 2, 2019

Unfortunately, neither was I on my machine. I am in contact with Tom Rodriguez, who reported it and sees it happening on some cluster machines (apparently the issue occurs both on HotSpot and on GraalVM).
cc @tkrodriguez

tkrodriguez (Collaborator) commented

I've really only seen this on a fairly large machine: an Oracle X5-2 with 72 cores running Oracle Linux Server release 6.8. I've also seen problems with running out of file descriptors, though that was with an older version of this benchmark. The limit is 65536, so it would have to be leaking fds pretty badly for this to occur. Anything I can try to help with debugging?
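
For what it's worth, one cheap way to watch for a descriptor leak from inside the process is to poll /proc/self/fd. A minimal sketch (assuming Linux and Scala 2.12+; the object name and polling interval are illustrative, not part of the benchmark):

    import java.io.File

    object FdWatcher {
      // Number of file descriptors currently held by this JVM process
      // (counts the entries in /proc/self/fd, so Linux only).
      def openFdCount(): Int = {
        val entries = new File("/proc/self/fd").list()
        if (entries == null) -1 else entries.length
      }

      // Prints the descriptor count periodically from a daemon thread,
      // so a slow leak shows up as a steadily growing number per iteration.
      def start(intervalMillis: Long = 5000L): Unit = {
        val t = new Thread(() => {
          while (true) {
            println(s"open fds: ${openFdCount()}")
            Thread.sleep(intervalMillis)
          }
        })
        t.setDaemon(true)
        t.start()
      }
    }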

farquet added the bug label on May 8, 2019
ceresek (Collaborator) commented May 9, 2019

The benchmark has a file descriptor leak. Example exception dump:

Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.GeneratedConstructorAccessor3.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at io.netty.channel.ReflectiveChannelFactory.newChannel(ReflectiveChannelFactory.java:38)
... 172 more
Caused by: io.netty.channel.ChannelException: io.netty.channel.unix.Errors$NativeIoException: newSocketStream(..) failed: Too many open files
at io.netty.channel.unix.Socket.newSocketStream0(Socket.java:439)
at io.netty.channel.epoll.LinuxSocket.newSocketStream(LinuxSocket.java:184)
at io.netty.channel.epoll.EpollSocketChannel.<init>(EpollSocketChannel.java:45)
... 176 more
Caused by: io.netty.channel.unix.Errors$NativeIoException: newSocketStream(..) failed: Too many open files
at io.netty.channel.unix.Errors.newIOException(Errors.java:122)
... 179 more

ceresek (Collaborator) commented May 10, 2019

Seen more instances of this on regular machines (e.g. a measly 8-core, 64 GB RAM box). At the time of the crash, the benchmark has quite a lot of TCP sockets open. Sometimes this makes the benchmark hang instead of crash, which complicates measurements.

ceresek (Collaborator) commented May 10, 2019

... also seeing a constantly increasing number of threads (right now ~300 threads at iteration 400). If it helps, most of the thread instances are named UnboundedFuturePool-something.
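
For anyone reproducing this, a quick census of live threads grouped by name makes the growth easy to attribute. A rough sketch (the digit-stripping regex is just a heuristic to collapse numbered pool threads into one bucket; JavaConverters assumed for Scala 2.11/2.12 compatibility):

    import scala.collection.JavaConverters._

    object ThreadCensus {
      // Groups live threads by their name with the trailing counter removed,
      // so "UnboundedFuturePool-17" and "UnboundedFuturePool-42" land in the
      // same bucket.
      def byPool(): Seq[(String, Int)] =
        Thread.getAllStackTraces.keySet.asScala.toSeq
          .groupBy(_.getName.replaceAll("[-/]?\\d+$", ""))
          .map { case (name, ts) => name -> ts.size }
          .toSeq
          .sortBy(-_._2)

      def dump(): Unit =
        byPool().foreach { case (name, n) => println(f"$n%5d  $name") }
    }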

axel22 (Member, Author) commented May 10, 2019

@Fithos Can you confirm the file descriptor issue and the thread growth?

ceresek (Collaborator) commented May 10, 2019

[attached plot: finagle-chirper thread count per iteration]

axel22 (Member, Author) commented May 10, 2019

It's somewhat surprising that the thread count seems to be converging.

It also drops at the very end; tearDownAfterAll is the likely reason, I think. Perhaps we should change the benchmark so that it shuts the server instance down completely between iterations.
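
A minimal sketch of that per-iteration lifecycle using the plain Finagle HTTP API (makeService, the port, and the workload hook are placeholders, not the benchmark's actual code):

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    object IterationLifecycle {
      // Placeholder for whatever service the benchmark actually serves.
      def makeService(): Service[Request, Response] =
        Service.mk[Request, Response](_ => Future.value(Response()))

      def runIteration(workload: () => Unit): Unit = {
        // Bind a fresh server for this iteration only...
        val server = Http.serve(":8080", makeService())
        try {
          workload()
        } finally {
          // ...and tear it down completely afterwards, so sockets and
          // worker threads cannot accumulate across iterations.
          Await.ready(server.close())
        }
      }
    }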

vkostyukov (Collaborator) commented

Maybe it's because master isn't being properly closed here? https://github.com/renaissance-benchmarks/renaissance/blob/master/benchmarks/twitter-finagle/src/main/scala/org/renaissance/twitter/finagle/FinagleChirper.scala#L430-L433

I'm also not seeing where the other master is closed.

Clients in Finagle are designed to be long-lived and potentially shared across your application. They work better the longer you keep them around (connections are materialized, past performance data about errors/latencies is recorded, etc.).
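
For context, a client created with Http.newService is itself a Service that owns connections and pooled resources until it is closed. A minimal sketch of the fix being suggested (the master name mirrors the field in FinagleChirper.scala; the address and object name are illustrative):

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.Await

    object MasterClient {
      // The client holds sockets, connection pools and timers for as long
      // as it lives.
      val master: Service[Request, Response] =
        Http.newService("localhost:8080")

      def shutdown(): Unit = {
        // Closing the client releases those resources; if a new client is
        // built every iteration without this, descriptors and threads leak.
        Await.result(master.close())
      }
    }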

axel22 (Member, Author) commented May 14, 2019

Hi Vladimir! Thanks for the pointer!
So, essentially, this could be fixed by calling close on the Service object, correct?

vkostyukov (Collaborator) commented

That's right, @axel22. Just call .close() if you don't need the client anymore. Creating a single client instance and sharing it across your benchmark would be an even better workload, as that's essentially what traditional Finagle applications do.
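
A sketch of that pattern in a hypothetical harness with before/after hooks (tearDownAfterAll comes from the discussion above; setUpBeforeAll, fetch and the port are placeholders): create the client once, reuse it for every request, close it once at the end.

    import com.twitter.finagle.{Http, Service}
    import com.twitter.finagle.http.{Request, Response}
    import com.twitter.util.{Await, Future}

    object SharedClientBenchmark {
      // Created once and shared by all iterations; a Finagle client is safe
      // to share and performs better the longer it stays warm.
      private var client: Service[Request, Response] = _

      def setUpBeforeAll(port: Int): Unit =
        client = Http.newService(s"localhost:$port")

      // Every request in every iteration goes through the same client.
      def fetch(path: String): Future[Response] =
        client(Request(path))

      def tearDownAfterAll(): Unit =
        Await.result(client.close())
    }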

albertnetymk commented

Quoting from #168, "However, I hit #106 a lot locally on my MacBook. So this issue was apparently not fully resolved (cc @Fithos )."

So the issue is still there; should it be reopened?

farquet reopened this on Jul 6, 2019