
Internalize reconnection mechanics and introduce connection pooling #492


Merged · 2 commits · Sep 20, 2017

Conversation

@jchambers (Owner) commented Jul 3, 2017

After several false starts, I'm pleased to share this work-in-progress attempt at channel pooling. In the past, the model was that a single ApnsClient would have a single connection to the APNs server. In this pull request, each client now has an ApnsChannelPool which manages access to one or more connections. Connections are created on demand, which eliminates the notion of explicitly connecting (or reconnecting) an ApnsClient.
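
To make the new model concrete, here's a rough sketch of what calling code might look like with pooling in place. The builder methods shown (setClientCredentials, setConcurrentConnections) are illustrative assumptions, not necessarily the final API:

    import java.io.File;

    import com.relayrides.pushy.apns.ApnsClient;
    import com.relayrides.pushy.apns.ApnsClientBuilder;

    class PooledClientSketch {

        // Builder/method names are illustrative assumptions, not the final API.
        ApnsClient buildPooledClient() throws Exception {
            return new ApnsClientBuilder()
                    .setClientCredentials(new File("/path/to/certificate.p12"), "p12-password")
                    .setConcurrentConnections(4) // upper bound on pooled connections
                    .build();
        }
    }

After build(), there is no connect() or reconnect() to call: the first sendNotification() borrows a connection from the ApnsChannelPool, creating one on demand if none is idle.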

Because the model is changing significantly, this pull request introduces many breaking API changes. I'm very interested in your feedback.

Again, I emphasize that this is a work in progress, but I think it's fairly representative of the shape of things to come. I'm also hopeful that it will resolve the host of reconnection issues folks have been reporting lately.

TODO:

  • Restore/update metrics
  • Restore exponential backoff when opening new connections (see the sketch after this list)
  • Add factory/pool tests
  • Document everything
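
On the backoff item above, here's a minimal sketch of the kind of exponential backoff I have in mind, assuming a Netty Bootstrap whose remote address is already configured; this is illustrative, not the actual implementation:

    import java.util.concurrent.TimeUnit;

    import io.netty.bootstrap.Bootstrap;
    import io.netty.channel.ChannelFutureListener;

    class BackoffSketch {

        private static final long MAX_DELAY_MILLIS = 60_000;

        // Retry a failed connection attempt after an exponentially increasing
        // delay, scheduled on the channel's own event loop.
        void connectWithBackoff(final Bootstrap bootstrap, final long delayMillis) {
            bootstrap.connect().addListener((ChannelFutureListener) future -> {
                if (!future.isSuccess()) {
                    final long nextDelay = Math.min(delayMillis * 2, MAX_DELAY_MILLIS);

                    future.channel().eventLoop().schedule(
                            () -> connectWithBackoff(bootstrap, nextDelay),
                            delayMillis, TimeUnit.MILLISECONDS);
                }
            });
        }
    }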

@@ -52,6 +52,9 @@
@Param({"10000"})
public int notificationCount;

@Param({"1", "4", "8"})
public int concurrentConnections;
@jchambers (Owner, Author) commented Jul 3, 2017

For the curious, early benchmarks show things to be maaaaaaaaybe slightly faster (it's pretty much in the noise) than they were before channel pooling with a single connection. Unsurprisingly, a single client's throughput goes up significantly when using multiple threads/connections.
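
For context on how the benchmark exercises those parameters: JMH re-runs the measurement for each @Param value, so a single benchmark method yields 1-, 4-, and 8-connection results. A compilable sketch (everything except the two @Param fields from the diff is hypothetical):

    import java.util.concurrent.CountDownLatch;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.Param;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    public class ApnsClientBenchmarkSketch {

        @Param({"10000"})
        public int notificationCount;

        @Param({"1", "4", "8"})
        public int concurrentConnections;

        @Benchmark
        public void sendNotifications() throws InterruptedException {
            final CountDownLatch latch = new CountDownLatch(this.notificationCount);

            for (int i = 0; i < this.notificationCount; i++) {
                // sendAsync is a hypothetical stand-in for the client's real
                // sendNotification call; completion counts down the latch.
                this.sendAsync(latch::countDown);
            }

            latch.await();
        }

        private void sendAsync(final Runnable onComplete) {
            onComplete.run(); // stub; the real benchmark sends over the network
        }
    }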

@clslrns commented Jul 21, 2017

This one would be very useful. I managed to get only around 11K notifications/s from a single connection with an 8-thread event loop group on a machine with 8 cores, and neither the network nor the CPU was fully loaded. Switching to 8 single-threaded ApnsClients improved throughput to 35–44K notifications/s.
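
The workaround described above amounts to sharding traffic across several single-threaded clients, roughly like this sketch (setEventLoopGroup is assumed to exist on the builder; credential setup is elided):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;

    import com.relayrides.pushy.apns.ApnsClient;
    import com.relayrides.pushy.apns.ApnsClientBuilder;

    import io.netty.channel.nio.NioEventLoopGroup;

    class ShardedClientsSketch {

        private final List<ApnsClient> clients = new ArrayList<>();
        private final AtomicLong counter = new AtomicLong();

        ShardedClientsSketch(final int clientCount) throws Exception {
            for (int i = 0; i < clientCount; i++) {
                this.clients.add(new ApnsClientBuilder()
                        .setEventLoopGroup(new NioEventLoopGroup(1)) // one thread per client
                        .build());
            }
        }

        // Round-robin dispatch across the shards; the channel pool in this
        // pull request makes this kind of manual sharding unnecessary.
        ApnsClient nextClient() {
            return this.clients.get((int) (this.counter.getAndIncrement() % this.clients.size()));
        }
    }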

@jchambers (Owner, Author)

Aside from the part where I think the Travis CI team is trolling me by constantly fixing and then un-fixing a weird hostname bug for OpenJDK 7, I think this is ready to roll. I'll sleep on it before I merge it, but anticipate a release some time tomorrow.

@jchambers jchambers force-pushed the channel_pool branch 2 times, most recently from f444a70 to 1dc2d7e Compare September 10, 2017 15:50
@jchambers (Owner, Author)

For those of you waiting patiently for this change (and release): given the significance of the changes, we want to observe it in production for a little while on our end before we unleash it upon the wider world. I'm hoping for a release at the end of next week.

@jchambers (Owner, Author)

Our work in #515 prompted some testing with very high levels of leak detection, which in turn increased load and slowed everything down. That, in turn, revealed some sporadic test failures when sending many notifications, which hints at a race condition. We're tracking it down, but it's something we'll need to resolve before the next release.

@jchambers (Owner, Author)

Here's what I've figured out so far:

  • The "send many notifications" tests are sometimes failing because a "send" future fails because the stream was closed before the server sent a reply.
  • The stream appears to have been closed because the underlying channel was closed.
  • The underlying channel appears to have been closed because we tried to acquire it from the idle pool, but then noticed that it wasn't active.

The last part is really puzzling to me, and will be the focus of the investigation from here.
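
For reference, the suspect acquire path looks roughly like this; idleChannels and createChannel are hypothetical stand-ins for the real ApnsChannelPool internals:

    import java.util.ArrayDeque;
    import java.util.Queue;

    import io.netty.channel.Channel;
    import io.netty.util.concurrent.Promise;

    class ChannelPoolSketch {

        private final Queue<Channel> idleChannels = new ArrayDeque<>();

        void acquire(final Promise<Channel> promise) {
            final Channel channel = this.idleChannels.poll();

            if (channel == null) {
                // No idle channel; open a new connection on demand.
                this.createChannel(promise);
            } else if (!channel.isActive()) {
                // The case observed in the failing tests: the channel went
                // inactive while sitting in the pool. Discard it and retry.
                channel.close();
                this.acquire(promise);
            } else {
                promise.setSuccess(channel);
            }
        }

        private void createChannel(final Promise<Channel> promise) {
            // Connection bootstrapping elided for brevity.
            promise.setFailure(new UnsupportedOperationException("sketch only"));
        }
    }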

@jchambers (Owner, Author)

In all cases of failure, the client sends a GOAWAY frame to the server, and the last stream created by the client is 143. This is shaping up to be either a spectacularly dumb or a really interesting bug.

@jchambers (Owner, Author) commented Sep 18, 2017

Found it. We're attaching response promises to streams asynchronously in the client handler, and in some cases, we can get a reply from the server before that promise is in place, which causes a NullPointerException. This was likely also a problem in 0.10, but we just didn't notice it until now.
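
In sketch form, the race and the shape of the fix look like this; the handler layout, the responsePromises map, and the method names are illustrative, not the actual client handler code (and everything here is assumed to run on the channel's event loop):

    import java.util.HashMap;
    import java.util.Map;

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.util.concurrent.Promise;

    class ResponsePromiseSketch<R> {

        private final Map<Integer, Promise<R>> responsePromises = new HashMap<>();
        private int nextStreamId = 1;

        void writeNotification(final ChannelHandlerContext context,
                final Object notification, final Promise<R> responsePromise) {

            final int streamId = this.nextStreamId;
            this.nextStreamId += 2; // client-initiated HTTP/2 streams have odd IDs

            // The racy version deferred this put() to a separately-scheduled
            // task, which could run after the server's reply had already been
            // read. Registering the promise before the write closes the window.
            this.responsePromises.put(streamId, responsePromise);

            context.writeAndFlush(notification);
        }

        void handleResponse(final int streamId, final R response) {
            // With the promise registered up front, this lookup can no longer
            // return null when the server replies quickly.
            this.responsePromises.remove(streamId).trySuccess(response);
        }
    }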

@jchambers (Owner, Author)

Everything is fixed in ed1d764.
