
netty: improve flushing and object allocations in write queue. #1989


Closed
wants to merge 3 commits

Conversation

buchgr (Contributor) commented Jun 28, 2016

This PR contains two separate changes:

  1. Instead of flushing every 128 writes, we now (effectively) only flush after having written every element in the write queue. This will allow us to better utilize gathering write system calls (e.g. on Linux, writev can write 1024 buffers by default). This change only affects WriteQueue.flush(). A rough sketch of the idea follows this list.
  2. Merge all the command objects with QueuedCommand, so as to save one object allocation per write.
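
For illustration, here is a minimal sketch of the idea under assumed names (DrainingWriteQueue and the QueuedCommand accessors are stand-ins for this sketch, not the PR's actual classes). Commands carry their own promise, which is also how change 2 avoids the extra wrapper allocation, and the queue is drained completely before a single flush:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelPromise;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class DrainingWriteQueue {
  private final Channel channel;
  private final Queue<QueuedCommand> queue = new ConcurrentLinkedQueue<>();

  DrainingWriteQueue(Channel channel) {
    this.channel = channel;
  }

  void enqueue(QueuedCommand cmd) {
    queue.add(cmd);
  }

  /** Runs on the channel's event loop. */
  void flush() {
    QueuedCommand cmd;
    try {
      // Write every queued command without flushing in between, so the
      // transport can coalesce the buffers into one gathering writev call.
      while ((cmd = queue.poll()) != null) {
        channel.write(cmd.command(), cmd.promise());
      }
    } finally {
      // One flush for the whole drained batch instead of one every 128 writes.
      channel.flush();
    }
  }

  // Minimal stand-in for the merged command type.
  interface QueuedCommand {
    Object command();
    ChannelPromise promise();
  }
}
```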

I'll add some more tests tomorrow.

buchgr (Contributor Author) commented Jun 28, 2016

Running the NoiseClient from netty/netty#5360 with this change, we write up to 15k buffers per flush, compared to 256 buffers before that.

buchgr (Contributor Author) commented Jun 28, 2016

private final StreamIdHolder stream;
private final Http2Headers headers;
private final boolean endOfStream;

private ChannelPromise promise;
Review comment (Member):

You're extending the base class - I think this promise is unused.

nmittler (Member)

@buchgr great idea! Took a first pass.

while (queue.drainTo(writeChunk, DEQUE_CHUNK_SIZE) > 0) {
while (writeChunk.size() > 0) {
writeChunk.poll().run();
writesBeforeFlush -= writeChunk.size();
Review comment (Contributor):

Would this be simpler as a for loop over queue? I see that there is a lock, but it seems like it would make more sense to atomically compare-and-swap the current queue with a shadow queue and avoid the copy altogether.

Review comment (Contributor Author):

@carl-mastrangelo well, even when swapping the queues, each poll() would still have to acquire a lock. I think that's what @louiscryan had in mind there. I wouldn't expect the lock to be contended, as offer and poll use two separate locks in the LinkedBlockingQueue.

Nevertheless, an interesting idea, which we would have to experiment with. I would prefer to do that in a separate PR, as this one is really focused on the flushing.

Review comment (Member):

Atomic compare and swap of the queue would be broken since producers first read the variable and then write to it. That is a recipe for races.
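
To make the race concrete, here is a hypothetical sketch (these names are illustrative, not from the PR): the producer's read of the queue reference and its add() are two separate steps, so a consumer swap in between can strand the write in a queue that is never drained again.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

final class SwappingQueueRace {
  private final AtomicReference<Queue<Runnable>> queueRef =
      new AtomicReference<>(new ConcurrentLinkedQueue<>());

  void produce(Runnable cmd) {
    Queue<Runnable> q = queueRef.get(); // (1) read the current queue reference
    // If the consumer swaps and drains between (1) and (2) ...
    q.add(cmd);                         // (2) ... this write lands in the retired queue
  }

  void drainAndRun() {
    // Swap in a fresh queue and drain the old one.
    Queue<Runnable> old = queueRef.getAndSet(new ConcurrentLinkedQueue<>());
    for (Runnable r; (r = old.poll()) != null; ) {
      r.run();
    }
    // A produce() that read 'old' before the swap but added after this loop is lost.
  }
}
```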

Review comment (Contributor):

If you are atomically swapping the queues, they don't need to be blocking, right? They can be the regular kind.

You don't have to do it for this PR, but I think the idea is still worth exploring. There may not even need to be multiple Queues, just one that each thread steals when it needs to do work.

- Instead of flushing in chunks of 128 elements, we attempt to always drain the whole write queue before calling flush.
- Merge all the command objects with QueuedCommand, so as to save one object allocation per write.

buchgr force-pushed the writequeue branch 2 times, most recently from 2e39a03 to fb8454b on June 29, 2016 at 13:53
buchgr (Contributor Author) commented Jun 29, 2016

PTAL

nmittler (Member)

@buchgr LGTM ... make it so! :)

carl-mastrangelo (Contributor)

@buchgr LGTM

For the future, any performance improvement PRs should probably have a benchmark and some numbers associated with them. That way when we look back on the blame history, we will know what we gained by making the PR.

channel.write(cmd, cmd.promise());
}
if (writesBeforeFlush <= 0) {
writesBeforeFlush = min(queue.size(), MAX_WRITES_BEFORE_FLUSH);
Review comment (Member):

This min() seems strange, especially since queue.size() can increase while processing. Why not just use MAX_WRITES_BEFORE_FLUSH directly?

Review comment (Contributor Author):

@ejona86 excellent point. Yeah, this code got more complicated than it needs to be. I initially just had queue.size(), then figured it would be a good idea to also cap it, and added this code. I'll just use MAX_WRITES_BEFORE_FLUSH; that way the flushed parameter will make more sense again, too. A rough sketch of the simplified version is below.
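
A rough sketch of that simplification, under assumed names (the class and method are illustrative; only MAX_WRITES_BEFORE_FLUSH and the countdown mirror the snippet under review): reset the countdown to the constant cap rather than to min(queue.size(), MAX_WRITES_BEFORE_FLUSH).

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelPromise;

final class FlushCountdown {
  private static final int MAX_WRITES_BEFORE_FLUSH = 128;

  private final Channel channel;
  private int writesBeforeFlush = MAX_WRITES_BEFORE_FLUSH;

  FlushCountdown(Channel channel) {
    this.channel = channel;
  }

  // Called on the event loop for each dequeued command.
  void runCommand(Object cmd, ChannelPromise promise) {
    channel.write(cmd, promise);
    if (--writesBeforeFlush <= 0) {
      channel.flush();
      // Reset to the constant cap; queue.size() no longer enters the picture.
      writesBeforeFlush = MAX_WRITES_BEFORE_FLUSH;
    }
  }
}
```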

buchgr (Contributor Author) commented Jun 29, 2016

@carl-mastrangelo I have been benchmarking this change quite a bit.

The benchmark below gives us some 25% LESS throughput with this change, on my MBP over loopback, even though we are doing fewer syscalls (checked with dtrace). My theory is that this is due to the extremely low latency of the loopback device (tens of microseconds). I experimented with reducing the batch size and got higher throughput as a result. On my system I seem to get the best results when reducing the batch size to 32 😅.

./qps_client --client_payload=20 --server_payload=40 \
  --outstanding_rpcs=2000 --channels=1 --streaming_rpcs \
  --directexecutor --address=localhost:8080

./qps_server --directexecutor --address=localhost:8080

I am looking into using a network emulator to introduce some latency and verify my theory. Thoughts?

carl-mastrangelo (Contributor)

@buchgr I actually rewrote NoiseClient and noticed that it is in fact sending 1024 buffers at a time, so I think our testing methodologies are actually different.

One other thing that might be worth thinking about: allocating a lot of arrays before flushing increases the likelihood of lots of cache misses. If each of those 1024 buffers ends up in a different cache line, doing a batch of requests will hurt (memory is slow!).

I am not familiar with dtrace, but does it tell you the time each syscall takes with some precision? Is the slowdown linear as you increase the number of buffers? You could confirm this by running sudo perf stat -e L1-dcache-load-misses $CMD on Linux (found using sudo perf list).

ChannelFuture enqueue(Object command, ChannelPromise promise, boolean flush) {
queue.add(new QueuedCommand(command, promise));
ChannelFuture enqueue(QueuedCommand command, ChannelPromise promise, boolean flush) {
// Detect errornous code that tries to reuse command objects.
Review comment (Contributor):

erroneous?

buchgr (Contributor Author) commented Jun 29, 2016

> @buchgr I actually rewrote NoiseClient and noticed that it is in fact sending 1024 buffers at a time, so I think our testing methodologies are actually different.

Can you please share this code? I wonder how this is possible, given that we currently flush every 128 writes to the grpc handler. Each write there will translate to 1-3 ByteBufs, I guess. Also, how did you verify that we are sending 1024 buffers at a time? Note that it's not nioBuffers.length but nioBufferCnt that contains the actual number of buffers Netty writes (just saying, because I made that mistake initially and wrongly convinced myself that we are writing 1024 buffers). I can see Netty writing 1024 buffers at a time once writes start to fail because buffers pile up in ChannelOutboundBuffer.

I'll respond to your other points in about an hour. Need to take a short break. Can't think. Been on this since 7am (it's 9pm here) 😄

carl-mastrangelo (Contributor)

https://gist.github.com/carl-mastrangelo/725ca884fa1f7a99a4b86f50bf89205c

I see you are right, nioBufferCnt is the salient one. Sorry!

buchgr (Contributor Author) commented Jul 1, 2016

I did a lot of benchmarking of this change using a network emulator (netem) and found that, roughly, the higher the network latency, the higher the throughput with this change. However, it only starts to get faster at ~10ms network latency, which is a lot; I would assume 0.1ms or less of latency in a data center, for example.

One thing I noticed is that this change makes the channel's writability change a lot, while with current master it never changes. Netty's Http2ConnectionHandler does expensive work when the writability changes.

Also, as found in #1996, writev with many buffers can be quite slow.

I'll close this for now and open a PR without the flushing changes, as fewer object allocations and the added tests seem like a win either way.

ejona86 (Member) commented Jul 8, 2016

FYI: I've found netem to be very bad for testing since it does not drop packets and triggers very bad bufferbloat which impacts latency. 10ms is not all that high if you consider inter-continental networks, although in that case I'd be surprised if we couldn't saturate the network.

carl-mastrangelo (Contributor) commented Jul 8, 2016

@buchgr I'd still like to have the optimized flushing, even if it only helps a little. If it were already done, it would definitely settle my wondering about whether it's worth doing.

The lock bot locked this conversation as resolved and limited it to collaborators on Jan 21, 2019.