DNM: netty: add handler to combine small writes into larger buffer #1996

Closed
wants to merge 6 commits

Conversation

buchgr (Contributor) commented Jun 30, 2016

This is a proof of concept that combines smaller ByteBufs into a larger ByteBuf by copying. The idea came out of a discussion with @normanmaurer today, where he dropped his wisdom on me :-). Among other things, he mentioned that passing many small buffers to writev can often be much slower than passing one big buffer to write. I decided to give it a try and implemented a WriteCombiningHandler that simply copies many smaller buffers into a bigger buffer. See the benchmark below.

Now, this is still quite early and I have to experiment and benchmark a lot more. This handler does not interact well with a channel's writability yet, and promises aren't implemented correctly. It is also quite dumb in that it simply combines all writes. We could instead only combine buffers smaller than N bytes, and use differently sized buffers to optimize for memory usage (i.e. don't allocate a 4KB buffer to merge 5 writes totalling 250 bytes). Many more sophisticated heuristics are conceivable.

Furthermore, @normanmaurer mentioned that if real performance benefits can be shown, this capability should be added to Netty's ChannelOutboundBuffer.

./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=2000 
--channels=1 --streaming_rpcs --directexecutor --address=localhost:8080
./qps_server --directexecutor --address=localhost:8080

Best out of 3 runs:

Before:

Channels:                       1
Outstanding RPCs per Channel:   2000
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     10367
90%ile Latency (in micros):     13743
95%ile Latency (in micros):     14703
99%ile Latency (in micros):     18255
99.9%ile Latency (in micros):   36159
Maximum Latency (in micros):    129471
QPS:                            189670

After:

Channels:                       1
Outstanding RPCs per Channel:   2000
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     9007
90%ile Latency (in micros):     10799
95%ile Latency (in micros):     11895
99%ile Latency (in micros):     13495
99.9%ile Latency (in micros):   22431
Maximum Latency (in micros):    107327
QPS:                            217794

Thoughts? @ejona86 @nmittler @carl-mastrangelo @louiscryan

normanmaurer:

You also need to ensure you release / flush the buffer when the channel is closed or the handler is removed
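
For example, something along these lines (just a sketch, building on the handler's buffer field above; details are assumptions):

@Override
public void handlerRemoved(ChannelHandlerContext ctx) throws Exception {
  if (buffer != null) {
    if (buffer.readableBytes() > 0) {
      // Don't drop pending bytes: write and flush them before the handler goes away.
      ctx.write(buffer, ctx.voidPromise());
      ctx.flush();
    } else {
      buffer.release();
    }
    buffer = null;
  }
}

@Override
public void close(ChannelHandlerContext ctx, ChannelPromise promise) throws Exception {
  if (buffer != null) {
    // Channel is going away: release the staging buffer instead of leaking it.
    buffer.release();
    buffer = null;
  }
  ctx.close(promise);
}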


public class WriteCombiningHandler extends ChannelOutboundHandlerAdapter {

  private static final int BUFFER_SIZE = 4096;

Contributor:

Magic numbers deserve documentation. In fact, you may want to try testing with different values. 8k would be interesting, as well as powers of two down to maybe 512.
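
For illustration, the constant could be made tunable for such experiments (the property name here is invented):

// Hypothetical: allow overriding the staging buffer size, and document why 4096 is the default.
private static final int BUFFER_SIZE =
    Integer.getInteger("io.grpc.netty.writeCombiningBufferSize", 4096);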

Contributor:

A common write pattern in gRPC is:

little, little, little, BIG.

A larger buffer doesn't help a lot in that case, since you are going to flush pretty much immediately after the first few writes. For example, connection initialization and headers would be the small buffers, and the payload of the RPC the big one.
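
One possible refinement for that pattern is a size threshold, so the BIG buffer is never copied (a sketch; the threshold value is a guess, and copyIntoStagingBuffer/writeCombined are hypothetical helpers along the lines of the sketch in the PR description):

private static final int COMBINE_THRESHOLD = 1024; // assumed value, would need benchmarking

@Override
public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
  if (msg instanceof ByteBuf && ((ByteBuf) msg).readableBytes() <= COMBINE_THRESHOLD) {
    copyIntoStagingBuffer(ctx, (ByteBuf) msg, promise); // little write: copy into the staging buffer
  } else {
    writeCombined(ctx);      // keep ordering: combined bytes go out first
    ctx.write(msg, promise); // BIG buffer (or non-ByteBuf) passes through without copying
  }
}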

buchgr (Contributor Author):

@carl-mastrangelo yes this is just a PoC for this particular benchmark where I know that we only write small buffers. As mentioned in the PR message, we will need better heuristics.

buchgr (Contributor Author):

@carl-mastrangelo I tried several different sizes. 4K seems to be close to ideal for this benchmark :-).

Contributor:

Try it with a larger payload too? I would imagine a 4k payload would make for a more interesting benchmark.

buchgr (Contributor Author):

@carl-mastrangelo yes! My plan is now to experiment with the write and writev system calls directly in a little C program and see how they behave.

@Override
public void flush(ChannelHandlerContext ctx) throws Exception {
  if (buffer != null && buffer.readableBytes() > 0) {
    ctx.write(buffer, ctx.voidPromise());

Member:

Maybe consider slicing the buffer if we still have capacity?
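
For instance (a sketch, assuming Netty 4.1's ByteBuf.readRetainedSlice; it ships only the filled portion and keeps appending into the remaining capacity of the same allocation):

private void writeCombined(ChannelHandlerContext ctx) {
  if (buffer == null || buffer.readableBytes() == 0) {
    return;
  }
  if (buffer.writableBytes() > 0) {
    // Write a retained slice of the filled part; the parent buffer stays around for further copying.
    ctx.write(buffer.readRetainedSlice(buffer.readableBytes()), ctx.voidPromise());
  } else {
    ctx.write(buffer, ctx.voidPromise()); // buffer is full, hand it off entirely
    buffer = null;
  }
}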

buchgr (Contributor Author) commented Jul 9, 2016

I am currently busy with exams :( ... there are some things happening in Netty land though. See netty/netty#5510 and netty/netty#5516

buchgr (Contributor Author) commented Jul 12, 2016

Some thoughts so that I don't forget:

  • Allocate one large buffer (say 16K) and copy small buffers into it. Then slice it, write it and reuse the remaining buffer. We want to amortize allocations across many channel writes.
  • When should we stop copying? If we encounter a buffer larger than N bytes (to be determined).
  • Maybe we should only copy if there are at least M buffers less than N bytes? We might want to keep a running history of the last M buffers and then decide.
  • When to flush? When .flush() gets called, or when we have accumulated enough bytes that adding more would change the channel's writability. We want to avoid triggering a writability change and flush before that happens. We can check that via Channel.unsafe().outboundBuffer().bytesBeforeUnwritable() (see the sketch after this list).
  • Need to benchmark write vs writev. Specifically, merge M buffers + write, vs writev M buffers for different buffer sizes.
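
A rough sketch of the flush heuristic from the fourth point (API usage is approximate; buffer is the handler's staging buffer from the earlier sketch):

private boolean shouldFlushNow(ChannelHandlerContext ctx) {
  // Flush before the accumulated bytes would flip the channel to unwritable.
  ChannelOutboundBuffer out = ctx.channel().unsafe().outboundBuffer();
  long headroom = (out != null) ? out.bytesBeforeUnwritable() : 0;
  return buffer != null && buffer.readableBytes() >= headroom;
}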

ejona86 (Member) commented Jul 22, 2016

We may want the handler to release all buffers when it receives a flush(), since retaining a 16K buffer per idle connection seems a bit much.

buchgr (Contributor Author) commented Jul 30, 2016

@louiscryan @ejona86 @nmittler @normanmaurer

I am having some trouble designing the benchmark with JMH.

I want to measure write + flush performance in Netty. That is, do a bunch of writes and then flush, and see whether combining those writes into a single write followed by a flush is faster. Ideally, I would want a result in something like mbit/s. My first idea was to continuously enqueue a task on the eventloop that performs the writes + flush and, via a future listener, increments a counter by the flushed bytes. However, this approach is flawed in that it can't deal with writes failing (i.e. the send buffer being full). When the send buffer is full, writes will fail and be queued up in the outbound buffer, so essentially the faster we write, the more likely it is that writes will fail.

My idea to circumvent this problem was to instead measure the duration of each sequence of writes + flush and store it in a histogram. I could then compute some kind of mean write + flush duration by ignoring everything above the 90th percentile, or a few standard deviations from the median, effectively ignoring those writes that failed due to e.g. a full send buffer. I implemented this via aux counters in JMH, but that doesn't work because JMH normalizes the aux counter values to ops/sec, and in this benchmark ops/sec is meaningless. The benchmark method is just used to trigger the writes and then waits until all writes are complete (similar to the other gRPC JMH methods). Even if it worked with JMH, it's just really complex to interpret the result, and I would rather have a single throughput figure instead of a histogram.

Another idea is to have two eventloops, one for writing and the other for reading. Set the I/O ratio of the write eventloop to 100 and then write just as many (or fewer) bytes as the send buffer can hold. Measure the time it took from writing the first buffer to the last, and from that compute a throughput value (i.e. 10MB in 200ms => 50MB/s). However, that also seems a bit hacky.

Any thoughts on how to implement a benchmark properly measuring throughput of channel writes + flush?
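
For what it's worth, a bare-bones version of the last idea could look roughly like this (no JMH; channel and chunk are assumed to be set up elsewhere, and the details are made up):

// Hypothetical helper: write copies of 'chunk' until 'budgetBytes' (e.g. the socket send buffer
// size) is reached, flush once, wait for the last write to complete, and derive MB/s.
static double writeFlushThroughput(Channel channel, ByteBuf chunk, long budgetBytes) {
  long written = 0;
  long start = System.nanoTime();
  ChannelFuture last = channel.newSucceededFuture();
  while (written + chunk.readableBytes() <= budgetBytes) {
    last = channel.write(chunk.retainedDuplicate());
    written += chunk.readableBytes();
  }
  channel.flush();
  last.syncUninterruptibly(); // block until the last buffer was actually written out
  double seconds = (System.nanoTime() - start) / 1e9;
  return (written / (1024.0 * 1024.0)) / seconds;
}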

buchgr (Contributor Author) commented Aug 1, 2016

TL;DR: In selected QPS benchmarks the write combining is 13%, 5%, and 10% faster, respectively.

Current master

Jakobs-MacBook-Pro-4:bin buchgr$ ./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=100 --channels=1 --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   100
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     1101
90%ile Latency (in micros):     1279
95%ile Latency (in micros):     1357
99%ile Latency (in micros):     1764
99.9%ile Latency (in micros):   2761
Maximum Latency (in micros):    22879
QPS:                            89280
Jakobs-MacBook-Pro-4:bin buchgr$ ./qps_client --client_payload=300 --server_payload=600 --outstanding_rpcs=100 --channels=1 --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   100
Server Payload Size:            600
Client Payload Size:            300
50%ile Latency (in micros):     1153
90%ile Latency (in micros):     1304
95%ile Latency (in micros):     1378
99%ile Latency (in micros):     1934
99.9%ile Latency (in micros):   2867
Maximum Latency (in micros):    16559
QPS:                            86529

// Streaming RPC
Jakobs-MacBook-Pro-4:bin buchgr$ ./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=2000 --channels=1 --streaming_rpcs --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   2000
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     10431
90%ile Latency (in micros):     12055
95%ile Latency (in micros):     12791
99%ile Latency (in micros):     14831
99.9%ile Latency (in micros):   24927
Maximum Latency (in micros):    99967
QPS:                            188915

With this change

Jakobs-MBP-4:bin buchgr$ ./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=100 --channels=1 --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   100
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     948
90%ile Latency (in micros):     1115
95%ile Latency (in micros):     1197
99%ile Latency (in micros):     1519
99.9%ile Latency (in micros):   2349
Maximum Latency (in micros):    38271
QPS:                            101665

Jakobs-MBP-4:bin buchgr$ ./qps_client --client_payload=300 --server_payload=600 --outstanding_rpcs=100 --channels=1 --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   100
Server Payload Size:            600
Client Payload Size:            300
50%ile Latency (in micros):     1063
90%ile Latency (in micros):     1246
95%ile Latency (in micros):     1355
99%ile Latency (in micros):     1704
99.9%ile Latency (in micros):   2691
Maximum Latency (in micros):    15863
QPS:                            90682

// Streaming RPC
Jakobs-MBP-4:bin buchgr$ ./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=2000 --channels=1 --streaming_rpcs --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   2000
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     9495
90%ile Latency (in micros):     11231
95%ile Latency (in micros):     12135
99%ile Latency (in micros):     13807
99.9%ile Latency (in micros):   19631
Maximum Latency (in micros):    85119
QPS:                            206568

carl-mastrangelo (Contributor):

@buchgr Those numbers look pretty good. Could you also run the FlowControlledMessagesPerSecondBenchmark test before and after, preferably with a large (10?) number of forks?

import java.util.ArrayList;
import java.util.List;

public class WriteCombiningHandler extends ChannelOutboundHandlerAdapter {

@buchgr do you plan to push this to netty?

buchgr (Contributor Author) commented Aug 1, 2016

@normanmaurer yeah sure ... once I get it right :-). The problem I still have is that the heuristic of when to combine is arbitrary. I would ideally like a benchmark that can tell me somewhat accurately where the sweet spot lies. Any thoughts on this?

buchgr (Contributor Author) commented Aug 9, 2016

Opened a clean PR #2146

buchgr closed this Aug 9, 2016
lock bot locked as resolved and limited conversation to collaborators on Jan 21, 2019