DNM: netty: add handler to combine small writes into larger buffer #1996

Closed
wants to merge 6 commits

Conversation

buchgr (Contributor) commented Jun 30, 2016

This is a proof of concept that combines smaller ByteBufs into a larger ByteBuf by copying. The idea came out of a discussion with @normanmaurer today, where he dropped his wisdom on me :-). Among other things, he mentioned that passing many small buffers to writev can often be much slower than passing one big buffer to write. I decided to give it a try and implemented a WriteCombiningHandler that simply copies many smaller buffers into a bigger buffer. See the benchmark below.

Now, this is still quite early and I have to experiment and benchmark a lot more. This handler does not interact well with a channel's writability yet, and promises aren't implemented correctly. It is also quite dumb in that it simply combines all writes. We could instead only combine buffers smaller than N bytes, and use differently sized buffers to optimize for memory usage (i.e. don't allocate a 4KB buffer to merge 5 writes totalling 250 bytes). Many more sophisticated heuristics are conceivable.

Furthermore, @normanmaurer mentioned that if real performance benefits can be shown, this capability should be added to Netty's ChannelOutboundBuffer.

./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=2000 
--channels=1 --streaming_rpcs --directexecutor --address=localhost:8080
./qps_server --directexecutor --address=localhost:8080

Best out of 3 runs:

Before:

Channels:                       1
Outstanding RPCs per Channel:   2000
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     10367
90%ile Latency (in micros):     13743
95%ile Latency (in micros):     14703
99%ile Latency (in micros):     18255
99.9%ile Latency (in micros):   36159
Maximum Latency (in micros):    129471
QPS:                            189670

After:

Channels:                       1
Outstanding RPCs per Channel:   2000
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     9007
90%ile Latency (in micros):     10799
95%ile Latency (in micros):     11895
99%ile Latency (in micros):     13495
99.9%ile Latency (in micros):   22431
Maximum Latency (in micros):    107327
QPS:                            217794

Thoughts? @ejona86 @nmittler @carl-mastrangelo @louiscryan

normanmaurer:

You also need to ensure you release / flush the buffer when the channel is closed or the handler is removed
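
For example, something along these lines (just a sketch, building on the handler's buffer field above; details are assumptions):

@Override
public void handlerRemoved(ChannelHandlerContext ctx) throws Exception {
  if (buffer != null) {
    if (buffer.readableBytes() > 0) {
      // Don't drop pending bytes: write and flush them before the handler goes away.
      ctx.write(buffer, ctx.voidPromise());
      ctx.flush();
    } else {
      buffer.release();
    }
    buffer = null;
  }
}

@Override
public void close(ChannelHandlerContext ctx, ChannelPromise promise) throws Exception {
  if (buffer != null) {
    // Channel is going away: release the staging buffer instead of leaking it.
    buffer.release();
    buffer = null;
  }
  ctx.close(promise);
}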


public class WriteCombiningHandler extends ChannelOutboundHandlerAdapter {

  private static final int BUFFER_SIZE = 4096;

Contributor:

Magic numbers deserve documentation. In fact, you may want to try testing with different values. 8k would be interesting, as well as powers of two down to maybe 512.
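
For illustration, the constant could be made tunable for such experiments (the property name here is invented):

// Hypothetical: allow overriding the staging buffer size, and document why 4096 is the default.
private static final int BUFFER_SIZE =
    Integer.getInteger("io.grpc.netty.writeCombiningBufferSize", 4096);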

Contributor:

A common write pattern in gRPC is:

little, little, little, BIG.

A larger buffer doesn't help a lot in that case, since you are going to flush pretty much immediately after the first few writes. For example, connection initialization and headers would be the small buffers, and the payload of the RPC the big one.
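
One possible refinement for that pattern is a size threshold, so the BIG buffer is never copied (a sketch; the threshold value is a guess, and copyIntoStagingBuffer/writeCombined are hypothetical helpers along the lines of the sketch in the PR description):

private static final int COMBINE_THRESHOLD = 1024; // assumed value, would need benchmarking

@Override
public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
  if (msg instanceof ByteBuf && ((ByteBuf) msg).readableBytes() <= COMBINE_THRESHOLD) {
    copyIntoStagingBuffer(ctx, (ByteBuf) msg, promise); // little write: copy into the staging buffer
  } else {
    writeCombined(ctx);      // keep ordering: combined bytes go out first
    ctx.write(msg, promise); // BIG buffer (or non-ByteBuf) passes through without copying
  }
}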

buchgr (Contributor Author):

@carl-mastrangelo yes this is just a PoC for this particular benchmark where I know that we only write small buffers. As mentioned in the PR message, we will need better heuristics.

buchgr (Contributor Author):

@carl-mastrangelo I tried several different sizes. 4K seems to be close to ideal for this benchmark :-).

Contributor:

Try it with a larger payload too? I would imagine a 4k payload would make for a more interesting benchmark.

buchgr (Contributor Author):

@carl-mastrangelo yes! My plan is now to experiment with the write and writev system calls directly in a little C program and see how they behave.

@Override
public void flush(ChannelHandlerContext ctx) throws Exception {
  if (buffer != null && buffer.readableBytes() > 0) {
    ctx.write(buffer, ctx.voidPromise());

Member:

Maybe consider slicing the buffer if we still have capacity?
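
For instance (a sketch, assuming Netty 4.1's ByteBuf.readRetainedSlice; it ships only the filled portion and keeps appending into the remaining capacity of the same allocation):

private void writeCombined(ChannelHandlerContext ctx) {
  if (buffer == null || buffer.readableBytes() == 0) {
    return;
  }
  if (buffer.writableBytes() > 0) {
    // Write a retained slice of the filled part; the parent buffer stays around for further copying.
    ctx.write(buffer.readRetainedSlice(buffer.readableBytes()), ctx.voidPromise());
  } else {
    ctx.write(buffer, ctx.voidPromise()); // buffer is full, hand it off entirely
    buffer = null;
  }
}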

buchgr (Contributor Author) commented Jul 9, 2016

I am currently busy with exams :( ... there are some things happening in Netty land though. See netty/netty#5510 and netty/netty#5516

buchgr (Contributor Author) commented Jul 12, 2016

Some thoughts so that I don't forget:

  • Allocate one large buffer (say 16K) and copy small buffers into it. Then slice it, write it and reuse the remaining buffer. We want to amortize allocations across many channel writes.
  • When should we stop copying? If we encounter a buffer larger than N bytes (to be determined).
  • Maybe we should only copy if there are at least M buffers less than N bytes? We might want to keep a running history of the last M buffers and then decide.
  • When to flush? When .flush() gets called, or when we have accumulated enough bytes that adding more would change the channel's writability. We want to avoid triggering a writability change and flush before that happens. We can check that via Channel.unsafe().outboundBuffer().bytesBeforeUnwritable() (see the sketch after this list).
  • Need to benchmark write vs writev. Specifically, merge M buffers + write, vs writev M buffers for different buffer sizes.
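
A rough sketch of the flush heuristic from the fourth point (API usage is approximate; buffer is the handler's staging buffer from the earlier sketch):

private boolean shouldFlushNow(ChannelHandlerContext ctx) {
  // Flush before the accumulated bytes would flip the channel to unwritable.
  ChannelOutboundBuffer out = ctx.channel().unsafe().outboundBuffer();
  long headroom = (out != null) ? out.bytesBeforeUnwritable() : 0;
  return buffer != null && buffer.readableBytes() >= headroom;
}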

ejona86 (Member) commented Jul 22, 2016

We may want the handler to release all buffers when it receives a flush(), since retaining a 16K buffer per idle connection seems a bit much.

buchgr (Contributor Author) commented Jul 30, 2016

@louiscryan @ejona86 @nmittler @normanmaurer

I am having some trouble designing the benchmark with JMH.

I want to measure write + flush performance in Netty. That is, do a bunch of writes and then flush, and see whether combining those writes into a single write followed by a flush is faster. Ideally, I would want a result in something like mbit/s. My first idea was to continuously enqueue a task on the eventloop that performs the writes + flush and, via a future listener, increments a counter by the flushed bytes. However, this approach is flawed in that it can't deal with writes failing (i.e. the send buffer being full). When the send buffer is full, writes will fail and be queued up in the outbound buffer, so essentially the faster we write, the more likely it is that writes will fail.

My idea to circumvent this problem was to instead measure the duration of each sequence of writes + flush and store it in a histogram. I could then compute some kind of mean write + flush duration by ignoring everything above the 90th percentile, or a few standard deviations from the median, effectively ignoring those writes that failed due to e.g. a full send buffer. I implemented this via aux counters in JMH, but that doesn't work because JMH normalizes the aux counter values to ops/sec, and in this benchmark ops/sec is meaningless. The benchmark method is just used to trigger the writes and then waits until all writes are complete (similar to the other gRPC JMH methods). Even if it worked with JMH, it's just really complex to interpret the result, and I would rather have a single throughput figure instead of a histogram.

Another idea is to have two eventloops, one for writing and the other for reading. Set the I/O ratio of the write eventloop to 100 and then write just as many (or fewer) bytes as the send buffer can hold. Measure the time it took from writing the first buffer to the last, and from that compute a throughput value (i.e. 10MB in 200ms => 50MB/s). However, that also seems a bit hacky.

Any thoughts on how to implement a benchmark properly measuring throughput of channel writes + flush?
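
For what it's worth, a bare-bones version of the last idea could look roughly like this (no JMH; channel and chunk are assumed to be set up elsewhere, and the details are made up):

// Hypothetical helper: write copies of 'chunk' until 'budgetBytes' (e.g. the socket send buffer
// size) is reached, flush once, wait for the last write to complete, and derive MB/s.
static double writeFlushThroughput(Channel channel, ByteBuf chunk, long budgetBytes) {
  long written = 0;
  long start = System.nanoTime();
  ChannelFuture last = channel.newSucceededFuture();
  while (written + chunk.readableBytes() <= budgetBytes) {
    last = channel.write(chunk.retainedDuplicate());
    written += chunk.readableBytes();
  }
  channel.flush();
  last.syncUninterruptibly(); // block until the last buffer was actually written out
  double seconds = (System.nanoTime() - start) / 1e9;
  return (written / (1024.0 * 1024.0)) / seconds;
}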

buchgr (Contributor Author) commented Aug 1, 2016

TL;DR: In selected QPS benchmarks the write combining is 13%, 5%, and 10% faster, respectively.

Current master

Jakobs-MacBook-Pro-4:bin buchgr$ ./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=100 --channels=1 --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   100
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     1101
90%ile Latency (in micros):     1279
95%ile Latency (in micros):     1357
99%ile Latency (in micros):     1764
99.9%ile Latency (in micros):   2761
Maximum Latency (in micros):    22879
QPS:                            89280
Jakobs-MacBook-Pro-4:bin buchgr$ ./qps_client --client_payload=300 --server_payload=600 --outstanding_rpcs=100 --channels=1 --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   100
Server Payload Size:            600
Client Payload Size:            300
50%ile Latency (in micros):     1153
90%ile Latency (in micros):     1304
95%ile Latency (in micros):     1378
99%ile Latency (in micros):     1934
99.9%ile Latency (in micros):   2867
Maximum Latency (in micros):    16559
QPS:                            86529

// Streaming RPC
Jakobs-MacBook-Pro-4:bin buchgr$ ./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=2000 --channels=1 --streaming_rpcs --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   2000
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     10431
90%ile Latency (in micros):     12055
95%ile Latency (in micros):     12791
99%ile Latency (in micros):     14831
99.9%ile Latency (in micros):   24927
Maximum Latency (in micros):    99967
QPS:                            188915

With this change

Jakobs-MBP-4:bin buchgr$ ./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=100 --channels=1 --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   100
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     948
90%ile Latency (in micros):     1115
95%ile Latency (in micros):     1197
99%ile Latency (in micros):     1519
99.9%ile Latency (in micros):   2349
Maximum Latency (in micros):    38271
QPS:                            101665

Jakobs-MBP-4:bin buchgr$ ./qps_client --client_payload=300 --server_payload=600 --outstanding_rpcs=100 --channels=1 --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   100
Server Payload Size:            600
Client Payload Size:            300
50%ile Latency (in micros):     1063
90%ile Latency (in micros):     1246
95%ile Latency (in micros):     1355
99%ile Latency (in micros):     1704
99.9%ile Latency (in micros):   2691
Maximum Latency (in micros):    15863
QPS:                            90682

// Streaming RPC
Jakobs-MBP-4:bin buchgr$ ./qps_client --client_payload=20 --server_payload=40 --outstanding_rpcs=2000 --channels=1 --streaming_rpcs --directexecutor --address=localhost:8080
Channels:                       1
Outstanding RPCs per Channel:   2000
Server Payload Size:            40
Client Payload Size:            20
50%ile Latency (in micros):     9495
90%ile Latency (in micros):     11231
95%ile Latency (in micros):     12135
99%ile Latency (in micros):     13807
99.9%ile Latency (in micros):   19631
Maximum Latency (in micros):    85119
QPS:                            206568

carl-mastrangelo (Contributor):

@buchgr Those numbers look pretty good. Could you also run the FlowControlledMessagesPerSecondBenchmark test before and after, preferably with a large (10?) number of forks?

import java.util.ArrayList;
import java.util.List;

public class WriteCombiningHandler extends ChannelOutboundHandlerAdapter {

@buchgr do you plan to push this to netty?

buchgr (Contributor Author) commented Aug 1, 2016

@normanmaurer yeah sure ... once I get it right :-). The problem I still have is that the heuristic of when to combine is arbitrary. I would ideally like a benchmark that can tell me somewhat accurately where the sweet spot lies. Any thoughts on this?

buchgr (Contributor Author) commented Aug 9, 2016

Opened a clean PR #2146

buchgr closed this Aug 9, 2016
lock bot locked as resolved and limited conversation to collaborators on Jan 21, 2019