DNM: netty: add handler to combine small writes into larger buffer #1996
Conversation
You also need to ensure you release / flush the buffer when the channel is closed or the handler is removed.
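A minimal sketch of that cleanup (illustrative only, assuming the buffer field this PR already uses; a real version might flush pending data rather than drop it):

@Override
public void handlerRemoved(ChannelHandlerContext ctx) throws Exception {
    releasePending();
    super.handlerRemoved(ctx);
}

@Override
public void close(ChannelHandlerContext ctx, ChannelPromise promise) throws Exception {
    releasePending();
    ctx.close(promise);
}

// Drops any bytes that were combined but never flushed.
private void releasePending() {
    if (buffer != null) {
        buffer.release();
        buffer = null;
    }
}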
public class WriteCombiningHandler extends ChannelOutboundHandlerAdapter {

    private static final int BUFFER_SIZE = 4096;
Magic numbers deserve documentation. In fact, you may want to try testing with different values. 8k would be interesting, as well as powers of two down to maybe 512.
A common write pattern in gRPC is the following:
little, little, little, BIG.
A larger buffer doesn't help a lot in that case, since you are going to flush pretty much immediately after the first few writes. For example, connection initialization and headers would be the small buffers, and the payload of the RPC the big one.
@carl-mastrangelo yes, this is just a PoC for this particular benchmark, where I know that we only write small buffers. As mentioned in the PR message, we will need better heuristics.
@carl-mastrangelo I tried several different sizes. 4K seems to be close to ideal for this benchmark :-).
Try it with a larger payload too? I would imagine a 4k payload would make for a more interesting benchmark.
@carl-mastrangelo yes! My plan now is to experiment with the write and writev system calls directly in a little C program and see how they behave.
@Override
public void flush(ChannelHandlerContext ctx) throws Exception {
    if (buffer != null && buffer.readableBytes() > 0) {
        ctx.write(buffer, ctx.voidPromise());
Maybe consider slicing the buffer if we still have capacity?
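One way to read that suggestion, as a sketch only (assumes Netty 4.1's ByteBuf.retainedSlice(); the PR itself hands off the whole buffer): write out a retained slice of the bytes accumulated so far and keep appending into the remaining writable space.

@Override
public void flush(ChannelHandlerContext ctx) throws Exception {
    if (buffer != null && buffer.isReadable()) {
        // The slice covers only the combined bytes and retains the underlying
        // memory until the socket write completes.
        ctx.write(buffer.retainedSlice(), ctx.voidPromise());
        // Mark those bytes as consumed; later writes keep appending after them
        // into the capacity that is still free.
        buffer.readerIndex(buffer.writerIndex());
    }
    ctx.flush();
}

Whether that wins anything depends on how often the remaining capacity actually gets reused before the buffer fills up and has to be replaced anyway.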
I am currently busy with exams :( ... there is something happening in Netty land though. See netty/netty#5510 and netty/netty#5516
Some thoughts so that I don't forget.

We may want the handler to release all buffers when it receives a flush(), since retaining a 16K buffer per idle connection seems a bit much.
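A sketch of that variant (again assuming the buffer field from this PR; otherwise illustrative): hand the combined bytes off on flush() so that idle connections hold no buffer at all.

@Override
public void flush(ChannelHandlerContext ctx) throws Exception {
    if (buffer != null) {
        if (buffer.isReadable()) {
            // Ownership passes to the outbound buffer, which releases the
            // bytes once they have been written to the socket.
            ctx.write(buffer, ctx.voidPromise());
        } else {
            buffer.release();
        }
        // Don't keep a large buffer alive per idle connection; allocate a
        // fresh one lazily on the next write instead.
        buffer = null;
    }
    ctx.flush();
}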
@louiscryan @ejona86 @nmittler @normanmaurer I am having some trouble designing the benchmark with JMH. I want to measure My idea to circumvent this problem was to instead measure the duration of each sequence of

Another idea is to have two event loops, one for writing and the other for reading. Set the I/O ratio of the write event loop to 100 and then write just as many (or fewer) bytes as the send buffer can hold. Measure the time it took from writing the first buffer to the last and from that compute a throughput value (i.e. 10MB in 200ms => 50MB/s). However, that also seems a bit hacky.

Any thoughts on how to implement a benchmark that properly measures the throughput of channel writes + flush?
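For the second idea, a rough sketch of the measurement itself (hypothetical helper, not part of this PR; assumes buffers is non-empty and totalBytes fits within the socket send buffer, so no write stalls on flow control):

import io.netty.buffer.ByteBuf;
import io.netty.channel.Channel;
import io.netty.channel.ChannelFuture;
import java.util.List;

final class WriteThroughputProbe {
    // Writes the given buffers, then reports MB/s once the last write completes.
    static void measure(Channel channel, List<ByteBuf> buffers, long totalBytes) {
        long start = System.nanoTime();
        ChannelFuture last = null;
        for (ByteBuf buf : buffers) {
            last = channel.write(buf);
        }
        channel.flush();
        last.addListener(future -> {
            double seconds = (System.nanoTime() - start) / 1e9;
            double mbPerSecond = (totalBytes / (1024.0 * 1024.0)) / seconds;
            System.out.printf("~%.1f MB/s (%d bytes in %.3f s)%n", mbPerSecond, totalBytes, seconds);
        });
    }
}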
TL;DR: In selected QPS benchmarks the write combining is 13%, 5%, and 10% faster.

Current master:

With this change:
@buchgr Those numbers look pretty good. Could you also run the FlowControlledMessagesPerSecondBenchmark test before and after? Preferably with a large (10?) number of forks.
import java.util.ArrayList;
import java.util.List;

public class WriteCombiningHandler extends ChannelOutboundHandlerAdapter {
@buchgr do you plan to push this to Netty?
@normanmaurer yeah sure ... once I get it right :-). The problem I still have is that the heuristic of when to combine is still arbitrary. I would ideally like a benchmark that can tell me somewhat accurately where the sweet spot lies. Any thoughts on this?
Opened a clean PR #2146.
This is a proof of concept that combines smaller ByteBufs into a larger ByteBuf by copying. The idea came out of a discussion with @normanmaurer today, where he dropped his wisdom on me :-). Among other things he mentioned that passing many small buffers to writev can often be much slower than passing one big buffer to write. I decided to give it a try and implemented a WriteCombiningHandler that simply copies many smaller buffers into a bigger buffer. See the benchmark below.

Now, this is still quite early and I have to experiment and benchmark a lot more. This handler does not interact well with a channel's writability, and promises aren't implemented correctly (yet). Also, the handler is quite dumb in that it just combines all writes. We could instead only combine buffers smaller than N bytes, and use differently sized buffers to optimize for memory usage (i.e. don't allocate a 4KB buffer to merge 5 writes totalling 250 bytes). Many more sophisticated heuristics are conceivable; a rough sketch of the size-threshold idea follows below.
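Sketch of the size-threshold idea (illustrative only; COMBINE_THRESHOLD is hypothetical, BUFFER_SIZE is the constant from this PR, and real promise and overflow handling are exactly the open questions mentioned above):

@Override
public void write(ChannelHandlerContext ctx, Object msg, ChannelPromise promise) throws Exception {
    // Only combine small buffers; pass big payloads (and non-ByteBuf messages) through untouched.
    if (!(msg instanceof ByteBuf) || ((ByteBuf) msg).readableBytes() >= COMBINE_THRESHOLD) {
        ctx.write(msg, promise);
        return;
    }
    ByteBuf small = (ByteBuf) msg;
    if (buffer == null) {
        buffer = ctx.alloc().directBuffer(BUFFER_SIZE);
    }
    // Note: a real implementation must handle the combined buffer filling up
    // and must complete `promise` correctly; this sketch simply succeeds it.
    buffer.writeBytes(small);
    small.release();
    promise.trySuccess();
}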
Furthermore, @normanmaurer mentioned that if real performance benefits can be shown, this capability should be added to Netty's ChannelOutboundBuffer.

Best out of 3 runs:
Before:
After:
Thoughts? @ejona86 @nmittler @carl-mastrangelo @louiscryan