netty: improve flushing and object allocations in write queue. #1989
Conversation
9a62442 to 985c66f
private final StreamIdHolder stream;
private final Http2Headers headers;
private final boolean endOfStream;
private ChannelPromise promise;
You're extending the base class - I think this promise is unused.
@buchgr great idea! Took a first pass.
while (queue.drainTo(writeChunk, DEQUE_CHUNK_SIZE) > 0) {
while (writeChunk.size() > 0) {
writeChunk.poll().run();
writesBeforeFlush -= writeChunk.size();
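For readers without the full diff, here is a minimal, self-contained sketch of the chunked-drain pattern this excerpt belongs to. It is illustrative only: DEQUE_CHUNK_SIZE mirrors the constant above, while the Runnable queue and the println standing in for channel.flush() are assumptions, not the PR's actual code.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative stand-in for the write queue under review: drain the shared queue
// into a local chunk (bounding the copy), run every queued write, then flush once
// the queue has been emptied.
final class ChunkedDrainSketch {
  private static final int DEQUE_CHUNK_SIZE = 128;

  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Queue<Runnable> writeChunk = new ArrayDeque<>(DEQUE_CHUNK_SIZE);

  void enqueue(Runnable command) {
    queue.add(command);
  }

  void flush() {
    while (queue.drainTo(writeChunk, DEQUE_CHUNK_SIZE) > 0) {
      while (!writeChunk.isEmpty()) {
        writeChunk.poll().run();
      }
    }
    // The real code flushes the Netty channel here (and, in this PR, also after
    // every MAX_WRITES_BEFORE_FLUSH writes); a println stands in for that.
    System.out.println("flush");
  }

  public static void main(String[] args) {
    ChunkedDrainSketch q = new ChunkedDrainSketch();
    for (int i = 0; i < 300; i++) {
      q.enqueue(() -> {});
    }
    q.flush(); // runs 300 queued writes, then prints "flush" once
  }
}
```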
Would this be simpler as a for loop over the queue? I see that there is a lock, but it seems like it would make more sense to atomically compare-and-swap the current queue with a shadow queue, and avoid the copy altogether.
@carl-mastrangelo well even when swapping the queues, each poll() would still have to acquire a lock. I think that's what @louiscryan had in mind there. I wouldn't expect the lock to be contended, as offer and poll use two separate locks in the LinkedBlockingQueue.

Nevertheless, an interesting idea, which we would have to experiment with. I would prefer to do that in a separate PR, as this one is really focused on the flushing.
Atomic compare and swap of the queue would be broken since producers first read the variable and then write to it. That is a recipe for races.
If you are atomically swapping the queues, they don't need to be blocking, right? They can be the regular kind.
You don't have to do it for this PR, but I think the idea is still worth exploring. There may not even need to be multiple queues, just one that each thread steals when it needs to do work.
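For illustration, here is a hypothetical sketch of that queue-swap idea in plain Java; AtomicReference and ConcurrentLinkedQueue are my stand-ins, not anyone's proposed implementation, and the producer race described above is marked in a comment.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of "steal the whole queue" flushing. As noted above, a naive
// read-then-add producer can race with the swap: a command added to the old queue
// after getAndSet() would never be flushed by this pass.
final class QueueSwapSketch {
  private final AtomicReference<Queue<Runnable>> active =
      new AtomicReference<>(new ConcurrentLinkedQueue<>());

  void enqueue(Runnable command) {
    // Race: the reference may be swapped between get() and add().
    active.get().add(command);
  }

  void flush() {
    // Steal the current queue and install an empty one for new writes.
    Queue<Runnable> stolen = active.getAndSet(new ConcurrentLinkedQueue<>());
    Runnable cmd;
    while ((cmd = stolen.poll()) != null) {
      cmd.run();
    }
    // channel.flush() would go here in the real write queue.
  }

  public static void main(String[] args) {
    QueueSwapSketch q = new QueueSwapSketch();
    q.enqueue(() -> System.out.println("write"));
    q.flush(); // prints "write"
  }
}
```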
- Instead of flushing in chunks of 128 elements, we attempt to always drain the whole write queue before calling flush.
- Merge all the command objects with QueuedCommand so as to save one object allocation per write.
2e39a03 to fb8454b
PTAL
@buchgr LGTM ... make it so! :)
@buchgr LGTM. For the future, any performance-improvement PRs should probably have a benchmark and some numbers associated with them. That way, when we look back on the blame history, we will know what we gained by making the PR.
channel.write(cmd, cmd.promise());
}
if (writesBeforeFlush <= 0) {
writesBeforeFlush = min(queue.size(), MAX_WRITES_BEFORE_FLUSH);
This min() seems strange, especially since queue.size() can increase while processing. Why not just use MAX_WRITES_BEFORE_FLUSH directly?
@ejona86 excellent point. Yeah, this code got more complicated than it needs to be. I initially just had queue.size(), then figured it would be a good idea to also cap it, and added this code. I'll just use MAX_WRITES_BEFORE_FLUSH. That way the flushed parameter will make more sense again too.
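A hypothetical sketch of what that simplification amounts to: a fixed batch of MAX_WRITES_BEFORE_FLUSH writes between flushes, plus a final flush for the partial tail. The Runnable flush callback stands in for channel.flush(); none of this is the PR's actual code.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch of flushing every MAX_WRITES_BEFORE_FLUSH writes,
// independent of the queue's current size.
final class FixedBatchFlushSketch {
  private static final int MAX_WRITES_BEFORE_FLUSH = 128;

  static void drainAndFlush(Queue<Runnable> queue, Runnable flush) {
    int writesSinceFlush = 0;
    Runnable cmd;
    while ((cmd = queue.poll()) != null) {
      cmd.run();
      if (++writesSinceFlush == MAX_WRITES_BEFORE_FLUSH) {
        flush.run();
        writesSinceFlush = 0;
      }
    }
    if (writesSinceFlush > 0) {
      flush.run(); // flush the partial tail batch
    }
  }

  public static void main(String[] args) {
    Queue<Runnable> q = new ArrayDeque<>();
    for (int i = 0; i < 300; i++) {
      q.add(() -> {});
    }
    drainAndFlush(q, () -> System.out.println("flush")); // prints "flush" 3 times
  }
}
```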
@carl-mastrangelo I have been benchmarking this change quite a bit. The benchmark below gives us some 25% less throughput with this change on my MBP over loopback, even though we are doing fewer syscalls (checked with dtrace). My theory is that this is due to the extremely low latency of the loopback device (tens of micros). I experimented with reducing the batch size and, as a result, I get higher throughput. On my system I seem to get the best results when reducing the batch size to 32 😅.
I am looking into using a network emulator to introduce some latency and such, and verify my theory. Thoughts?
@buchgr I actually rewrote NoiseClient and noticed that it is in fact sending 1024 buffers at a time, so I think our testing methodologies are actually different. One other thing that might be worth thinking about: allocating a lot of arrays before flushing increases the likelihood of lots of cache misses. If each of those 1024 buffers ends up in a different cache line, doing a batch of requests will hurt (memory is slow!). I am not familiar with dtrace, but does it tell you the time each syscall takes with some precision? Is the slowdown linear as you increase the number of buffers? You could confirm this by running
ChannelFuture enqueue(Object command, ChannelPromise promise, boolean flush) {
queue.add(new QueuedCommand(command, promise));
ChannelFuture enqueue(QueuedCommand command, ChannelPromise promise, boolean flush) {
// Detect errornous code that tries to reuse command objects.
erroneous?
Can you please share this code? I wonder how this is possible, given that we currently flush every 128 writes to the grpc handler. Each write there will translate to 1-3 ByteBufs, I guess. Also, how did you verify that we are sending 1024 buffers at a time? Note that it's not

I'll respond to your other points in about an hour. Need to take a short break. Can't think. Been on this since 7am (9pm here) 😄
https://gist.github.com/carl-mastrangelo/725ca884fa1f7a99a4b86f50bf89205c

I see you are right, nioBufferCnt is the salient one. Sorry!
I did a lot of benchmarking with this change and, by using a network emulator (netem), I found that, roughly, the higher the network latency, the higher the throughput of this change. However, it only starts to get faster at ~10ms network latency, which is a lot. I would assume 0.1ms or less of latency in a data center, for example. One thing I noticed is that this change makes the channel's writability change a lot, while with current master it never changes. Netty's

Also, as found in #1996, I'll close this for now and open a PR without the flushing changes, as fewer objects and the tests seem like a win either way.
FYI: I've found netem to be very bad for testing since it does not drop packets and triggers very bad bufferbloat, which impacts latency. 10ms is not all that high if you consider inter-continental networks, although in that case I'd be surprised if we couldn't saturate the network.
@buchgr I'd still like to have the optimized flushing, even if it only helps a little. It would definitely sate my wondering about whether it's worth doing, if it was already done.
This PR contains two separate changes:
- Instead of flushing in chunks of 128 elements, we attempt to always drain the whole write queue before calling flush (writev can write 1024 buffers per default). This change only affects WriteQueue.flush().
- Merge all the command objects with QueuedCommand so as to save one object allocation per write (a rough sketch of this idea follows below).

I'll add some more tests tomorrow.
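To make the second change concrete, here is a hypothetical sketch of the idea rather than the actual grpc-java classes: the command object itself is the queue entry and carries its promise, so enqueue() no longer allocates a separate wrapper per write. CompletableFuture stands in for Netty's ChannelPromise, and flush scheduling is omitted.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for the merged command type: the command is the queue
// entry and carries its own promise.
abstract class QueuedCommand implements Runnable {
  private CompletableFuture<Void> promise; // stand-in for ChannelPromise

  final CompletableFuture<Void> promise() {
    return promise;
  }

  final void promise(CompletableFuture<Void> promise) {
    this.promise = promise;
  }
}

final class WriteQueueSketch {
  private final Queue<QueuedCommand> queue = new ConcurrentLinkedQueue<>();

  // The command itself is enqueued, so no per-write wrapper object is allocated.
  // The flush parameter is kept to mirror the excerpt above; scheduling is omitted.
  CompletableFuture<Void> enqueue(QueuedCommand command, CompletableFuture<Void> promise, boolean flush) {
    // Detect erroneous code that tries to reuse command objects.
    if (command.promise() != null) {
      throw new IllegalStateException("command already enqueued");
    }
    command.promise(promise);
    queue.add(command);
    return promise;
  }

  public static void main(String[] args) {
    WriteQueueSketch wq = new WriteQueueSketch();
    QueuedCommand cmd = new QueuedCommand() {
      @Override public void run() { System.out.println("write"); }
    };
    wq.enqueue(cmd, new CompletableFuture<>(), false);
  }
}
```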