
netty: improve flushing and object allocations in write queue. #1989


Closed
wants to merge 3 commits

Conversation

buchgr (Contributor) commented Jun 28, 2016

This PR contains two separate changes:

  1. Instead of flushing every 128 writes, we now (effectively) only flush after having written every element in the write queue. This will allow us to better utilize gathering write system calls (e.g. on Linux, writev can write 1024 buffers by default). This change only affects WriteQueue.flush(). A rough sketch of the idea follows this list.
  2. Merge all the command objects with QueuedCommand, so as to save one object allocation per write.
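
For illustration, here is a minimal sketch of the idea under assumed names (DrainingWriteQueue and the QueuedCommand accessors are stand-ins for this sketch, not the PR's actual classes). Commands carry their own promise, which is also how change 2 avoids the extra wrapper allocation, and the queue is drained completely before a single flush:

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelPromise;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

final class DrainingWriteQueue {
  private final Channel channel;
  private final Queue<QueuedCommand> queue = new ConcurrentLinkedQueue<>();

  DrainingWriteQueue(Channel channel) {
    this.channel = channel;
  }

  void enqueue(QueuedCommand cmd) {
    queue.add(cmd);
  }

  /** Runs on the channel's event loop. */
  void flush() {
    QueuedCommand cmd;
    try {
      // Write every queued command without flushing in between, so the
      // transport can coalesce the buffers into one gathering writev call.
      while ((cmd = queue.poll()) != null) {
        channel.write(cmd.command(), cmd.promise());
      }
    } finally {
      // One flush for the whole drained batch instead of one every 128 writes.
      channel.flush();
    }
  }

  // Minimal stand-in for the merged command type.
  interface QueuedCommand {
    Object command();
    ChannelPromise promise();
  }
}
```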

I'll add some more tests tomorrow.

buchgr (Contributor Author) commented Jun 28, 2016

Running the NoiseClient from netty/netty#5360 with this change, we write up to 15k buffers per flush, compared to 256 buffers before that.

buchgr (Contributor Author) commented Jun 28, 2016

private final StreamIdHolder stream;
private final Http2Headers headers;
private final boolean endOfStream;

private ChannelPromise promise;
Review comment (Member):

You're extending the base class - I think this promise is unused.

nmittler (Member)

@buchgr great idea! Took a first pass.

while (queue.drainTo(writeChunk, DEQUE_CHUNK_SIZE) > 0) {
while (writeChunk.size() > 0) {
writeChunk.poll().run();
writesBeforeFlush -= writeChunk.size();
Review comment (Contributor):

Would this be simpler as a for loop over queue? I see that there is a lock, but it seems like it would make more sense to atomically compare-and-swap the current queue with a shadow queue and avoid the copy altogether.

Review comment (Contributor Author):

@carl-mastrangelo well, even when swapping the queues, each poll() would still have to acquire a lock. I think that's what @louiscryan had in mind there. I wouldn't expect the lock to be contended, as offer and poll use two separate locks in the LinkedBlockingQueue.

Nevertheless, an interesting idea, which we would have to experiment with. I would prefer to do that in a separate PR, as this one is really focused on the flushing.

Review comment (Member):

Atomic compare and swap of the queue would be broken since producers first read the variable and then write to it. That is a recipe for races.
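
To make the race concrete, here is a hypothetical sketch (these names are illustrative, not from the PR): the producer's read of the queue reference and its add() are two separate steps, so a consumer swap in between can strand the write in a queue that is never drained again.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

final class SwappingQueueRace {
  private final AtomicReference<Queue<Runnable>> queueRef =
      new AtomicReference<>(new ConcurrentLinkedQueue<>());

  void produce(Runnable cmd) {
    Queue<Runnable> q = queueRef.get(); // (1) read the current queue reference
    // If the consumer swaps and drains between (1) and (2) ...
    q.add(cmd);                         // (2) ... this write lands in the retired queue
  }

  void drainAndRun() {
    // Swap in a fresh queue and drain the old one.
    Queue<Runnable> old = queueRef.getAndSet(new ConcurrentLinkedQueue<>());
    for (Runnable r; (r = old.poll()) != null; ) {
      r.run();
    }
    // A produce() that read 'old' before the swap but added after this loop is lost.
  }
}
```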

Review comment (Contributor):

If you are atomically swapping the queues, they don't need to be blocking, right? They can be the regular kind.

You don't have to do it for this PR, but I think the idea is still worth exploring. There may not even need to be multiple Queues, just one that each thread steals when it needs to do work.

- Instead of flushing in chunks of 128 elements, we attempt to always drain the whole write queue before calling flush.
- Merge all the command objects with QueuedCommand, so as to save one object allocation per write.

buchgr force-pushed the writequeue branch 2 times, most recently from 2e39a03 to fb8454b on June 29, 2016 at 13:53
buchgr (Contributor Author) commented Jun 29, 2016

PTAL

nmittler (Member)

@buchgr LGTM ... make it so! :)

carl-mastrangelo (Contributor)

@buchgr LGTM

For the future, any performance improvement PRs should probably have a benchmark and some numbers associated with them. That way when we look back on the blame history, we will know what we gained by making the PR.

channel.write(cmd, cmd.promise());
}
if (writesBeforeFlush <= 0) {
writesBeforeFlush = min(queue.size(), MAX_WRITES_BEFORE_FLUSH);
Review comment (Member):

This min() seems strange, especially since queue.size() can increase while processing. Why not just use MAX_WRITES_BEFORE_FLUSH directly?

Review comment (Contributor Author):

@ejona86 excellent point. Yeah, this code got more complicated than it needs to be. I initially just had queue.size(), then figured it would be a good idea to also cap it, and added this code. I'll just use MAX_WRITES_BEFORE_FLUSH; that way the flushed parameter will make more sense again, too. A rough sketch of the simplified version is below.
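
A rough sketch of that simplification, under assumed names (the class and method are illustrative; only MAX_WRITES_BEFORE_FLUSH and the countdown mirror the snippet under review): reset the countdown to the constant cap rather than to min(queue.size(), MAX_WRITES_BEFORE_FLUSH).

```java
import io.netty.channel.Channel;
import io.netty.channel.ChannelPromise;

final class FlushCountdown {
  private static final int MAX_WRITES_BEFORE_FLUSH = 128;

  private final Channel channel;
  private int writesBeforeFlush = MAX_WRITES_BEFORE_FLUSH;

  FlushCountdown(Channel channel) {
    this.channel = channel;
  }

  // Called on the event loop for each dequeued command.
  void runCommand(Object cmd, ChannelPromise promise) {
    channel.write(cmd, promise);
    if (--writesBeforeFlush <= 0) {
      channel.flush();
      // Reset to the constant cap; queue.size() no longer enters the picture.
      writesBeforeFlush = MAX_WRITES_BEFORE_FLUSH;
    }
  }
}
```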

buchgr (Contributor Author) commented Jun 29, 2016

@carl-mastrangelo I have been benchmarking this change quite a bit.

The benchmark below gives us some 25% LESS throughput with this change, on my MBP over loopback, even though we are doing fewer syscalls (checked with dtrace). My theory is that this is due to the extremely low latency of the loopback device (tens of microseconds). I experimented with reducing the batch size and got higher throughput as a result. On my system I seem to get the best results when reducing the batch size to 32 😅.

./qps_client --client_payload=20 --server_payload=40 \
  --outstanding_rpcs=2000 --channels=1 --streaming_rpcs \
  --directexecutor --address=localhost:8080

./qps_server --directexecutor --address=localhost:8080

I am looking into using a network emulator to introduce some latency and verify my theory. Thoughts?

carl-mastrangelo (Contributor)

@buchgr I actually rewrote NoiseClient and noticed that it is in fact sending 1024 buffers at a time, so I think our testing methodologies are actually different.

One other thing that might be worth thinking about: allocating a lot of arrays before flushing increases the likelihood of lots of cache misses. If each of those 1024 buffers ends up in a different cache line, doing a batch of requests will hurt (memory is slow!).

I am not familiar with dtrace, but does it tell you the time each syscall takes with some precision? Is the slowdown linear as you increase the number of buffers? You could confirm this by running sudo perf stat -e L1-dcache-load-misses $CMD on Linux (found using sudo perf list).

ChannelFuture enqueue(Object command, ChannelPromise promise, boolean flush) {
queue.add(new QueuedCommand(command, promise));
ChannelFuture enqueue(QueuedCommand command, ChannelPromise promise, boolean flush) {
// Detect errornous code that tries to reuse command objects.
Review comment (Contributor):

erroneous?

buchgr (Contributor Author) commented Jun 29, 2016

> @buchgr I actually rewrote NoiseClient and noticed that it is in fact sending 1024 buffers at a time, so I think our testing methodologies are actually different.

Can you please share this code? I wonder how this is possible, given that we currently flush every 128 writes to the grpc handler. Each write there will translate to 1-3 ByteBufs, I guess. Also, how did you verify that we are sending 1024 buffers at a time? Note that it's not nioBuffers.length but nioBufferCnt that contains the actual number of buffers Netty writes (just saying, because I made that mistake initially and wrongly convinced myself that we are writing 1024 buffers). I can see Netty writing 1024 buffers at a time once writes start to fail because buffers pile up in ChannelOutboundBuffer.

I'll respond to your other points in about an hour. Need to take a short break. Can't think. Been on this since 7am (it's 9pm here) 😄

carl-mastrangelo (Contributor)

https://gist.github.com/carl-mastrangelo/725ca884fa1f7a99a4b86f50bf89205c

I see you are right, nioBufferCnt is the salient one. Sorry!

buchgr (Contributor Author) commented Jul 1, 2016

I did a lot of benchmarking of this change using a network emulator (netem) and found that, roughly, the higher the network latency, the higher the throughput with this change. However, it only starts to get faster at ~10ms network latency, which is a lot; I would assume 0.1ms or less of latency in a data center, for example.

One thing I noticed is that this change makes the channel's writability change a lot, while with current master it never changes. Netty's Http2ConnectionHandler does expensive work when the writability changes.

Also, as found in #1996, writev with many buffers can be quite slow.

I'll close this for now and open a PR without the flushing changes, as fewer object allocations and the added tests seem like a win either way.

ejona86 (Member) commented Jul 8, 2016

FYI: I've found netem to be very bad for testing since it does not drop packets and triggers very bad bufferbloat which impacts latency. 10ms is not all that high if you consider inter-continental networks, although in that case I'd be surprised if we couldn't saturate the network.

carl-mastrangelo (Contributor) commented Jul 8, 2016

@buchgr I'd still like to have the optimized flushing, even if it only helps a little. If it were already done, it would definitely settle my wondering about whether it's worth doing.

The lock bot locked this conversation as resolved and limited it to collaborators on Jan 21, 2019.