Description
I've been digging into the "merges can fall behind" problem at high
indexing rates, and I discovered some serious issues with the IO
throttling, whose default we recently (#5902) raised from 20 MB/sec to
50 MB/sec. Net/net I think when we ask for 50 MB/sec today we are
really throttling at something like 8 MB/sec!
Details:
I indexed a bunch of small log-file type docs into 1 shard, 0
replicas, using 1 sync _bulk client, to the point where it did its
first big-ish merge (611 MB, 440K docs); the merge does not use CFS so
it's really writing 611 MB. I'm using a fast SSD.
With no throttling (index.store.throttle.type=none), the merge takes
20.8 seconds.
With the default 50 MB/sec merge throttling, it takes 72.1 sec, which
is far too long (611 MB / 50 MB/sec = 12.2 sec). The rate limiter
enforces the instantaneous rate, so at worst the merge time should
have been 20.8 + 12.2 = 33 sec, but likely much less than that because
merging takes CPU time.
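To make the arithmetic above easy to re-check, here is a small sketch in Java. The class and method names are made up for illustration, and the "effective rate" is just total bytes divided by observed wall-clock time, which is my assumption about how to read the "something like 8 MB/sec" estimate.

```java
// Hypothetical helper, not part of Lucene or ES: re-derives the numbers above.
final class MergeTimeEstimate {
    /** Seconds needed purely to write `megabytes` at `mbPerSec`. */
    static double throttleSeconds(double megabytes, double mbPerSec) {
        return megabytes / mbPerSec;
    }

    /** Worst case: merge work and throttle pauses never overlap at all. */
    static double worstCaseSeconds(double unthrottledSec, double megabytes, double mbPerSec) {
        return unthrottledSec + throttleSeconds(megabytes, mbPerSec);
    }

    /** Average rate actually achieved over the whole merge. */
    static double effectiveMbPerSec(double megabytes, double observedSec) {
        return megabytes / observedSec;
    }

    public static void main(String[] args) {
        double mb = 611, rate = 50, unthrottled = 20.8, observed = 72.1;
        System.out.println("throttle floor (sec): " + throttleSeconds(mb, rate));
        System.out.println("worst case (sec):     " + worstCaseSeconds(unthrottled, mb, rate));
        System.out.println("effective MB/sec:     " + effectiveMbPerSec(mb, observed));
    }
}
```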
So I dug in and discovered one problem, I think caused by the
super.flush and then delegate.flush in BufferedChecksumIndexOutput,
where the RateLimiter is always alternately called first on 8192 bytes
then on 0 bytes. If I fix RateLimiter to just ignore those 0 bytes,
the merge time with 50 MB/sec throttle drops to 49.9 sec: better, but
still too long. (I think once we cut over to Lucene's checksums this
0 byte issue will be fixed?)
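A minimal sketch of the "ignore 0 bytes" idea, assuming a limiter with a single pause(bytes) entry point. CountingLimiter is a hypothetical stand-in that counts calls instead of sleeping; it is not the real RateLimiter, and the guard is shown as a wrapper rather than as the actual patch.

```java
// Hypothetical illustration only; names are not from Lucene or ES.
final class ZeroByteGuard {
    /** Stand-in limiter: records how often it is asked to pause. */
    static final class CountingLimiter {
        int calls = 0;
        long bytes = 0;
        void pause(long n) { calls++; bytes += n; }
    }

    /** Forward to the limiter only when there are bytes to account for. */
    static void pauseIfNonEmpty(CountingLimiter limiter, long n) {
        if (n > 0) {
            limiter.pause(n);
        }
    }

    public static void main(String[] args) {
        CountingLimiter unguarded = new CountingLimiter();
        CountingLimiter guarded = new CountingLimiter();
        // The alternating pattern described above: 8192 bytes, then 0 bytes.
        long[] pattern = {8192, 0, 8192, 0, 8192, 0};
        for (long n : pattern) {
            unguarded.pause(n);
            pauseIfNonEmpty(guarded, n);
        }
        System.out.println("unguarded calls: " + unguarded.calls);
        System.out.println("guarded calls:   " + guarded.calls);
    }
}
```

With the guard in place the limiter sees the same byte total but half as many calls for this pattern.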
System.nanoTime is actually quite costly, so I suspect the overhead of
just checking whether to pause, and of calling Thread.sleep, is way
too much when the pause time is small. So I changed SimpleRateLimiter to
just accumulate the incoming bytes and then once it crosses 1 msec
worth at the specified rate, invoke the pause logic.
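Roughly, the accumulate-then-pause approach looks like the sketch below. This is not the actual SimpleRateLimiter change: the class name, the 1 MB = 1024 * 1024 bytes convention, and the way the deadline is carried forward are all my assumptions for illustration.

```java
// Hypothetical sketch of an accumulating limiter; not Lucene's implementation.
final class AccumulatingRateLimiter {
    private final double bytesPerSec;
    private final long minPauseCheckBytes; // 1 msec worth of bytes at the target rate
    private long pendingBytes = 0;
    private long lastNS;

    AccumulatingRateLimiter(double mbPerSec) {
        this.bytesPerSec = mbPerSec * 1024 * 1024; // assumed MB convention
        this.minPauseCheckBytes = (long) (bytesPerSec / 1000.0);
        this.lastNS = System.nanoTime();
    }

    long minPauseCheckBytes() {
        return minPauseCheckBytes;
    }

    /** Accounts for `bytes`; returns the nanoseconds it decided to sleep (0 if none). */
    long pause(long bytes) {
        pendingBytes += bytes;
        if (pendingBytes < minPauseCheckBytes) {
            return 0; // cheap path: no clock read, no sleep
        }
        // Earliest time at which the accumulated bytes are "allowed" to be done.
        long targetNS = lastNS + (long) (1_000_000_000L * (pendingBytes / bytesPerSec));
        pendingBytes = 0;
        long now = System.nanoTime();
        long sleepNS = targetNS - now;
        // If we are already late, restart from now so no burst credit builds up.
        lastNS = Math.max(targetNS, now);
        if (sleepNS <= 0) {
            return 0;
        }
        try {
            Thread.sleep(sleepNS / 1_000_000, (int) (sleepNS % 1_000_000));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        return sleepNS;
    }

    public static void main(String[] args) {
        AccumulatingRateLimiter limiter = new AccumulatingRateLimiter(50);
        long start = System.nanoTime();
        for (int i = 0; i < 128; i++) { // 128 * 8192 bytes = 1 MB
            limiter.pause(8192);
        }
        System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
    }
}
```

At 50 MB/sec the threshold works out to about 52 KB, so with 8192-byte writes only roughly one call in seven touches the clock at all.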
This really improved it: now the merge takes 25.7 sec at 50 MB/sec
throttle, and 64.9 sec at 10 MB/sec throttle. These times seem correct.
I'll also open a Lucene issue to fix this, and make an XRateLimiter
for ES in the meantime.