Description
While investigating some test slowness/instability today, I found that we are running audit logging (including flushing logs to disk) on the transport/IO thread (the one(s) where the network .select calls happen).
This caused the select loop(s) to be blocked for an extended period of time every now and then, leading to periods where network IO was frozen. This is especially visible when running large bulk requests (in the test in question, a bulk request of size 10k) (#39575).
I would argue that this is a bug: blocking disk IO shouldn't happen on the network IO thread, as it can lead to unpredictable latency. One side effect (we observed this once) is that the introduced latency can block the IO thread for long enough to make SSL handshakes fail.
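To illustrate the general direction of a fix, here is a minimal sketch in plain Java of handing audit events off to a dedicated writer thread, so the calling (IO) thread only enqueues and returns. This is not the existing audit trail code: AsyncAuditSink, the "audit-writer" thread name, and the Consumer-based blocking writer are all hypothetical names chosen for the example. It also uses an unbounded queue, so a real change would need to decide on bounding and back-pressure.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/**
 * Hypothetical helper (not an Elasticsearch class): forwards audit events to a
 * single dedicated thread so the caller never performs the blocking write itself.
 */
final class AsyncAuditSink implements AutoCloseable {
    // Single writer thread keeps events in submission order.
    private final ExecutorService writer = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "audit-writer"); // assumed thread name
        t.setDaemon(true);
        return t;
    });

    // Stand-in for whatever actually appends to and flushes the audit log.
    private final Consumer<String> blockingWriter;

    AsyncAuditSink(Consumer<String> blockingWriter) {
        this.blockingWriter = blockingWriter;
    }

    /** Safe to call from the network IO thread: only enqueues the event. */
    void log(String event) {
        writer.execute(() -> blockingWriter.accept(event));
    }

    /** Stops accepting events and waits briefly for queued ones to be written. */
    @Override
    public void close() {
        writer.shutdown();
        try {
            writer.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With something along these lines, a slow disk flush delays only the writer thread, while the select loop keeps servicing sockets.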
It seems this was indirectly raised in #34321 but not investigated.
Stack trace from YourKit:
Stacks.txt