Audit Logging on the IO Thread Appears to Cause Instability #39658

Open
@original-brownbear

While investigating some test slowness and instability today, I found that we run audit logging (including flushing logs to disk) on the transport/IO thread (the one(s) where the network `.select` calls happen).
Every now and then this blocks the select loop(s) for an extended period, during which network IO is frozen. It is especially visible when running large bulk requests (the test in question sends a bulk request of size 10k) (#39575).

I would argue that this is a bug: blocking disk IO shouldn't happen on the network IO thread because it can lead to unpredictable latency. One possible side effect (we observed this once) is that the introduced latency blocks the IO thread for long enough to make SSL handshakes fail.
It seems this was indirectly raised in #34321 but not investigated.
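
To illustrate the general direction (not a proposed patch), here is a minimal sketch of handing audit events to a dedicated writer thread so that the calling thread only enqueues and never touches the disk. Everything in it is hypothetical: `AsyncAuditLog`, the `audit-writer` thread name, and the `Consumer<String>` sink that stands in for the blocking append-and-flush are assumptions for the example, not Elasticsearch classes or APIs.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: audit events are written by a dedicated thread, not by the caller. */
public class AsyncAuditLog implements AutoCloseable {

    // Single writer thread, so events are written in the order they were submitted.
    private final ExecutorService writer = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "audit-writer");
        t.setDaemon(true);
        return t;
    });

    // Stand-in for the slow, blocking part (append to file and flush).
    private final Consumer<String> sink;

    public AsyncAuditLog(Consumer<String> sink) {
        this.sink = sink;
    }

    /** Intended to be called from the IO thread: enqueues the event and returns immediately. */
    public void log(String event) {
        writer.execute(() -> sink.accept(event));
    }

    /** Stops accepting events and waits (up to 10 seconds) for queued ones to be written. */
    @Override
    public void close() {
        writer.shutdown();
        try {
            writer.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<String> written = new CopyOnWriteArrayList<>();
        // Simulate a slow disk flush of 50 ms per event.
        Consumer<String> slowSink = event -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            written.add(event);
        };

        long start = System.nanoTime();
        try (AsyncAuditLog audit = new AsyncAuditLog(slowSink)) {
            for (int i = 0; i < 5; i++) {
                audit.log("event-" + i);
            }
            long enqueueMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println("caller spent " + enqueueMillis + " ms enqueueing");
        } // close() waits here for the writer thread to drain the queue
        System.out.println("events written: " + written.size());
    }
}
```

The executor used here has an unbounded queue, so a real implementation would also have to decide what happens when the disk cannot keep up (bounded queue, dropping, or back-pressure) and whether losing queued audit events on a crash is acceptable.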

Stack trace from YourKit:
Stacks.txt

cc @DaveCTurner @dimitris-athanasiou
