Audit Logging on the IO Thread Appears to Cause Instability

While investigating some test slowness/instability today, I found that we are running audit logging (including flushing logs to disk) on the transport/IO thread (the one(s) where the network `.select` calls happen).
This causes the select loop(s) to block for an extended period every now and then, leading to periods where network IO is frozen. This is especially visible when running large bulk requests (in the test in question, a bulk request of size 10k), see https://github.com/elastic/elasticsearch/issues/39575.

I would argue that this is a bug: blocking disk IO shouldn't happen on the network IO thread, as it can lead to unpredictable latency. One side effect (we observed this once) is that the introduced latency can block the IO thread for long enough to make SSL handshakes fail.
It seems this was indirectly raised in https://github.com/elastic/elasticsearch/issues/34321 but not investigated.
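
To illustrate the general direction (not a proposal for the actual fix), here is a minimal sketch of handing audit events off to a dedicated writer thread so that the calling thread never waits on disk. Everything in it is hypothetical: `AsyncAuditSketch`, the `blockingSink` parameter, and the `audit-writer` thread name are made up for illustration and are not the real audit trail code. The slow disk flush is simulated with a `Thread.sleep`.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch: queue audit events for a dedicated thread instead of writing inline. */
public class AsyncAuditSketch implements AutoCloseable {

    // Single writer thread, so events are written in submission order.
    private final ExecutorService auditExecutor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "audit-writer");
        t.setDaemon(true);
        return t;
    });

    // Stand-in for the blocking "append and flush to disk" step (an assumption).
    private final Consumer<String> blockingSink;

    public AsyncAuditSketch(Consumer<String> blockingSink) {
        this.blockingSink = blockingSink;
    }

    /** Called from the IO thread: only enqueues, never touches the disk. */
    public void audit(String event) {
        auditExecutor.execute(() -> blockingSink.accept(event));
    }

    /** Stops accepting events and waits (bounded) for queued ones to be written. */
    @Override
    public void close() {
        auditExecutor.shutdown();
        try {
            auditExecutor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        List<String> written = new CopyOnWriteArrayList<>();
        // Simulated slow sink: each write "flushes" for 50 ms.
        Consumer<String> slowSink = event -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            written.add(event);
        };

        long start = System.nanoTime();
        try (AsyncAuditSketch audit = new AsyncAuditSketch(slowSink)) {
            for (int i = 0; i < 5; i++) {
                audit.audit("event-" + i);
            }
            long enqueueMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.println("caller spent " + enqueueMillis + " ms enqueueing");
        }
        System.out.println("events written after close: " + written.size());
    }
}
```

The trade-offs are real: the executor's queue here is unbounded, and events still in the queue would be lost if the process died, so an actual change would need to decide on backpressure and durability guarantees for audit events.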

Stacktrace from Yourkit:
[Stacks.txt](https://github.com/elastic/elasticsearch/files/2926906/Stacks.txt)

cc @DaveCTurner @dimitris-athanasiou 

Audit Logging on the IO Thread Appears to Cause Instability #39658
