Description
If the BulkProcessor results in failed bulk requests, they will be retried via the `RetryHandler`. In versions of Elasticsearch prior to 7.3.0 this can result in a deadlock.

The deadlock can happen because the `Scheduler` is shared between the `Flush` and the `Retry` logic, and it is configured with a single worker thread, which can end up blocked in the `Flush` method. The `Flush` method is guarded by `synchronized (BulkProcessor.this)`, and the `internalAdd(..)` method is guarded by the same monitor lock. What can happen is that a bulk request comes in through `internalAdd`, which obtains the lock; the bulk request is sent and a failure occurs, so the retry logic kicks in. The scheduler thread is blocked in the `Flush` method because `internalAdd` holds the synchronized block, so when the retry attempts to schedule itself it cannot, because `Flush` is occupying the scheduler's only worker thread. At this point `Flush` cannot continue because it is waiting on `internalAdd` to finish, `internalAdd` cannot finish because it is waiting on `Retry`, and `Retry` cannot finish because it is waiting for a scheduler thread that it cannot obtain while that thread is waiting on `Flush` to finish.
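To make the cycle concrete, here is a minimal, self-contained Java sketch of the same lock ordering. The class and method names (`flush`, `internalAdd`, the single-threaded scheduler) only mirror the BulkProcessor roles; this is not the Elasticsearch source, and the `Future.get()` stands in for the wait the real bulk handler performs on the in-flight request.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BulkDeadlockSketch {
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(1); // one worker, like BulkProcessor's scheduler

    // Periodic flush: runs on the scheduler thread and needs the monitor.
    void flush() {
        synchronized (this) { // blocks here while internalAdd holds the lock
            System.out.println("flush executed");
        }
    }

    // Add path: holds the same monitor while waiting for a retry that can
    // only run on the scheduler's single worker thread.
    void internalAdd() throws Exception {
        synchronized (this) {
            Future<?> retry = scheduler.schedule(
                    () -> System.out.println("retry executed"), 100, TimeUnit.MILLISECONDS);
            retry.get(); // never returns: the only worker thread is stuck in flush()
        }
    }

    public static void main(String[] args) throws Exception {
        BulkDeadlockSketch p = new BulkDeadlockSketch();
        synchronized (p) {                    // simulate internalAdd already holding the lock
            p.scheduler.schedule(p::flush, 0, TimeUnit.MILLISECONDS);
            Thread.sleep(200);                // give flush() time to block on the monitor
            p.internalAdd();                  // deadlock: the program hangs here
        }
    }
}
```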
The change in 7.3.0 fixes this issue because it is much more selective about exactly what is locked, and no longer wraps the execution of the bulk request inside the lock.
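For reference, the safer shape looks roughly like the following hypothetical sketch (this is not the actual 7.3.0 code): the lock only guards swapping out the pending buffer, and the request execution, including any retry scheduling, happens outside of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the "narrow the critical section" pattern, not the real BulkProcessor.
class NarrowLockSketch {
    private final Object mutex = new Object();
    private final AtomicLong executionIdGen = new AtomicLong();
    private List<String> pending = new ArrayList<>();

    void add(String doc) {
        List<String> toExecute = null;
        long executionId = 0;
        synchronized (mutex) {               // lock only covers the buffer swap
            pending.add(doc);
            if (pending.size() >= 2) {       // pretend bulkActions == 2
                toExecute = pending;
                pending = new ArrayList<>();
                executionId = executionIdGen.incrementAndGet();
            }
        }
        if (toExecute != null) {
            execute(toExecute, executionId); // runs (and may retry) outside the lock,
        }                                    // so it can never block a flush on the monitor
    }

    private void execute(List<String> docs, long executionId) {
        System.out.println("bulk " + executionId + ": " + docs);
    }
}
```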
Until 7.3.0 the only workaround is to set the `BackoffPolicy` to `BackoffPolicy.noBackoff()` so that the `Retry` does not kick in. The default backoff is `BackoffPolicy.exponentialBackoff()`, which is what Watcher uses; it is not configurable there and is therefore susceptible to this bug pre-7.3.0.
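If you build the BulkProcessor yourself, disabling the backoff looks roughly like the sketch below. This assumes the 7.x high level REST client; the no-op listener is only there to make the example compile.

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

class NoBackoffBulkProcessor {
    static BulkProcessor build(RestHighLevelClient client) {
        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest request) {}
            @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) {}
            @Override public void afterBulk(long id, BulkRequest request, Throwable failure) {}
        };
        return BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
            .setBackoffPolicy(BackoffPolicy.noBackoff()) // no Retry, so nothing is ever scheduled for retry
            .build();
    }
}
```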
This is related to #41418, but the fix for that does not fix this issue. This issue is fixed by #41451 in 7.3.0.
EDIT: see below for a case where this can still happen in [7.3.0-7.5.0) (if the flush documents themselves fail)