Description
If the BulkProcessor results in failed bulk requests, they will be retried via the `RetryHandler`. In versions of Elasticsearch prior to 7.3.0 this can result in a deadlock.

The deadlock can happen because the `Scheduler` is shared between the `Flush` and the `Retry` logic, and it is configured with a single worker thread, which can end up blocked in the `Flush` method. The `Flush` method is guarded by `synchronized (BulkProcessor.this)`, and the `internalAdd(..)` method is guarded by the same monitor lock. What can happen is that a bulk request comes in through `internalAdd`, which obtains the lock; the bulk request is sent and a failure occurs, so the retry logic kicks in. The scheduler thread is blocked in the `Flush` method because `internalAdd` holds the synchronized block, so when the retry attempts to schedule itself it cannot, because `Flush` is occupying the scheduler's only worker thread. At this point `Flush` cannot continue because it is waiting on `internalAdd` to finish, `internalAdd` cannot finish because it is waiting on `Retry`, and `Retry` cannot finish because it is waiting for a scheduler thread that it cannot obtain while that thread is waiting on `Flush` to finish.
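To make the cycle concrete, here is a minimal, self-contained Java sketch of the same lock ordering. The class and method names (`flush`, `internalAdd`, the single-threaded scheduler) only mirror the BulkProcessor roles; this is not the Elasticsearch source, and the `Future.get()` stands in for the wait the real bulk handler performs on the in-flight request.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class BulkDeadlockSketch {
    private final ScheduledExecutorService scheduler =
            Executors.newScheduledThreadPool(1); // one worker, like BulkProcessor's scheduler

    // Periodic flush: runs on the scheduler thread and needs the monitor.
    void flush() {
        synchronized (this) { // blocks here while internalAdd holds the lock
            System.out.println("flush executed");
        }
    }

    // Add path: holds the same monitor while waiting for a retry that can
    // only run on the scheduler's single worker thread.
    void internalAdd() throws Exception {
        synchronized (this) {
            Future<?> retry = scheduler.schedule(
                    () -> System.out.println("retry executed"), 100, TimeUnit.MILLISECONDS);
            retry.get(); // never returns: the only worker thread is stuck in flush()
        }
    }

    public static void main(String[] args) throws Exception {
        BulkDeadlockSketch p = new BulkDeadlockSketch();
        synchronized (p) {                    // simulate internalAdd already holding the lock
            p.scheduler.schedule(p::flush, 0, TimeUnit.MILLISECONDS);
            Thread.sleep(200);                // give flush() time to block on the monitor
            p.internalAdd();                  // deadlock: the program hangs here
        }
    }
}
```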
The change in 7.3.0 fixes this issue because it is much more selective about exactly what is locked, and no longer wraps the execution of the bulk request inside the lock.
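For reference, the safer shape looks roughly like the following hypothetical sketch (this is not the actual 7.3.0 code): the lock only guards swapping out the pending buffer, and the request execution, including any retry scheduling, happens outside of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the "narrow the critical section" pattern, not the real BulkProcessor.
class NarrowLockSketch {
    private final Object mutex = new Object();
    private final AtomicLong executionIdGen = new AtomicLong();
    private List<String> pending = new ArrayList<>();

    void add(String doc) {
        List<String> toExecute = null;
        long executionId = 0;
        synchronized (mutex) {               // lock only covers the buffer swap
            pending.add(doc);
            if (pending.size() >= 2) {       // pretend bulkActions == 2
                toExecute = pending;
                pending = new ArrayList<>();
                executionId = executionIdGen.incrementAndGet();
            }
        }
        if (toExecute != null) {
            execute(toExecute, executionId); // runs (and may retry) outside the lock,
        }                                    // so it can never block a flush on the monitor
    }

    private void execute(List<String> docs, long executionId) {
        System.out.println("bulk " + executionId + ": " + docs);
    }
}
```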
Until 7.3.0 the only workaround is to set the `BackoffPolicy` to `BackoffPolicy.noBackoff()` so that the `Retry` does not kick in. The default backoff is `BackoffPolicy.exponentialBackoff()`, which is what Watcher uses; it is not configurable there and is therefore susceptible to this bug pre-7.3.0.
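If you build the BulkProcessor yourself, disabling the backoff looks roughly like the sketch below. This assumes the 7.x high level REST client; the no-op listener is only there to make the example compile.

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;

class NoBackoffBulkProcessor {
    static BulkProcessor build(RestHighLevelClient client) {
        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override public void beforeBulk(long id, BulkRequest request) {}
            @Override public void afterBulk(long id, BulkRequest request, BulkResponse response) {}
            @Override public void afterBulk(long id, BulkRequest request, Throwable failure) {}
        };
        return BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
            .setBackoffPolicy(BackoffPolicy.noBackoff()) // no Retry, so nothing is ever scheduled for retry
            .build();
    }
}
```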
This is related to #41418, but the fix for that does not fix this issue. This issue is fixed by #41451 in 7.3.0.
EDIT: see below for a case where this can still happen in [7.3.0-7.5.0) (if the flush documents themselves fail)