BulkProcessor can deadlock when bulk requests fail #47599

Closed
@jakelandis

Description

If the BulkProcessor results in failed bulk requests, they will be retried via the RetryHandler. In versions of Elasticsearch prior to 7.3.0 this can result in a deadlock.

The deadlock arises because a single Scheduler is shared between the Flush and Retry logic, and that Scheduler is configured with 1 worker thread, which can be blocked in the Flush method. The Flush method is guarded by synchronized (BulkProcessor.this), and the internalAdd(..) method is guarded by the same monitor lock. What can happen is that a bulk request comes in through internalAdd, which obtains the lock; the bulk request is sent and a failure occurs, so the retry logic kicks in. Meanwhile the scheduler thread is blocked in the Flush method because internalAdd holds the monitor, so when the Retry attempts to schedule a retry it cannot run, because Flush is occupying the scheduler's only worker thread. The result is a cycle: Flush cannot continue because it is waiting on internalAdd to finish, internalAdd cannot finish because it is waiting on Retry, and Retry cannot finish because it is waiting on a scheduler thread that is held by Flush.
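The cycle can be reproduced with plain JDK classes. The following is a minimal sketch, not the BulkProcessor code: the class name, the retryCompletes method and the latches are hypothetical, and the wait is bounded at 500 ms so the demo terminates, whereas the real wait is unbounded. The calling thread plays the role of internalAdd, the first scheduled task plays Flush, and the second plays the Retry.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class SingleThreadSchedulerDeadlockDemo {

    /**
     * Returns true if the scheduled "retry" ran before the caller gave up waiting.
     * waitForRetryInsideLock = true mimics the pre-7.3.0 behaviour described above;
     * false mimics waiting only after the monitor has been released.
     */
    static boolean retryCompletes(boolean waitForRetryInsideLock) {
        // One worker thread, shared by "flush" and "retry" (daemon so the JVM can always exit).
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1, r -> {
            Thread t = new Thread(r, "scheduler-worker");
            t.setDaemon(true);
            return t;
        });
        Object monitor = new Object();                 // stands in for BulkProcessor.this
        CountDownLatch flushStarted = new CountDownLatch(1);
        CountDownLatch retryRan = new CountDownLatch(1);
        try {
            boolean completed = false;
            synchronized (monitor) {                   // "internalAdd" obtains the lock
                // "Flush" occupies the only worker and blocks on the same monitor.
                scheduler.execute(() -> {
                    flushStarted.countDown();
                    synchronized (monitor) {
                        // nothing to do; acquiring the monitor is the point
                    }
                });
                flushStarted.await();
                // The bulk request "failed": schedule a retry on the same scheduler.
                scheduler.schedule(retryRan::countDown, 10, TimeUnit.MILLISECONDS);
                if (waitForRetryInsideLock) {
                    // Waiting while holding the monitor: the retry cannot get a thread.
                    completed = retryRan.await(500, TimeUnit.MILLISECONDS);
                }
            }
            if (!waitForRetryInsideLock) {
                // Monitor released first: flush finishes, then the retry runs.
                completed = retryRan.await(5, TimeUnit.SECONDS);
            }
            return completed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("wait inside lock:  retry ran = " + retryCompletes(true));
        System.out.println("wait outside lock: retry ran = " + retryCompletes(false));
    }
}
```

With the wait inside the lock the retry never gets a worker thread, so the call times out. With the wait moved outside the lock the flush task acquires the monitor, finishes, and frees the worker for the retry.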

The change in 7.3.0 fixes this issue because it is much more selective about exactly what is locked, and no longer wraps the execution of the bulk request inside the lock.

Until 7.3.0 the only workaround is to set the BackoffPolicy to BackoffPolicy.noBackoff() so that the Retry does not kick in. The default backoff is BackoffPolicy.exponentialBackoff(), which is used by Watcher, where it is not configurable, so Watcher is susceptible to this bug prior to 7.3.0.

This is related to #41418, but the fix for that does not fix this issue. This issue is fixed by #41451 in 7.3.0.

EDIT: see below for a case where this can still happen in [7.3.0-7.5.0) (if the flush documents themselves fail)
