Description
Elasticsearch version: 2.4.0
Elasticsearch java rest client version: 5.2.2
Plugins installed: []
JVM version (java -version): 1.8.0_131
OS version (uname -a if on a Unix-like system): Ubuntu 16.04
Description of the problem including expected versus actual behavior:
The default SniffOnFailureListener on the REST client blocks the HttpAsyncClient reactor thread when a request encounters a java.net.ConnectException.
Steps to reproduce:
- Have two es nodes and let sniffer pick them up
- Shut down one node
- The client tries to connect to that node --> fails --> tries to sniff and hangs until maxRetryTimeoutMillis elapses
The failed callback triggers the sniffer: https://github.com/elastic/elasticsearch/blob/master/client/rest/src/main/java/org/elasticsearch/client/RestClient.java#L374
However, the failed callback is handled by the reactor thread of the underlying HttpAsyncClient. The sniffer does a blocking performRequest using the same client instance, and the HttpClient can't handle that request because the reactor thread is blocked. It's effectively a deadlock until the SyncResponseListener times out after maxRetryTimeoutMillis, and no requests can be served at all during this period. 😰
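
To illustrate the mechanism without a cluster, here is a minimal sketch that uses only the JDK. It is not the RestClient code: a single-threaded executor stands in for the reactor thread, and a nested `submit(...).get(...)` stands in for the sniffer's blocking performRequest. The class and method names (`ReactorDeadlockSketch`, `nestedCallTimesOut`) are made up for this example.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class ReactorDeadlockSketch {

    // Returns true if the nested blocking call timed out, i.e. the single
    // "reactor" thread was waiting on work that only it could execute.
    static boolean nestedCallTimesOut(long timeoutMillis) throws Exception {
        // Assumption: one thread stands in for the HttpAsyncClient reactor thread.
        ExecutorService reactor = Executors.newSingleThreadExecutor();
        try {
            Future<Boolean> failedCallback = reactor.submit(() -> {
                // Stand-in for the sniffer's blocking request: it is queued on the
                // same thread that is currently busy running this callback.
                Future<String> sniff = reactor.submit(() -> "nodes");
                try {
                    sniff.get(timeoutMillis, TimeUnit.MILLISECONDS);
                    return false; // would mean the nested request was served
                } catch (TimeoutException e) {
                    return true;  // analogous to waiting out maxRetryTimeoutMillis
                }
            });
            return failedCallback.get();
        } finally {
            reactor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("nested call timed out: " + nestedCallTimesOut(200));
    }
}
```

The nested task can only start once the callback returns, and the callback only returns once the nested task finishes or the timeout fires, so the timeout always wins.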
I found a similar issue, https://issues.apache.org/jira/browse/HTTPCLIENT-1805, where the suggestion is to avoid potentially blocking or long-running operations in the callbacks, and more so in the failed callback, since it could block the reactor thread.
I guess the solution would be to trigger the retries as well as the sniffer on a separate thread pool internal to the RestClient, so that the HttpClient's dispatcher and reactor threads are freed up asap.
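
A rough sketch of that idea, again JDK-only and not a patch against RestClient: the failed callback hands the blocking sniff to a separate executor and returns immediately, so the single "reactor" thread is free to serve the nested request. `OffloadedSniffSketch`, `sniffCompletes`, and `sniffPool` are hypothetical names for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class OffloadedSniffSketch {

    // Returns true if the sniff finished within the timeout once it runs off the reactor thread.
    static boolean sniffCompletes(long timeoutMillis) throws Exception {
        // Assumption: one thread stands in for the HttpAsyncClient reactor thread.
        ExecutorService reactor = Executors.newSingleThreadExecutor();
        // Hypothetical pool internal to the client, used only for retries and sniffing.
        ExecutorService sniffPool = Executors.newSingleThreadExecutor();
        CountDownLatch sniffDone = new CountDownLatch(1);
        try {
            Future<?> failedCallback = reactor.submit(() -> {
                // Hand the blocking work off and return right away.
                sniffPool.execute(() -> {
                    try {
                        // Blocking is harmless here: the reactor thread is free to run this.
                        reactor.submit(() -> "nodes").get(timeoutMillis, TimeUnit.MILLISECONDS);
                        sniffDone.countDown();
                    } catch (Exception e) {
                        // Timed out or interrupted: leave the latch closed.
                    }
                });
            });
            failedCallback.get(); // the callback itself no longer blocks
            return sniffDone.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } finally {
            sniffPool.shutdownNow();
            reactor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("sniff completed: " + sniffCompletes(2000));
    }
}
```

The only change from the blocking case is which thread waits; the reactor thread never waits on itself.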