Closed
Description
Currently we do not have circuit breaker support for search requests executed on the coordinating node. We have multi-phase reduction, which should help avoid OOMs, but it is still possible for an abusive query to take a node down.
A recent example OOM was caused by date histograms with 5-minute intervals executed across many time-based indices. None of the data nodes tripped a circuit breaker because each saw only a small part of the final result. The multi-phase reduction did nothing to reduce the final number of buckets required, and the final OOM occurred while rendering results in `toXContent`. This scenario was exacerbated by a top-level terms agg on `hostname`, under which the date histograms were nested.
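
To make the scale concrete, here is a rough back-of-the-envelope sketch of the bucket arithmetic for a terms agg with a nested date histogram. The host count, time range, and daily-index layout are hypothetical numbers chosen for illustration, not figures from the incident, and `estimate_buckets` is an illustrative helper, not an Elasticsearch API.

```python
from datetime import timedelta


def estimate_buckets(num_hosts: int, time_range: timedelta, interval: timedelta) -> int:
    """Upper bound on leaf buckets for terms(hostname) > date_histogram(interval).

    Assumes every host has data in every interval, which is the worst case.
    """
    # Ceiling division: a partially covered interval still yields a bucket.
    histogram_buckets = -(-time_range // interval)
    return num_hosts * histogram_buckets


# Hypothetical workload: 1000 hosts, 5-minute intervals, daily indices, 30 days.
interval = timedelta(minutes=5)

# What a data node holding one daily index has to build.
per_index = estimate_buckets(1000, timedelta(days=1), interval)

# What the coordinating node has to hold and render after the final reduce.
coordinating = estimate_buckets(1000, timedelta(days=30), interval)

print(f"per daily index:   {per_index:,} buckets")
print(f"coordinating node: {coordinating:,} buckets")
```

Under these assumptions each daily index contributes 288,000 buckets, while the coordinating node ends up with 8,640,000. Because the daily indices cover disjoint time ranges, reduction merges almost nothing, so the final result is roughly the sum of the per-index results.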