Description
Computing the completion stats involves walking every field of every segment of every relevant shard, looking for completion fields. By default the seemingly-innocuous GET _stats API does this for every shard in the cluster. I've seen more than a few cases where an external monitoring system is hitting an overly-broad stats API hard enough that the cluster can't keep up. The consequence is that these stats requests pile up in the management threadpool and interfere with the other users of that threadpool.
As far as I can tell, these stats only change on a refresh. In most cases this means they do not change much at all, so I think we can improve the situation by caching these stats between refreshes.
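To make the caching idea concrete, here is a minimal sketch, assuming the stats really do only change on a refresh. `RefreshCachedStats` and its methods are hypothetical names for illustration, not existing Elasticsearch classes, and the `loader` stands in for the expensive walk over segments and fields. I'd expect `onRefresh()` to be driven by something like Lucene's `ReferenceManager.RefreshListener`, but the wiring is left out here.

```java
import java.util.function.Supplier;

/**
 * Hypothetical helper (not an Elasticsearch class): caches an expensive value
 * and recomputes it only after the next refresh notification.
 */
final class RefreshCachedStats<T> {
    // Stand-in for the expensive computation, e.g. walking all segments and fields.
    private final Supplier<T> loader;
    // null means "not computed since the last refresh".
    private T cached;

    RefreshCachedStats(Supplier<T> loader) {
        this.loader = loader;
    }

    /**
     * Returns the cached value, invoking the loader at most once per refresh.
     * Concurrent callers wait on the lock instead of repeating the walk.
     * Assumes the loader never returns null.
     */
    synchronized T get() {
        if (cached == null) {
            cached = loader.get();
        }
        return cached;
    }

    /** Call from a refresh hook: drops the cached value so the next get() recomputes. */
    synchronized void onRefresh() {
        cached = null;
    }
}
```

With this shape, any number of stats requests between two refreshes cost a single walk, and holding the lock during the computation means a burst of concurrent requests queues behind one computation rather than each starting its own.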
I also note that in #33847 we changed the source of these stats from the external searcher to the internal one. I'm not sure why - external seems more appropriate to me, and would help with the caching since external refreshes may be very infrequent indeed.
Relates:
- completionStats are slow #36773
- https://discuss.elastic.co/t/bulk-indexing-causes-management-threadpool-queue-to-skyrocket/217852/7
- https://discuss.elastic.co/t/monitor-tasks-piling-up-in-master-node-6-8-3/212547
- https://discuss.elastic.co/t/concurrent-indices-stats-requests-cause-cluster-to-go-red/161385
- https://discuss.elastic.co/t/high-cpu-usage-while-bulk-indexing/135023/10