Description
Requirement
As part of making #6796 more modular, there is a general need to limit analysis of documents to a subset of the top-matching results, based on:
- Quantity - some aggregations can be expensive to run on many documents, so we need to cap the number of documents considered
- Quality - the result set of a "fuzzy" search has a long tail of low-quality results which we want to ignore
An example that requires capping is a recommendation problem like this, where the query can yield a large number of results and we want to cap the time spent looking up background stats from disk by fixing the number of results we consider.
Solution
The aggregation would be placed as a parent aggregation to the child aggs that need filtering.
{
  "query": {
    "terms": {
      "movie_id": [
        46970,
        3726
      ]
    }
  },
  "aggs": {
    "qualityFilter": {
      "top_docs_filter": {
        "shard_size": 1000
      },
      "aggregations": {
        "recommendations": {
          "significant_terms": {
            "field": "movie_id"
          }
        }
      }
    }
  }
}
The assumption is we would always rank the top N selections on a shard based on the query score but there are various parameters we can use to control "N":
- An absolute number e.g. "shard_size" : 100
- An absolute score e.g. "min_score" : 3.5
- A relative score, i.e. expressed as a percentage of max score: "relative_to_max_score" : "50%"
All of these options could be used simultaneously as an OR rule for capping results.
The relative-score option ("relative_to_max_score") would need to use a priority queue whose size is capped, either by a large sensible default or by the "shard_size" setting if one is given.
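To make the interaction of the three caps concrete, here is a minimal Python sketch of the per-shard selection. It is an illustration under assumptions, not the proposed implementation: the function name top_docs_filter, the default_cap parameter and its value are hypothetical, the relative score is passed as a fraction (0.5 rather than "50%"), and the "OR rule" is read as "a document is dropped if it fails any one of the configured caps".

```python
import heapq


def top_docs_filter(scored_docs, shard_size=None, min_score=None,
                    relative_to_max_score=None, default_cap=10_000):
    """Return the (doc_id, score) pairs kept on one shard, best score first.

    scored_docs is an iterable of (doc_id, score) pairs. default_cap is a
    hypothetical "large sensible default" used when shard_size is not set.
    """
    cap = shard_size if shard_size is not None else default_cap
    if cap < 1:
        return []

    # Bounded min-heap of (score, doc_id): heap[0] is the weakest kept doc.
    heap = []
    for doc_id, score in scored_docs:
        # Absolute score cap: reject immediately.
        if min_score is not None and score < min_score:
            continue
        if len(heap) < cap:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            # Queue is full: evict the weakest doc in favour of this one.
            heapq.heapreplace(heap, (score, doc_id))

    if not heap:
        return []

    kept = sorted(heap, reverse=True)
    # Relative cap: the max score is only known once every doc has been seen,
    # which is why candidates have to be buffered in the capped queue.
    if relative_to_max_score is not None:
        threshold = kept[0][0] * relative_to_max_score
        kept = [(score, doc_id) for score, doc_id in kept if score >= threshold]

    return [(doc_id, score) for score, doc_id in kept]
```

For example, with scores 5.0, 4.0, 2.0, 1.0 and 0.5, setting shard_size=3 keeps the first three documents, while min_score=3.5 or relative_to_max_score=0.5 each keep only the first two.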
Concerns
There is a general concern over whether this feature is a new top-level agg which is a documented part of the Query DSL (as proposed here) or an implementation detail for existing aggs best hidden from the DSL and end users. The concerns with this proposal of use as a top-level agg are:
Return value formatting
While it is a potentially useful tool during execution of a query, it is questionable whether we would want to see this aggregation as a nested container in the search results. If we do choose to keep it as a container in the results, we could return some stats, e.g. the number of rejected docs.
Dangerous if forgotten
If you have a computationally expensive aggregation (e.g. a significant terms agg that needs to hit disk randomly) then you could argue it is a mistake to rely on end users configuring a separate parent agg to filter the volume of hits it processes. In PR #6796 I made a conscious decision to couple the quality filter settings directly with the settings that enable use of the expensive disk-access mode. That way there is less chance of running a query-from-hell.
Clash with top_hits agg
If we go down the route of making this filter a stand-alone agg then for consistency's sake the existing top_hits agg should be changed to be nested under this. In fact, we can refactor its sort and pagination features into the functionality proposed by this filtering agg.
Conclusion
We need to decide if this functionality needs breaking out as a new agg feature in the DSL.
If we choose not to do this, we need to at least have some DSL consistency and reuse of the sampling logic that is defined on things like the top_hits agg and significant_terms.