Description
I'd like to propose a `rare_terms` aggregation that collects the set of terms which have `n` or fewer occurrences. This is essentially the opposite of "top-n" queries.

The motivation is that today, the only way to accurately collect the "lowest-n" is to execute a terms aggregation with `size: INT_MAX` so that all terms are collected, then filter client-side or with a `bucket_selector`. This is obviously expensive: it requires the memory overhead of all the terms, plus executing leaf aggregations on all terms despite only caring about the "lowest-n". Sorting by count ascending on the terms agg without setting `size: INT_MAX` is trappy and we'd love to remove it :)
Algorithm
The algorithm uses a map and a bloom filter. The map tracks the total set of rare terms. The bloom filter tracks the set of terms that have been seen more than `max_doc_count` times.
Pseudocode:
```
Map<String, Integer> terms = new HashMap<>();
BloomFilter bloom = new BloomFilter();

function collect(term) {
    // Check to see if this term is already in our bloom filter
    if (bloom.get(term) == false) {
        Integer value = terms.get(term);
        if (value == null) {
            // We've never seen this term before, initialize its counter to 1
            terms.put(term, 1);
        } else {
            value += 1;
            if (value <= max_doc_count) {
                // We've seen this term before, but it is still within the
                // threshold, so just update its counter
                terms.put(term, value);
            } else {
                // Otherwise we've breached the threshold: remove the term
                // from the map and add it to the bloom filter
                terms.remove(term);
                bloom.add(term);
            }
        }
    }
    /*
    There is no else clause, because inclusion in the bloom filter means
    the term has been seen more than max_doc_count times and so is no
    longer "rare".

    Note that due to the nature of bloom filters, there will be some
    false positives, which translates into not adding a term to the "rare"
    list when it should be there.
    */
}
```
Note: this ignores all the technical bits about replaying cached docs, etc.
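As a concrete trace of the pseudocode (assuming `max_doc_count: 1`; the term values are made up):

```java
collect("server-01"); // terms = {server-01: 1}                bloom = {}
collect("server-02"); // terms = {server-01: 1, server-02: 1}  bloom = {}
collect("server-01"); // second occurrence breaches max_doc_count:
                      // terms = {server-02: 1}                bloom = {server-01}
// Only server-02 ends up reported as "rare".
```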
Some potential extensions:

- If `max_doc_count: 1`, we don't need to store the count and could instead just use a Set to further reduce the size
- We could potentially be clever and not use a full int for each counter and instead do some bit-packing (e.g. `max_doc_count: 16` only needs 4 bits), but we wouldn't be able to use a nice, convenient map :) (a rough sketch follows this list)
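As a rough sketch of what the bit-packing idea could look like (illustrative only, not part of the proposal; since 4 bits hold the values 0-15, a count in the range 1-16 can be stored as `count - 1`):

```java
// Illustrative sketch: 4-bit counters packed into a long[], sixteen per long.
// A count of 1-16 is stored as (count - 1) so it fits in 4 bits.
class PackedCounters {
    private final long[] words;

    PackedCounters(int numCounters) {
        words = new long[(numCounters + 15) / 16];
    }

    // Returns the count (1-16) stored at `index` (assumes it was set before)
    int get(int index) {
        int shift = (index % 16) * 4;
        return (int) ((words[index / 16] >>> shift) & 0xFL) + 1;
    }

    // Stores a count in the range 1-16 at `index`
    void set(int index, int count) {
        int shift = (index % 16) * 4;
        long cleared = words[index / 16] & ~(0xFL << shift);
        words[index / 16] = cleared | ((long) (count - 1) << shift);
    }
}
```

The catch, as noted above, is that this gives us a flat array indexed by ordinal rather than a nice, convenient map keyed by term.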
Properties
- Size is linear in the number of "rare" terms, plus some constant overhead.
- Antagonistic input (where the majority of terms are "rare") is still undeniably expensive, but better than a terms aggregation.
- The worst case, where all terms are rare, is basically equivalent to a terms agg with `size: INT_MAX`.
- Best we can tell, in all cases it is no worse than a terms aggregation in space or time.
- Must be executed as a deferred bucket agg (e.g. `collect_mode: breadth_first`)
  - As a benefit, leaf aggregations are only executed on the actually rare terms
- Resulting rare terms are guaranteed to satisfy `max_doc_count` (i.e. no false positives)
- Some terms may be rare but not represented in the list (i.e. there are false negatives; see the sketch after this list)
- The `max_doc_count` threshold is configurable, although the tradeoff between this and the `terms` agg becomes less clear-cut as `max_doc_count` increases (e.g. `max_doc_count: 1000` will likely include a large portion of a zipfian distribution, so a `terms` agg coming from the other direction may be better)
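To make the false-positive/false-negative tradeoff concrete, here is a minimal sketch using Guava's `BloomFilter` (purely illustrative; the actual implementation would presumably use an Elasticsearch-internal structure, and the sizing numbers are made up):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

public class BloomFilterDemo {
    public static void main(String[] args) {
        BloomFilter<String> bloom = BloomFilter.create(
                Funnels.stringFunnel(StandardCharsets.UTF_8),
                1_000_000, // expected number of non-rare terms (made up)
                0.03);     // target false-positive probability (made up)

        bloom.put("common-term");

        // Always true once added: terms over the threshold are reliably
        // excluded, which is why the result has no false positives.
        System.out.println(bloom.mightContain("common-term"));

        // Usually false, but occasionally a false positive; that term would
        // then be skipped by collect() and silently dropped from the rare
        // list (a false negative in the results).
        System.out.println(bloom.mightContain("rare-term"));
    }
}
```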
Potentially this is sufficient that #17614 can be unblocked, and we can make the terms agg less trappy while providing better accuracy / space / time on rare term aggregations.
I'm going to start poking at an implementation.
/cc @colings86 @jpountz