
"Rare Terms" aggregation #20586

Closed

Description

@polyfractal

I'd like to propose a rare_terms aggregation that collects the set of terms which have n or fewer occurrences. This is essentially the opposite of "top-n" queries.

The motivation is that today, the only way to accurately collect the "lowest-n" is to execute a terms aggregation with size: INT_MAX so that all terms are collected, then filter client-side or with a bucket_selector. This is obviously expensive: it incurs the memory overhead of holding every term, and leaf aggregations execute on every term even though we only care about the "lowest-n".

Sorting by count ascending on the terms agg without setting size INT_MAX is trappy and we'd love to remove it :)

Algorithm

The algorithm uses a map and a bloom filter. The map tracks the counts of the candidate "rare" terms. The bloom filter tracks the set of terms whose count has exceeded max_doc_count.

Pseudocode:

Map<String, Integer> terms = new Map();
BloomFilter bloom = new BloomFilter();

function collect(term) {

  // Check to see if this term is already in our bloom filter
  if (bloom.get(term) == false) {
    Integer value = terms.get(term);
    if (value == null) {

      // We've never seen this term before, initialize its counter to 1
      terms.put(term, 1);
    } else {

      value += 1;

      // We've seen this term before, but no more often than the
      // (inclusive) threshold, so just update its counter
      if (value <= max_doc_count) {
        terms.put(term, value);
      } else {
        // Otherwise we've breached the threshold: remove the term
        // from the map and add it to the bloom filter
        terms.remove(term);
        bloom.add(term);
      }
    }
  } else {
    /*
      Nothing to do in this branch: inclusion in the bloom filter means
      the term has been seen more than max_doc_count times and so is no
      longer a "rare" term.

      Note that due to the nature of bloom filters, there will be some
      false positives, which translate into false negatives for the
      result: a genuinely rare term may be wrongly omitted from the list.
    */
  }
}

Note: this ignores all the technical bits about replaying cached docs, etc.
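
For concreteness, here is a minimal runnable translation of the pseudocode. Guava's BloomFilter is assumed purely as a stand-in; the real filter choice, sizing, and false-positive rate are open implementation questions:

import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;

import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Minimal, self-contained sketch of the collection loop above.
public class RareTermsSketch {

    private final int maxDocCount;
    private final Map<String, Integer> terms = new HashMap<>();
    private final BloomFilter<String> bloom = BloomFilter.create(
        Funnels.stringFunnel(StandardCharsets.UTF_8), 1_000_000, 0.01);

    RareTermsSketch(int maxDocCount) {
        this.maxDocCount = maxDocCount;
    }

    void collect(String term) {
        if (bloom.mightContain(term)) {
            return;  // already seen > max_doc_count times (modulo false positives)
        }
        int value = terms.merge(term, 1, Integer::sum);
        if (value > maxDocCount) {
            terms.remove(term);  // breached the threshold, no longer rare
            bloom.put(term);
        }
    }

    public static void main(String[] args) {
        RareTermsSketch agg = new RareTermsSketch(1);
        for (String t : new String[] {"a", "b", "a", "c", "a", "b"}) {
            agg.collect(t);
        }
        System.out.println(agg.terms);  // {c=1}  ("a" and "b" breached the threshold)
    }
}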

Some potential extensions:

  • If max_doc_count: 1, we don't need to store the count and could instead just use a Set to further reduce the size
  • We could potentially be clever and not use a full int for each counter and instead do some bit-packing (e.g. max_doc_count: 16 only needs 4 bits), but we wouldn't be able to use a nice, convenient map :) (see the sketch after this list)
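
As a rough illustration of the bit-packing idea, a hypothetical nibble-packed counter array might look like the following. The slot assignment (term -> index) is deliberately left out, which is exactly the convenient map we'd be giving up:

// Hypothetical helper: 16 four-bit counters per long, enough for
// max_doc_count: 16 (counts 1..16 map onto nibble values 0..15).
final class PackedCounters {
    private final long[] words;

    PackedCounters(int numCounters) {
        words = new long[(numCounters + 15) / 16];
    }

    int get(int i) {
        int shift = (i & 15) << 2;   // 4 bits per counter
        return (int) ((words[i >>> 4] >>> shift) & 0xFL);
    }

    void set(int i, int value) {     // value must fit in 4 bits
        int shift = (i & 15) << 2;
        long cleared = words[i >>> 4] & ~(0xFL << shift);
        words[i >>> 4] = cleared | ((long) (value & 0xF) << shift);
    }
}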

Properties

  • Size is linear in the number of "rare" terms, plus some constant overhead.
    • Antagonistic input (where the majority of terms are "rare") is still undeniably expensive, but better than a terms aggregation
    • The worst case, where all terms are rare, is basically equivalent to a terms agg with size: INT_MAX
    • Best we can tell, in all cases it is no worse than a terms aggregation in space or time.
  • Must be executed as a deferred bucket agg (e.g. collect_mode: breadth_first)
    • As a benefit, leaf aggregations are only executed on the actually-rare terms
  • Resulting rare terms are guaranteed to satisfy max_doc_count (i.e. no false positives)
  • Some terms may be rare but not represented in the list (i.e. false negatives are possible)
  • The max_doc_count threshold is configurable, although the tradeoff between this and the terms agg becomes less clear-cut as max_doc_count increases (e.g. max_doc_count: 1000 will likely include a large portion of a zipfian distribution, so a terms agg coming from the other direction may be better; see the back-of-the-envelope estimate after this list)
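
To put some numbers on that last point, here is a back-of-the-envelope estimate of how much of a Zipf(s=1) vocabulary would fall under max_doc_count: 1000. The corpus and vocabulary sizes are invented for illustration:

// Back-of-the-envelope: under a Zipf(s=1) distribution, the expected
// count of the rank-r term is roughly N / (r * H_V), so a term is "rare"
// (count <= max_doc_count) once r >= N / (max_doc_count * H_V).
public class ZipfRareEstimate {
    public static void main(String[] args) {
        long numDocs = 100_000_000L;   // total docs (made-up)
        long vocab = 1_000_000L;       // distinct terms (made-up)
        int maxDocCount = 1000;

        double hV = 0;                 // harmonic number H_V
        for (long r = 1; r <= vocab; r++) {
            hV += 1.0 / r;
        }

        double minRareRank = numDocs / (maxDocCount * hV);
        double fracRare = Math.max(0, 1.0 - minRareRank / vocab);

        // Prints roughly 99% for these numbers: almost the whole
        // vocabulary would land in the "rare" set.
        System.out.printf("~%.1f%% of distinct terms are rare%n", fracRare * 100);
    }
}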

Potentially this is sufficient for #17614 to be unblocked, and we can make the terms agg less trappy while providing better accuracy / space / time on rare-term aggregations.

I'm going to start poking at an implementation.

/cc @colings86 @jpountz
