Distinct Terms Aggregations #23818

Closed
@gingerwizard


This ticket requests a new terms aggregation that identifies distinct values in a field, based on a query restriction. The aggregation would return the values of a field that occur in the matching document set but do not exist in the non-matching documents. This can be used to answer questions such as "Give me the values for a field which are new in the last N minutes".

This feature request comes out of the use of Watcher to detect new values in fields, a common requirement in security analytics. Problems such as "lateral movement in communications" (a user logs onto a new server) or "new process started on a server" currently require multiple query executions. This has led to multiple implementations of watches, with varying degrees of efficiency.

Noting the following approach, as discussions with @clintongormley indicate it is currently the best way to detect new values in a field:

  • Identify the values in the matching set - X
  • Query with the values X using a terms query. This identifies the values common to the non-matching and matching document sets - Y
  • Run a script which identifies the values in X that are NOT in Y.
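
The three steps above can be sketched roughly as follows. This is only an illustration, not the linked watch: the `@timestamp` field, the `process_name` field used in the tests, the aggregation name `values`, the `max_values` default, and all function names are assumptions made for the example. The functions only build request bodies and post-process responses; sending them to a cluster is left out.

```python
from typing import Any, Dict, Iterable, Set


def recent_values_query(field: str, minutes: int, max_values: int = 1000) -> Dict[str, Any]:
    """Step 1: collect the values of `field` in the matching set (last N minutes) -> X."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
        "aggs": {"values": {"terms": {"field": field, "size": max_values}}},
    }


def seen_before_query(field: str, values: Iterable[str], minutes: int) -> Dict[str, Any]:
    """Step 2: look for the values X in the non-matching (older) documents -> Y."""
    wanted = sorted(set(values))
    return {
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {"terms": {field: wanted}},
                    {"range": {"@timestamp": {"lt": f"now-{minutes}m"}}},
                ]
            }
        },
        # At most len(X) distinct values of interest can come back.
        "aggs": {"values": {"terms": {"field": field, "size": max(len(wanted), 1)}}},
    }


def bucket_keys(response: Dict[str, Any]) -> Set[str]:
    """Extract the bucket keys of the `values` terms aggregation from a search response."""
    return {bucket["key"] for bucket in response["aggregations"]["values"]["buckets"]}


def new_values(x: Iterable[str], y: Iterable[str]) -> Set[str]:
    """Step 3: the values in X that are NOT in Y, i.e. the 'new' values."""
    return set(x) - set(y)
```

Because the second request carries every value in X, its size grows with the number of distinct values in the matching set, which is why the approach depends on that set being small.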

An example watch for detecting a new process can be found here.

This specific watch looks for values in the last N mins for a field, identifying those which are "new".

Given the frequency of this requirement in Watcher, @clintongormley suggested potentially implementing this in ES. The above approach is efficient, provided the set X of values from the matching documents is small. In most cases, we believe the set of matching documents would be smaller than the set of those that do not match, so the above approach is likely the most efficient. Discussion is pending as to whether this can be implemented at a node or shard level within ES. #12316 may enable this to be implemented.
This would require careful documentation that the aggregation is performant provided the matching set is small. We may also want to narrow the requirement to detecting "new" values using time only, with the restriction within the aggregation.

cc @sarwarbhuiyanm @LucaWintergerst @mikeh-elastic @MikePaquette, who have encountered similar requirements.
@skearns64
