
[ML] Determine when data is missing from a bucket due to Ingest latency #35131

Closed
@benwtrent

Description

Issue

When a Datafeed is configured, the end user provides a query_delay. At times this delay is too small, and consequently, when the Datafeed pulls data from the index(es), documents that have not yet been indexed are missed.

We currently do a poor job of detecting whether any data was missed and of alerting the user when it was.

Solution

A proposed solution is for a separate process in real-time Datafeeds to look at past finalized buckets and compare each bucket's event_count with the current actual count of documents in that bucket's time window, as determined by the user-provided query.

To capture discrepancies over an arbitrary number of past buckets, a date_histogram aggregation with interval=bucket_span can be used. When this is combined with the Datafeed's query, it gives an accurate count of what each bucket's event_count SHOULD be given the data currently in the index. Then, for each finalized bucket, we compare its event_count to the document count of the matching date_histogram bucket. If the true count is higher than the event_count, that is considered a discrepancy.
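
A minimal sketch of how this check could be wired up, assuming the server-side aggregation builder API of recent Elasticsearch versions; the class name MissedDataCheck, the method names, and the split into a count query plus a map comparison are illustrative placeholders, not a description of the actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.elasticsearch.index.query.QueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Hypothetical helper illustrating the proposed check; all names are placeholders.
final class MissedDataCheck {

    // Build a search that counts documents per bucket_span over [startMs, endMs),
    // reusing the Datafeed's own query so the counts reflect exactly what the
    // Datafeed would have pulled from the index(es).
    static SearchSourceBuilder buildCountSearch(QueryBuilder datafeedQuery,
                                                String timeField,
                                                long startMs,
                                                long endMs,
                                                DateHistogramInterval bucketSpan) {
        return new SearchSourceBuilder()
            .size(0)   // only the aggregation is needed, not the hits
            .query(QueryBuilders.boolQuery()
                .filter(datafeedQuery)
                .filter(QueryBuilders.rangeQuery(timeField).gte(startMs).lt(endMs)))
            .aggregation(AggregationBuilders.dateHistogram("bucket_counts")
                .field(timeField)
                .fixedInterval(bucketSpan));
    }

    // Compare the true per-bucket document counts (keyed by bucket start time in
    // epoch ms, extracted from the date_histogram response) with the event_count
    // of each finalized bucket. A higher true count means documents arrived after
    // the bucket was finalized, i.e. data was missed.
    static Map<Long, Long> findMissedCounts(Map<Long, Long> trueCounts,
                                            Map<Long, Long> finalizedEventCounts) {
        Map<Long, Long> missed = new LinkedHashMap<>();
        finalizedEventCounts.forEach((bucketStart, eventCount) -> {
            long trueCount = trueCounts.getOrDefault(bucketStart, 0L);
            if (trueCount > eventCount) {
                missed.put(bucketStart, trueCount - eventCount);
            }
        });
        return missed;
    }
}
```

The key design point is reusing the Datafeed's query as a filter on the count search, so the expected counts are computed against exactly the same document set the Datafeed queries.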

If a discrepancy is found, an audit message should be written suggesting an increase in query_delay. As more capabilities are added (possibly Annotations?), those could be used to give a better indication of how much data was missed over a given time range.
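
As a rough illustration of the alerting step, building on the hypothetical missedCounts map from the sketch above; the JobAuditor interface stands in for whatever ML notification mechanism is actually used, since the exact auditing API is not spelled out in this issue.

```java
import java.time.Instant;
import java.util.Map;

// Placeholder for the real notification mechanism (e.g. the audit messages
// surfaced in the job messages UI); defined locally for the sketch only.
interface JobAuditor {
    void warning(String jobId, String message);
}

final class MissedDataAlerter {

    // Summarize the per-bucket discrepancies and suggest increasing query_delay.
    static void audit(JobAuditor auditor, String jobId, Map<Long, Long> missedCounts) {
        if (missedCounts.isEmpty()) {
            return;
        }
        long totalMissed = missedCounts.values().stream().mapToLong(Long::longValue).sum();
        long latestBucketStart = missedCounts.keySet().stream().mapToLong(Long::longValue).max().getAsLong();
        auditor.warning(jobId, String.format(
            "Datafeed missed %d documents due to ingest latency, latest bucket with missing data [%s]. "
                + "Consider increasing query_delay",
            totalMissed, Instant.ofEpochMilli(latestBucketStart)));
    }
}
```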
