
[Data Frame] Bad date_histogram format causes infinitely running indexer #43068

Closed
@benwtrent

Description

Problem

Users have the ability to shoot themselves in the foot without much warning with Data Frames.

Example:

{
  "source": { "index": "my-index-*"},
  "dest"  : { "index": "my-data-frame"},
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1m",
          "format": "yyyy-MM-dd HH:00"  <--- Uh OH!!!!
        }
      }
    },
    "aggregations": {
      ....
    }
  }
}

This is a valid data frame definition, but the format of the key for the composite aggregation buckets has too few "time significant digits". This will result in many buckets that have the exact same pivot key. Two issues result from this (a short illustration follows the list):

  • The data frame will happily run forever. If there are enough documents to fill an entire page and they all fall within the same hour, the data frame will keep requesting the same page of the composite aggregation indefinitely.
  • Documents will overwrite each other. Since we generate the document _id values from the values of the composite aggregation bucket, all the buckets generated within the same hour would have the same _id, and only the very last bucket seen would be retained.
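
For illustration, here is a minimal standalone sketch (plain java.time, not the actual data frame code; the timestamps are made up) of why the format above collapses adjacent 1m buckets onto a single pivot key:

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class FormatCollisionDemo {
    public static void main(String[] args) {
        // In the pattern, ":00" is literal text, so the minute component
        // never reaches the rendered bucket key.
        DateTimeFormatter fmt = DateTimeFormatter
            .ofPattern("yyyy-MM-dd HH:00")
            .withZone(ZoneOffset.UTC);

        Instant first  = Instant.parse("2019-06-10T13:01:00Z");
        Instant second = Instant.parse("2019-06-10T13:02:00Z"); // next 1m bucket

        System.out.println(fmt.format(first));  // 2019-06-10 13:00
        System.out.println(fmt.format(second)); // 2019-06-10 13:00  <- identical pivot key
    }
}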

Solutions???

Check if the interval and the format have the same time fidelity

The format field allows any of our valid time formats. We may be able to look at the base of the calendar_interval (e.g. m => minutes, h => hours, etc.) and compare it with a formatted timestamp. If we apply the provided format to an epoch timestamp where we know all the digits are non-zero, it should be possible to verify that the format has the same fidelity as (or higher than) the interval. A rough sketch of such a check follows the pros/cons below.

👍 Computationally efficient
👎 A tad complicated, logically
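
A rough sketch of what such a check could look like, using plain java.time rather than Elasticsearch's own date formatting machinery (the helper name and the unit handling are assumptions, not the real implementation):

import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class FormatFidelityCheck {

    // Hypothetical validation helper: the format keeps enough fidelity for the
    // interval if two timestamps one interval unit apart render differently.
    static boolean formatMatchesInterval(String pattern, ChronoUnit intervalUnit) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern);
        // Reference timestamp where every time field is non-zero, chosen so that
        // adding one unit does not roll over into a larger field.
        ZonedDateTime base = ZonedDateTime.parse("2019-05-21T13:14:15.123Z");
        ZonedDateTime next = base.plus(1, intervalUnit);
        return !fmt.format(base).equals(fmt.format(next));
    }

    public static void main(String[] args) {
        // The format from the example above loses the minutes -> reject.
        System.out.println(formatMatchesInterval("yyyy-MM-dd HH:00", ChronoUnit.MINUTES)); // false
        // Minute-level format for a 1m interval -> accept.
        System.out.println(formatMatchesInterval("yyyy-MM-dd HH:mm", ChronoUnit.MINUTES)); // true
    }
}

The reference timestamp has to be picked so that adding one unit does not cross a larger boundary (e.g. 13:59 -> 14:00 would make an hour-only format look fine for a minute interval), which is part of why this option is a tad complicated logically.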

Run sample queries and see if there are repeated keys

Only the date_histogram group_by would have to be considered. If the date_histogram aggregation is run against a subset of the data, with the supplied format, each non-empty bucket key should be checked to see if there are any repeats.

👍 simple
👎 computationally inefficient
👎 not reliable. What if the subset of queried data just happens to bucket such that the keys come out different?
Example of distinct keys despite an invalid format:

GET kibana_sample_data_flights/_search
{
  "aggs": {
    "buckets": {
      "composite": {
        "sources": [
          {
            "time": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "second",
                "format": "yyyy-MM-dd HH:mm"
              }
            }
          }
        ]
      }
    }
  }
}
>"buckets" : [
        {
          "key" : {
            "time" : "2019-04-08 00:00"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "time" : "2019-04-08 00:02"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "time" : "2019-04-08 00:06"
          },
          "doc_count" : 1
        },...
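
The repeat check itself would be trivial; here is a minimal sketch over the formatted keys extracted from such a sample response (plain Java collections, the helper name is hypothetical), which also shows how the keys above would slip through:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateKeyCheck {

    // Returns true if any formatted bucket key appears more than once,
    // i.e. the format drops fidelity below the interval.
    static boolean hasDuplicateKeys(List<String> formattedKeys) {
        Set<String> seen = new HashSet<>();
        for (String key : formattedKeys) {
            if (!seen.add(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Keys from the sample response above: all distinct, so this check
        // would happily accept a format that is actually invalid.
        System.out.println(hasDuplicateKeys(List.of(
            "2019-04-08 00:00", "2019-04-08 00:02", "2019-04-08 00:06"))); // false
    }
}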
