
[Data Frame] Bad date_histogram format causes infinitely running indexer #43068

Closed
@benwtrent

Description

Problem

Users have the ability to shoot themselves in the foot without much warning with Data Frames.

Example:

{
  "source": { "index": "my-index-*"},
  "dest"  : { "index": "my-data-frame"},
  "pivot": {
    "group_by": {
      "@timestamp": {
        "date_histogram": {
          "field": "@timestamp",
          "calendar_interval": "1m",
          "format": "yyyy-MM-dd HH:00"  <--- Uh OH!!!!
        }
      }
    },
    "aggregations": {
      ....
    }
  }
}

This is a valid data frame definition, but the format of the key for the composite aggregation buckets has too few "time significant digits". This will result in many buckets that have the exact same pivot key. Two issues result from this (a short illustration follows the list):

  • The data frame will happily run forever. If there are enough documents to fill an entire page and they all fall within the same hour, the data frame will keep requesting the same page of the composite aggregation indefinitely.
  • Documents will overwrite each other. Since we generate the document _id values from the values of the composite aggregation bucket, all the buckets generated within the same hour would have the same _id, and only the very last bucket seen would be retained.
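
For illustration, here is a minimal standalone sketch (plain java.time, not the actual data frame code; the timestamps are made up) of why the format above collapses adjacent 1m buckets onto a single pivot key:

import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class FormatCollisionDemo {
    public static void main(String[] args) {
        // In the pattern, ":00" is literal text, so the minute component
        // never reaches the rendered bucket key.
        DateTimeFormatter fmt = DateTimeFormatter
            .ofPattern("yyyy-MM-dd HH:00")
            .withZone(ZoneOffset.UTC);

        Instant first  = Instant.parse("2019-06-10T13:01:00Z");
        Instant second = Instant.parse("2019-06-10T13:02:00Z"); // next 1m bucket

        System.out.println(fmt.format(first));  // 2019-06-10 13:00
        System.out.println(fmt.format(second)); // 2019-06-10 13:00  <- identical pivot key
    }
}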

Solutions???

Check if the interval and the format have the same time fidelity

The format field allows any of our valid time formats. We may be able to look at the base of the calendar_interval (e.g. m => minutes, h => hours, etc.) and compare it with a formatted timestamp. If we apply the provided format to an epoch timestamp where we know all the digits are non-zero, it should be possible to verify that the format has the same fidelity as (or higher than) the interval. A rough sketch of such a check follows the pros/cons below.

👍 Computationally efficient
👎 A tad complicated, logically
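
A rough sketch of what such a check could look like, using plain java.time rather than Elasticsearch's own date formatting machinery (the helper name and the unit handling are assumptions, not the real implementation):

import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

public class FormatFidelityCheck {

    // Hypothetical validation helper: the format keeps enough fidelity for the
    // interval if two timestamps one interval unit apart render differently.
    static boolean formatMatchesInterval(String pattern, ChronoUnit intervalUnit) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern(pattern);
        // Reference timestamp where every time field is non-zero, chosen so that
        // adding one unit does not roll over into a larger field.
        ZonedDateTime base = ZonedDateTime.parse("2019-05-21T13:14:15.123Z");
        ZonedDateTime next = base.plus(1, intervalUnit);
        return !fmt.format(base).equals(fmt.format(next));
    }

    public static void main(String[] args) {
        // The format from the example above loses the minutes -> reject.
        System.out.println(formatMatchesInterval("yyyy-MM-dd HH:00", ChronoUnit.MINUTES)); // false
        // Minute-level format for a 1m interval -> accept.
        System.out.println(formatMatchesInterval("yyyy-MM-dd HH:mm", ChronoUnit.MINUTES)); // true
    }
}

The reference timestamp has to be picked so that adding one unit does not cross a larger boundary (e.g. 13:59 -> 14:00 would make an hour-only format look fine for a minute interval), which is part of why this option is a tad complicated logically.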

Run sample queries and see if there are repeated keys

Only the date_histogram group_by would have to be considered. If the date_histogram aggregation is run against a subset of the data, with the supplied format, each non-empty bucket key should be checked to see if there are any repeats.

👍 simple
👎 computationally inefficient
👎 not reliable. What if the subset of queried data just happens to bucket such that the keys come out different?
Example of distinct keys despite an invalid format:

GET kibana_sample_data_flights/_search
{
  "aggs": {
    "buckets": {
      "composite": {
        "sources": [
          {
            "time": {
              "date_histogram": {
                "field": "timestamp",
                "interval": "second",
                "format": "yyyy-MM-dd HH:mm"
              }
            }
          }
        ]
      }
    }
  }
}
>"buckets" : [
        {
          "key" : {
            "time" : "2019-04-08 00:00"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "time" : "2019-04-08 00:02"
          },
          "doc_count" : 1
        },
        {
          "key" : {
            "time" : "2019-04-08 00:06"
          },
          "doc_count" : 1
        },...
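
The repeat check itself would be trivial; here is a minimal sketch over the formatted keys extracted from such a sample response (plain Java collections, the helper name is hypothetical), which also shows how the keys above would slip through:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateKeyCheck {

    // Returns true if any formatted bucket key appears more than once,
    // i.e. the format drops fidelity below the interval.
    static boolean hasDuplicateKeys(List<String> formattedKeys) {
        Set<String> seen = new HashSet<>();
        for (String key : formattedKeys) {
            if (!seen.add(key)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Keys from the sample response above: all distinct, so this check
        // would happily accept a format that is actually invalid.
        System.out.println(hasDuplicateKeys(List.of(
            "2019-04-08 00:00", "2019-04-08 00:02", "2019-04-08 00:06"))); // false
    }
}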
