Description
Problem
With Data Frames, users can shoot themselves in the foot without much warning.
Example:
{
"source": { "index": "my-index-*"},
"dest" : { "index": "my-data-frame"},
"pivot": {
"group_by": {
"@timestamp": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "1m",
"format": "yyyy-MM-dd HH:00" <--- Uh OH!!!!
}
}
},
"aggregations": {
....
}
}
}
This is a valid data frame definition, but the key format for the composite aggregation buckets has too few "time significant digits": the interval is one minute, while the format only resolves to the hour. As a result, many buckets share the exact same pivot key. Two issues result from this:
- The data frame will happily run forever. If there are enough documents to span an entire page, such that all the documents are in the same hour, the data frame will continue to request the same page of the composite aggregation infinitely.
- Documents will overwrite each other. Since we generate the document _id values by the values of the composite aggregation bucket, all the buckets generated in the same hour would have the same _id and only the very last bucket seen would be retained.
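To make the overwrite concrete, here is a minimal sketch in Python. It is not how the data frame indexer actually builds _id values: a plain dict stands in for the destination index, the formatted bucket key stands in for the _id, and the strftime pattern "%Y-%m-%d %H:00" is an assumed equivalent of the "yyyy-MM-dd HH:00" format above. The helper name index_buckets is hypothetical.

```python
from datetime import datetime, timedelta

# Assumed strftime stand-in for the Elasticsearch pattern "yyyy-MM-dd HH:00".
COARSE_FORMAT = "%Y-%m-%d %H:00"


def index_buckets(start, minutes, fmt):
    """Simulate indexing one document per 1m bucket, keyed by the formatted key.

    The dict plays the role of the destination index, so a repeated key
    overwrites the earlier document, as a repeated _id would.
    """
    dest = {}
    for i in range(minutes):
        bucket_start = start + timedelta(minutes=i)
        key = bucket_start.strftime(fmt)
        dest[key] = {"bucket_start": bucket_start.isoformat()}
    return dest


# Sixty 1m buckets within the same hour collapse into a single document.
dest = index_buckets(datetime(2019, 4, 8, 0, 0), 60, COARSE_FORMAT)
print(len(dest), dest)
```

With a pattern that resolves to the minute (for example "%Y-%m-%d %H:%M"), the same call keeps all sixty documents.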
Solutions???
Check if the interval and the format have the same time fidelity
The format field allows any of our valid time formats. We may be able to look at the base of the calendar_interval (e.g. m => minutes, h => hours, etc.) and compare it with a formatted timestamp. If we apply the provided format to an epoch timestamp where we know all the digits are non-zero, it should be possible to verify that the format has the same fidelity as the interval, or higher.
👍 Computationally efficient
👎 A tad complicated, logically
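A rough sketch of this check, in Python rather than in the actual codebase. It uses a slight variation of the idea: format a reference timestamp with all non-zero fields, format the same timestamp shifted by one interval, and flag the format if both strings come out identical. The function name, the unit table, and the use of strftime patterns in place of Elasticsearch date formats are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical mapping from an interval unit suffix to a step size.
# Months, quarters and years are left out because they are not fixed-length.
UNIT_STEP = {
    "s": timedelta(seconds=1),
    "m": timedelta(minutes=1),
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
}

# Reference instant whose date and time fields are all non-zero.
REFERENCE = datetime(2019, 11, 12, 13, 14, 15)


def format_resolves_interval(interval, strftime_format):
    """Return True if two instants one interval apart format differently."""
    count, unit = int(interval[:-1]), interval[-1]
    if unit not in UNIT_STEP:
        raise ValueError(f"unsupported interval unit: {unit!r}")
    step = count * UNIT_STEP[unit]
    before = REFERENCE.strftime(strftime_format)
    after = (REFERENCE + step).strftime(strftime_format)
    return before != after


# The definition from the example: 1m interval, hour-level format.
print(format_resolves_interval("1m", "%Y-%m-%d %H:00"))
```

This only catches formats that are too coarse at the fine end. A format that drops the date entirely (for example "%H:%M" with a 1m interval) would pass here even though keys repeat from one day to the next, so a real check would need more than this single comparison.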
Run sample queries and see if there are repeated keys
Only the date_histogram group_by would have to be considered. If the date_histogram aggregation is run against a subset of the data, with the supplied format, each non-empty bucket key should be checked to see if there are any repeats.
👍 simple
👎 computationally inefficient
👎 not reliable. What if the sampled subset of the data just happens to land in buckets whose keys are all different?
Example of different keys but invalid format:
GET kibana_sample_data_flights/_search
{
"aggs": {
"buckets": {
"composite": {
"sources": [
{
"time": {
"date_histogram": {
"field": "timestamp",
"interval": "second",
"format": "yyyy-MM-dd HH:mm"
}
}
}
]
}
}
}
}
"buckets" : [
{
"key" : {
"time" : "2019-04-08 00:00"
},
"doc_count" : 1
},
{
"key" : {
"time" : "2019-04-08 00:02"
},
"doc_count" : 1
},
{
"key" : {
"time" : "2019-04-08 00:06"
},
"doc_count" : 1
},...
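The repeated-key check itself could look roughly like the Python sketch below. The function name repeated_keys is hypothetical, and the sample list just copies the three buckets shown above. It also illustrates the reliability problem: the format is too coarse for a one second interval, yet this sample contains no repeated key, so the check passes.

```python
from collections import Counter


def repeated_keys(buckets):
    """Return the composite keys that occur in more than one bucket."""
    # Composite keys are dicts, so turn each into a hashable, order-independent tuple.
    counts = Counter(tuple(sorted(b["key"].items())) for b in buckets)
    return [dict(key) for key, n in counts.items() if n > 1]


# The three buckets from the response above.
sample = [
    {"key": {"time": "2019-04-08 00:00"}, "doc_count": 1},
    {"key": {"time": "2019-04-08 00:02"}, "doc_count": 1},
    {"key": {"time": "2019-04-08 00:06"}, "doc_count": 1},
]

# No repeats in this sample, even though the format drops the seconds.
print(repeated_keys(sample))
```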