-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Handle fields.with.dots in routing_path #83148
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pinging @elastic/es-analytics-geo (Team:Analytics) |
Hi @nik9000, I've created a changelog YAML for you. |
I think this is something @weizijun asked for in #82511 and I said "I don't really want to change how x content filtering works." What a difference a week makes, I guess! I still don't particularly want to change the way filtering works, but in this case it's very very very important to line up with our parsing code. And, I think, there might be cases like this in the future. Thus, this PR. I'm going to have to rerun the benchmarks for this one. I mean, it's just one more branch, but still, we have the benchmarks. I should run them. |
7d49d55
to
ffd45bb
Compare
Hi @nik9000, I've created a changelog YAML for you. |
Our document parsing code treats: ``` {"foo.bar": "baz"} ``` the same as: ``` {"foo": {"bar": "baz"}} ``` but the code that extracted routing_path didn't! This is bad because routing_path has to identify a subset of dimensions precisely and in the same way that our mapping code identifies the dimensions. They have to line up or two time series could end up on different shards. Sad times. This makes `routing_path` interpret dots in field names as though they were an object. It creates an option for the includes/excludes filters on `XContentParser` to treats dots as objects. So the filter would see ``` {"foo.bar": "baz"} ``` as though it were: ``` {"foo": {"bar": "baz"}} ``` So the filter `foo.bar` would match both documents. This is part of elastic#82511.
cc @mushao999 , he is doing compare the different details between XContentParser and XContentMapValues.filter actions. |
👍 One thing I noticed is that this config option forces dots to be, separators. So without the option |
run elasticsearch-ci/part-1 |
I caused kind of a mess with the new changelog generation process. I force pushed and smashed the one the bot made and Jenkins was not pleased. The bot remade the changelog but didn't restart the build. |
Yeah, a little bit different, @mushao999 has studied the detail of |
@pugnascotia this failed the changelog validation process but I think the schema is missing an "area" so I'm adding one as part of this PR. |
I've chatted with @mushao999 and because #83152 is so similar to this he's going to extract the common bits of this into a new PR. He'll grab some tests from this. We'll poke at the PR together and then figure out what to do with this PR and #83152 once that's in. |
@nik9000 new area LGTM 👍 |
I think this could be ready for another round of review from @csoulios or @romseygeek or @imotov . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me as the first pass. I think we will need to revisit this at some point. We have a bit of a mess. While looking at this change I discovered that we ignore trailing dots. In other words "dim": "abc"
and "dim.": "abc"
are treated the same for the purposed of indexing and partially filtering (it is ignored in the field names but not in the mask).
In this PR doesn't deal with trailing dots the same way, but it is probably ok, the whole thing is weird anyway and we should revisit it sooner rather than later. Here is an example:
DELETE test
PUT test
{
"settings": {
"index.mode": "time_series",
"index.routing_path": "dim",
"index.time_series.start_time": "1970-01-01T00:00:01Z",
"index.time_series.end_time": "9999-12-31T23:59:59.998Z"
},
"mappings": {
"properties": {
"dim": {
"type": "keyword",
"time_series_dimension": true
},
"val": {
"type": "double"
}
}
}
}
PUT test/_bulk
{ "create":{ } }
{"dim.": "bar", "val": 3, "@timestamp": "2022-01-06T19:04:11Z"}
{ "create":{ } }
{"dim": "bar", "val": 10, "@timestamp": "2022-01-06T19:04:42Z"}
{ "create":{ } }
{"dim": "bar", "val": 10, "@timestamp": 1641501593000}
and here is the rest. Note, how the leading dot triggers field name cannot contain only whitespace: ['.val']
error:
DELETE test
PUT test/_bulk
{ "create":{ } }
{"val": 3}
{ "create":{ } }
{"val.": 10}
{ "create":{ } }
{".val": 4}
POST test/_search
{
"query": {
"range": {
"val": {
"gte": 0,
"lte": 20
}
}
},
"_source": ["val."]
}
POST test/_search
{
"query": {
"range": {
"val": {
"gte": 0,
"lte": 20
}
}
},
"_source": ["val"]
}
POST test/_search
{
"query": {
"range": {
"val": {
"gte": 0,
"lte": 20
}
}
},
"_source": [".val"]
}
👍 on revisit. I think it's important that routing generation works "just like" how we extract dimension - which is all based on how the mapping code works. So if trailing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Our document parsing code treats:
the same as:
but the code that extracted routing_path didn't! This is bad because
routing_path has to identify a subset of dimensions precisely and in the
same way that our mapping code identifies the dimensions. They have to
line up or two time series could end up on different shards. Sad times.
This makes
routing_path
interpret dots in field names as though theywere an object. It creates an option for the includes/excludes filters
on
XContentParser
to treats dots as objects. So the filter would seeas though it were:
So the filter
foo.bar
would match both documents.This is part of #82511.