Skip to content

Handle fields.with.dots in routing_path #83148

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Feb 2, 2022

Conversation

nik9000
Copy link
Member

@nik9000 nik9000 commented Jan 26, 2022

Our document parsing code treats:

{"foo.bar": "baz"}

the same as:

{"foo": {"bar": "baz"}}

but the code that extracted routing_path didn't! This is bad because
routing_path has to identify a subset of dimensions precisely and in the
same way that our mapping code identifies the dimensions. They have to
line up or two time series could end up on different shards. Sad times.

This makes routing_path interpret dots in field names as though they
were an object. It creates an option for the includes/excludes filters
on XContentParser to treats dots as objects. So the filter would see

{"foo.bar": "baz"}

as though it were:

{"foo": {"bar": "baz"}}

So the filter foo.bar would match both documents.

This is part of #82511.

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 26, 2022
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

@nik9000
Copy link
Member Author

nik9000 commented Jan 26, 2022

I think this is something @weizijun asked for in #82511 and I said "I don't really want to change how x content filtering works." What a difference a week makes, I guess! I still don't particularly want to change the way filtering works, but in this case it's very very very important to line up with our parsing code. And, I think, there might be cases like this in the future. Thus, this PR.

I'm going to have to rerun the benchmarks for this one. I mean, it's just one more branch, but still, we have the benchmarks. I should run them.

@elasticsearchmachine
Copy link
Collaborator

Hi @nik9000, I've created a changelog YAML for you.

Our document parsing code treats:
```
{"foo.bar": "baz"}
```
the same as:
```
{"foo": {"bar": "baz"}}
```
but the code that extracted routing_path didn't! This is bad because
routing_path has to identify a subset of dimensions precisely and in the
same way that our mapping code identifies the dimensions. They have to
line up or two time series could end up on different shards. Sad times.

This makes `routing_path` interpret dots in field names as though they
were an object. It creates an option for the includes/excludes filters
on `XContentParser` to treats dots as objects. So the filter would see
```
{"foo.bar": "baz"}
```
as though it were:
```
{"foo": {"bar": "baz"}}
```

So the filter `foo.bar` would match both documents.

This is part of elastic#82511.
@weizijun
Copy link
Contributor

cc @mushao999 , he is doing compare the different details between XContentParser and XContentMapValues.filter actions.

@nik9000
Copy link
Member Author

nik9000 commented Jan 26, 2022

cc @mushao999 , he is doing compare the different details between XContentParser and XContentMapValues.filter actions.

👍

One thing I noticed is that this config option forces dots to be, separators. So without the option f*r matches foo.bar but with the option it doesn't. I'll bet anything doing Regex.simpleMatch on the field names is going to still match f*r. but I'm not super familiar with what XContentMapValues.filter is doing here. And I really don't want to break backwards compatibility with whatever it's doing now. At least, not without very carefully thinking about it. Too many folks rely on it.

@nik9000
Copy link
Member Author

nik9000 commented Jan 26, 2022

run elasticsearch-ci/part-1

@nik9000
Copy link
Member Author

nik9000 commented Jan 26, 2022

run elasticsearch-ci/part-1

I caused kind of a mess with the new changelog generation process. I force pushed and smashed the one the bot made and Jenkins was not pleased. The bot remade the changelog but didn't restart the build.

@weizijun
Copy link
Contributor

but I'm not super familiar with what XContentMapValues.filter is doing here. And I really don't want to break backwards compatibility with whatever it's doing now. At least, not without very carefully thinking about it. Too many folks rely on it.

Yeah, a little bit different, @mushao999 has studied the detail of XContentMapValues.filter, later he will show the detail.

@nik9000
Copy link
Member Author

nik9000 commented Jan 26, 2022

@pugnascotia this failed the changelog validation process but I think the schema is missing an "area" so I'm adding one as part of this PR.

@nik9000
Copy link
Member Author

nik9000 commented Jan 26, 2022

I've chatted with @mushao999 and because #83152 is so similar to this he's going to extract the common bits of this into a new PR. He'll grab some tests from this. We'll poke at the PR together and then figure out what to do with this PR and #83152 once that's in.

@pugnascotia
Copy link
Contributor

@nik9000 new area LGTM 👍

elasticsearchmachine pushed a commit that referenced this pull request Jan 28, 2022
Current `FilterPathBasedFilter` does not support match fieldName with
dot. This PR merges the changes of #83148 and #83152 together to add the
ability of `matching fieldName with dot` to the `FilterPathBasedFilter`.
@nik9000
Copy link
Member Author

nik9000 commented Jan 28, 2022

I think this could be ready for another round of review from @csoulios or @romseygeek or @imotov .

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me as the first pass. I think we will need to revisit this at some point. We have a bit of a mess. While looking at this change I discovered that we ignore trailing dots. In other words "dim": "abc" and "dim.": "abc" are treated the same for the purposed of indexing and partially filtering (it is ignored in the field names but not in the mask).

In this PR doesn't deal with trailing dots the same way, but it is probably ok, the whole thing is weird anyway and we should revisit it sooner rather than later. Here is an example:

DELETE test
PUT test
{
  "settings": {
    "index.mode": "time_series",
    "index.routing_path": "dim",
    "index.time_series.start_time": "1970-01-01T00:00:01Z",
    "index.time_series.end_time": "9999-12-31T23:59:59.998Z"
  },
  "mappings": {
    "properties": {
      "dim": {
        "type": "keyword",
        "time_series_dimension": true
      },
      "val": {
        "type": "double"
      }
    }
  }
}

PUT test/_bulk
{ "create":{ } }
{"dim.": "bar", "val": 3, "@timestamp": "2022-01-06T19:04:11Z"}
{ "create":{ } }
{"dim": "bar", "val": 10, "@timestamp": "2022-01-06T19:04:42Z"}
{ "create":{ } }
{"dim": "bar", "val": 10, "@timestamp": 1641501593000}

and here is the rest. Note, how the leading dot triggers field name cannot contain only whitespace: ['.val'] error:

DELETE test

PUT test/_bulk
{ "create":{ } }
{"val": 3}
{ "create":{ } }
{"val.": 10}
{ "create":{ } }
{".val": 4}

POST test/_search
{
  "query": {
    "range": {
      "val": {
        "gte": 0,
        "lte": 20
      }
    }
  }, 
  "_source": ["val."]
}


POST test/_search
{
  "query": {
    "range": {
      "val": {
        "gte": 0,
        "lte": 20
      }
    }
  }, 
  "_source": ["val"]
}


POST test/_search
{
  "query": {
    "range": {
      "val": {
        "gte": 0,
        "lte": 20
      }
    }
  }, 
  "_source": [".val"]
}

@nik9000
Copy link
Member Author

nik9000 commented Feb 1, 2022

👍 on revisit. I think it's important that routing generation works "just like" how we extract dimension - which is all based on how the mapping code works. So if trailing .s are allowed there, we need to make them allowed here. But that can be a follow up. Also, ew. Trailing dots in field names is weird. I imagine folks usually do it by mistake? I dunno.

Copy link
Contributor

@romseygeek romseygeek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@nik9000 nik9000 merged commit be09f8b into elastic:master Feb 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>feature :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants