Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] [RFC] Query Shape Field Data Type #69

Closed
dzane17 opened this issue Aug 7, 2024 · 25 comments · Fixed by #140
Closed

[FEATURE] [RFC] Query Shape Field Data Type #69

dzane17 opened this issue Aug 7, 2024 · 25 comments · Fixed by #140
Assignees
Labels
enhancement New feature or request v2.18.0 Issues targeting release v2.18.0

Comments

@dzane17
Copy link
Collaborator

dzane17 commented Aug 7, 2024

Background

Query shape currently has the ability to append field names. In addition, we'd like the option to add field data type (often called fieldType in code).

In this example, field1, field2, field3 is are field names and text, date, keyword are data types.

bool
  filter:
    terms [field1:text]
  must:
    range [field2:date]
    term [field3:keyword]
  must_not:
    exists [field4:keyword]
  should:
    match [field5:text]
sort:
  asc [field6:number]
  desc [field2:date]
aggregation:
  terms [field3:keyword]
    aggregation:
      avg [field6:number]
      cardinality [field3:keyword]
      date_histogram [field2:date]
      max [field6:number]
      min [field6:number]
      percentile_ranks [field6:number]
      sum [field6:number]

Problem

It is impossible to determine data type from just the search source.

In this match query, field name is "title", however the exact data type is unknown. It could be text or keyword.

GET /my_index/_search
{
  "query": {
    "match_phrase": {
      "title": "Second Document"
    }
  }
}

What solution would you like?

Three possible solutions:

  1. Record dataType in each *QueryBuilder, *AggregationBuilder, *SortBuilder class during core search execution. Then when records are processed in query-insights plugin we can simply call aggBuilder.getDataType(). I have seen builder classes get the data type mapping from shardContext like:
final MappedFieldType fieldType = shardContext.fieldMapper(fieldName);

Cons: Not all builders look up data type, Need to edit *Builder classes in core

  1. Fetch mappings from query-insights plugin. Then we can get data type from known field name.
    Note: Need to find a way to fetch _mapping data from query-insights

  2. Ignore field data type in query shape

In many cases, data type is known given the query/agg/sort type. For example, data type for date histogram aggregations is always date, boolean query is always boolean data type. In these cases, adding data type adds no value:

    aggregation:
      date_histogram [date]

vs. 


    aggregation:
      date_histogram

On the other hand, Range queries support a variety of data types so we would lose information with this option.
a. Date

GET /my_index/_search
{
  "query": {
    "range": {
      "publish_date": {
        "gte": "2024-01-01",
        "lte": "2024-12-31"
      }
    }
  }
}

b. Keyword

GET /my_index/_search
{
  "query": {
    "range": {
      "numeric_as_string": {
        "gte": "10",
        "lte": "100"
      }
    }
  }
}

c. Numeric (which consists of int, long, float, double, short, byte)

GET /my_index/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 10,
        "lte": 100
      }
    }
  }
}

Other Consideration

  1. When data type is String or Numeric, do we need to distinguish between specific types (eg. int, long, float, double, short, byte)?
@dzane17 dzane17 added enhancement New feature or request untriaged labels Aug 7, 2024
@ansjcy
Copy link
Member

ansjcy commented Aug 8, 2024

Thanks for the RFC!

Not all builders look up data type, Need to edit *Builder classes in core

Do we know what is the coverage? i.e. how many builders and what are the builders have data type available (can we attach a more detailed list here [1] )?

Fetch mappings from query-insights plugin.

This option is not as intrusive as #1, but there are chances the mapping could be outdated. If we want to keep the mapping up to date, we need to constantly fetch and refersh them.

Instead of adding field data type to all query types, we should probably identify the important ones and only add type info for them. For example, as you mentioned we can always add types for range queries if type information can be easily fetched based on researches done in [1].

@dzane17
Copy link
Collaborator Author

dzane17 commented Aug 9, 2024

@ansjcy Looks like most builders don't already look up data type

Count of all Query/Agg/Sort Builders

% find ./server/src/main/java/org/opensearch -type f -name '*QueryBuilder.java' | wc -l
      50
% find ./server/src/main/java/org/opensearch -type f -name '*AggregationBuilder.java' | wc -l
      62
% find ./server/src/main/java/org/opensearch -type f -name '*SortBuilder.java' | wc -l
       6

Count of Builders which look up data type

% find ./server/src/main/java/org/opensearch -type f -name '*QueryBuilder.java' -exec grep -l 'MappedFieldType' {} + | wc -l
      23
% find ./server/src/main/java/org/opensearch -type f -name '*AggregaionBuilder.java' -exec grep -l 'MappedFieldType' {} + | wc -l
       0
% find ./server/src/main/java/org/opensearch -type f -name '*SortBuilder.java' -exec grep -l 'MappedFieldType' {} + | wc -l
       2

@jainankitk
Copy link
Collaborator

@dzane17 - Thank you for putting up this proposal and different possible options. While I am leaning towards option 1, I am wondering about below regarding the same:

  • How much work are talking per *Builder class? I am not concerned as much about the number of classes, as I am about how intrusive the change is.
  • Are there common/base *Builder class used across multiple subclasses not requiring change in all the subclasses?
  • How will be sustain this in future as and when new *Builder class are introduced?

I am slightly less in favor of Option 2, as I see no good reason for Query Insights to have the logic for parsing index mapping.

@dzane17
Copy link
Collaborator Author

dzane17 commented Aug 16, 2024

@jainankitk

  • How much work are talking per *Builder class? I am not concerned as much about the number of classes, as I am about how intrusive the change is.

Query Builders look up field type during the doToQuery() method. We can save this value as a class variable during search execution. Then we will have it when search records are processed in the query insights plugin. This way the change is not very intrusive. Here are some examples of data type lookup in different Builder classes:

As mentioned previously, not all *Builder classes currently look up fieldType. In those instances we will not have data type or will need to add a lookup.

  • Are there common/base *Builder class used across multiple subclasses not requiring change in all the subclasses?

Yes, we can add the class variable and getter method in some parent class, but we will still need to edit child Builder classes to set the values.

  • How will be sustain this in future as and when new *Builder class are introduced?

New builder classes will not work by default. They will require onboarding in up to 2 places

  1. Query-Insights plugin to read data type of the new Builder
  2. OpenSearch core Builder class to store data type

@deshsidd
Copy link
Collaborator

@dzane17 Thanks for the proposal and I too am leaning towards option 1 as we should avoid having this logic in the query-insights plugin.

For option 1 I am concerned regarding the following:

This option is not as intrusive as #1, but there are chances the mapping could be outdated. If we want to keep the mapping up to date, we need to constantly fetch and refresh them.

Please let us know if this would be an issue and how we plan to handle this?

@dzane17
Copy link
Collaborator Author

dzane17 commented Aug 19, 2024

@deshsidd Actually this concern is related to option 2. There is a time interval between core search execution and generating query shape in Query Insights plugin, so we cannot guarantee that the index mapping for a particular field is present or accurate. This will impact a minute percent of query shapes but is certainly a flaw. Even if we routinely fetch index mappings from Query Insights, I don't see a way to lock the exact mappings at the time of execution.

@deshsidd
Copy link
Collaborator

Got it. In that case approach 1 looks good to me. Only drawback I see is the following:

New builder classes will not work by default. They will require onboarding in up to 2 places

Is there any way we can enforce new builders to onboard by default (at least in the core code?).

@jainankitk
Copy link
Collaborator

jainankitk commented Aug 20, 2024

Query Builders look up field type during the doToQuery() method. We can save this value as a class variable during search execution. Then we will have it when search records are processed in the query insights plugin. This way the change is not very intrusive. Here are some examples of data type lookup in different Builder classes:

As mentioned previously, not all *Builder classes currently look up fieldType. In those instances we will not have data type or will need to add a lookup.

Yes, we can add the class variable and getter method in some parent class, but we will still need to edit child Builder classes to set the values.

New builder classes will not work by default. They will require onboarding in up to 2 places

Query-Insights plugin to read data type of the new Builder
OpenSearch core Builder class to store data type

Thanks @dzane17 for the response. I am wondering if we can modify the toQuery method within AbstractQueryBuilder since it already has access to the MapperService via QueryShardContext. Similar to abstract doToQuery method, we can have method for getting the fieldName that can be implemented by any new implementation of AbstractQueryBuilder. Hence, we won't need to make any changes to any concrete QueryBuilder for getting the type, but just provide way to get the fieldName. I guess we can also get rid of the HashMap in Query Insights plugin for getting the fieldNames with this approach. This should also ensure that any new QueryBuilder implements this abstract method for getting the field name, thus automatically providing us a way to get the field name and corresponding type.

@dblock dblock removed the untriaged label Aug 26, 2024
@dblock
Copy link
Member

dblock commented Aug 26, 2024

[Catch All Triage - 1, 2, 3, 4, 5]

@msfroh
Copy link

msfroh commented Aug 31, 2024

A possible solution that's backwards compatible and won't break plugins would be adding a HasFieldName interface that exposes the fieldName method. Then you add that interface to every QueryBuilder that has a fieldName method. Then in your QueryBuilderVisitor method, you do an instanceof check to see if the visited query implements the interface.

@jainankitk
Copy link
Collaborator

A possible solution that's backwards compatible and won't break plugins would be adding a HasFieldName interface that exposes the fieldName method. Then you add that interface to every QueryBuilder that has a fieldName method. Then in your QueryBuilderVisitor method, you do an instanceof check to see if the visited query implements the interface.

@msfroh - While this works well if only the fieldName were required, we also need the corresponding fieldType (for richer query shape data) which can be seamlessly computed within the toQuery method as it has access to the QueryShardContext and thus the MapperService. Probably, the same change works well without having additional abstract method, only caveat being the future extensions of AbstractQueryBuilder missing the fieldName, and thus fieldType information. Maybe, it is the right tradeoff to not risk breaking present consumers:

    /**
     * Default method for child classes which do not have a custom {@link #fieldName()} implementation.
     */
    public String fieldName() {
        return null; // To be overridden for classes supporting fieldName
    };

    /**
     * Returns field type as String for QueryBuilder classes which have a defined fieldName.
     * Else returns null.
     */
    public final String getFieldType() {
        return fieldType;
    };

    @Override
    public final Query toQuery(QueryShardContext context) throws IOException {
        <existing stuff>
        fieldType = context.getFieldTypeString(fieldName());
        return query;
    }

We also considered an interface similar to MultiTermQueryBuilder for getting both thefieldName and fieldType information, but could not find seamless way of plugging in the fieldType implementation like the current approach. Maybe another way is to introduce extension of AbstractQueryBuilder with the fieldName and fieldType information, but not sure if it is worth forking AbstractQueryBuilder just for this reason.

@msfroh
Copy link

msfroh commented Sep 1, 2024

There is no trade-off. I don't know how I can make this any clearer:

WE ARE NOT GOING TO STOP CURRENT PLUGINS FROM COMPILING AGAINST 2.X.

Edit: also, don't try to piggy-back on the toQuery call. If the query insights plug-in needs mapping information, it should take a dependency on MapperService itswlf, via its plugin initialization.

@jainankitk
Copy link
Collaborator

Edit: also, don't try to piggy-back on the toQuery call. If the query insights plug-in needs mapping information, it should take a dependency on MapperService itswlf, via its plugin initialization.

@msfroh - There is no piggy-backing here. Currently, QueryShapeGenerator is unaware of the Index / Shard, the shape is being generated for. It just needs to QueryBuilder for generating the shape. The question is not just about taking dependency on the MapperService, but making QueryShapeGenerator shard aware which is unnecessary IMO.

@msfroh
Copy link

msfroh commented Sep 1, 2024

Then what does toQuery have to do with anything? Query insights doesn't need to be involved when the QB gets turned into a query.

You said that Query Insights needs to get from the field name to the field type. That requires the index name (which it can get from the SearchRequest) and the MappingService.

Note that by design, QueryBuilder doesn't hold info about field types. QueryBuilder is essentially an AST for the query. It uses context (without holding onto it) to produce the appropriate Lucene query for the mapping.

@jainankitk
Copy link
Collaborator

Note that by design, QueryBuilder doesn't hold info about field types. QueryBuilder is essentially an AST for the query. It uses context (without holding onto it) to produce the appropriate Lucene query for the mapping.

QueryBuilder being an AST makes strong case for it to contain the field information!? The leaf nodes in AST are nothing but the expression, where one of the operand generally is the field name.

@msfroh
Copy link

msfroh commented Sep 2, 2024

QueryBuilder being an AST makes strong case for it to contain the field information!?

If you look at https://en.wikipedia.org/wiki/Compiler#Front_end, the QueryBuilder is more precisely the "parse tree", output by syntax analysis, before semantic analysis. The semantic information is mixed via the toQuery to produce the Lucene queries. The QueryBuilder does not "become" the Lucene query, but rather it is able to combine its own syntax with the semantic context (i.e. the field mappings) to produce the Lucene query.

Can you please take the time to understand how QueryBuilder works before breaking a core OpenSearch class?

@jainankitk
Copy link
Collaborator

If you look at https://en.wikipedia.org/wiki/Compiler#Front_end, the QueryBuilder is more precisely the "parse tree", output by syntax analysis, before semantic analysis. The semantic information is mixed via the toQuery to produce the Lucene queries. The QueryBuilder does not "become" the Lucene query, but rather it is able to combine its own syntax with the semantic context (i.e. the field mappings) to produce the Lucene query.

QueryBuilder within OpenSearch is definitely more than just the "parse tree". It has more complex functionalities like query rewrite which comes even after semantic analysis. Also, can we limit the scope of this discussion to OpenSearch specific implementation? Knowing generic compiler theory from wikipedia will not help us finalize the approach here.

Can you please take the time to understand how QueryBuilder works before breaking a core OpenSearch class?

Are you still referring to the abstract method when breaking a core OpenSearch class? We already agreed to not have the abstract method.

@msfroh
Copy link

msfroh commented Sep 3, 2024

It has more complex functionalities like query rewrite which comes even after semantic analysis

What does the rewrite method take as its input? It takes the semantic context, which it uses to return a different QueryBuilder (which still does not hold the semantic context, but rather the query may be simpler/more efficient based on semantic context).

@msfroh
Copy link

msfroh commented Sep 3, 2024

The point that I'm trying (and evidently failing) to communicate is that QueryBuilder models syntax not semantics. The mapped field types are part of the semantic context in which a query executes.

This is why I googled "abstract syntax tree" to see where the confusion might lie. In my mind, syntax is distinct from semantics.

@msfroh
Copy link

msfroh commented Sep 3, 2024

Are you still referring to the abstract method when breaking a core OpenSearch class? We already agreed to not have the abstract method.

I'm referring to the code snippet in #69 (comment), which is also a terrible idea.

@msfroh
Copy link

msfroh commented Sep 4, 2024

Chatted w/ @jainankitk offline. We realized that one of the biggest challenges here is the fact that Query Insights runs at the coordinator layer, not the shard layer. So, the toQuery logic in QueryBuilder isn't even helpful, since that runs at the shard level (and the shards don't sent the QueryBuilder back).

At the coordinator level, we don't have much information at all about the index mappings, which surprises me a little. (I had always assumed there would be some high-level awareness of index schema at the coordinator level.) The closest path that I see is ClusterService -> ClusterState (via state()) -> IndexMetadata (via getIndices())-> MappingMetadata, which has the mapping source. Using sourceAsMap(), you can read off field definitions, but it's not a proper object representation. It does look like other plugins do that, though, like the kNN plugin: https://github.com/opensearch-project/k-NN/blob/8a353422ce681a83a05f26af20817f14d42b2b3e/src/main/java/org/opensearch/knn/index/util/IndexUtil.java#L88

Here are more examples of OpenSearch plugins looking at MappingMetadata: https://github.com/search?q=org%3Aopensearch-project+MappingMetadata+NOT+repo%3Aopensearch-project%2FOpenSearch&type=code

@jainankitk
Copy link
Collaborator

Thanks @msfroh for succinctly capturing our discussion offline!

We realized that one of the biggest challenges here is the fact that Query Insights runs at the coordinator layer, not the shard layer. So, the toQuery logic in QueryBuilder isn't even helpful, since that runs at the shard level (and the shards don't sent the QueryBuilder back).

This is a strong reason to not have fieldType information as part of the QueryBuilder.

At the coordinator level, we don't have much information at all about the index mappings, which surprises me a little. (I had always assumed there would be some high-level awareness of index schema at the coordinator level.)

Yeah, this is big part of the problem.

The closest path that I see is ClusterService -> ClusterState (via state()) -> IndexMetadata (via getIndices())-> MappingMetadata, which has the mapping source. Using sourceAsMap(), you can read off field definitions, but it's not a proper object representation. It does look like other plugins do that, though, like the kNN plugin: https://github.com/opensearch-project/k-NN/blob/8a353422ce681a83a05f26af20817f14d42b2b3e/src/main/java/org/opensearch/knn/index/util/IndexUtil.java#L88
Here are more examples of OpenSearch plugins looking at MappingMetadata: https://github.com/search?q=org%3Aopensearch-project+MappingMetadata+NOT+repo%3Aopensearch-project%2FOpenSearch&type=code

Thanks for these pointers. I am working with @dzane17 to see if there is good way of getting the type information corresponding to field name. Although, I am also wondering if search request is targeting multiple indices, the same field might be indexed as multiple types across those indices. For example - a field called id could be indexed as both int or long across 2 indices. Is that even possible? If yes, how do we even decide the type of field for such cases (within context of coordinator search request)?

@msfroh
Copy link

msfroh commented Sep 4, 2024

For example - a field called id could be indexed as both int or long across 2 indices. Is that even possible? If yes, how do we even decide the type of field for such cases (within context of coordinator search request)?

As far as I know, it's possible. In that case, on the one index, the query will run against an int field and on the other will run against a long field. From the Query Insights standpoint, it's technically both.

This is where the idea of a coordinator-level "field type" is a bit messy. Coordinators just deal with the syntactic structure of the query. The semantic field type context is only resolved on the shards (and different shards may resolve different types).

@jainankitk
Copy link
Collaborator

As far as I know, it's possible. In that case, on the one index, the query will run against an int field and on the other will run against a long field. From the Query Insights standpoint, it's technically both.

Although possible, I would assume such cases to be rare for same workload type. Having different types associated for same field names would be pretty confusing for the client as well, I guess!? If true, from Query Insights perspective, we can ensure the field type across all the indices within search request is same, else we ignore the field type.

This is where the idea of a coordinator-level "field type" is a bit messy. Coordinators just deal with the syntactic structure of the query. The semantic field type context is only resolved on the shards (and different shards may resolve different types).

Ideally speaking, coordinators should be resolving "field type" at index-level for doing the necessary validations. Fanning out the request to 100s of shards and duplicating validation seems wasteful to me

@ansjcy
Copy link
Member

ansjcy commented Sep 10, 2024

Following up from the discussions and PRs, looks like with @dzane17 's changes we should be able to get field name form the *QueryBuilders now.

Also once fieldName is known, we can lookup fieldType via index mappings.

Index mappings can be stored as part of the cluster state, which is held by the master nodes of the cluster, or in the metadata of each index on disk. If we get the index mapping for each search request, it will potentially have an impact on the performance.
In this case, do we plan to maintain some local cache / map to keep track of the field type / index mapping? it might be over-engineering because field type might be a small use case. If we cannot justify the sacrifice for performance / engineering effort, can we simply ignore the field type for those are not available?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request v2.18.0 Issues targeting release v2.18.0
Projects
Status: New
Status: Done
Development

Successfully merging a pull request may close this issue.

7 participants