-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Allow docvalues-only search on number types #82409
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pinging @elastic/es-search (Team:Search) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a first pass through, looks good to me for the most part. We should think about allowing an escape hatch via expensive query checks so that users can ensure they don't accidentally start running very slow queries.
@Override | ||
public Query termQuery(Object value, SearchExecutionContext context) { | ||
failIfNotIndexed(); | ||
return type.termQuery(name(), value); | ||
failIfNotIndexedNorDocValues(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think we should have an expensive query check here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Works for me. Added check in bae5818
@@ -446,6 +453,13 @@ protected final void failIfNotIndexed() { | |||
} | |||
} | |||
|
|||
protected final void failIfNotIndexedNorDocValues() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe failIfNotSearchable
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've renamed this (because of expensive queries check), but prefer the more concrete name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair enough, and I like how the two checks are combined, very neat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, the changes look great to me.
Relates to #52728 .
Do you plan on adding documentation too?
One thing that I would like to discuss is how we treat these doc_value queries as expensive ( I left a comment on that) and whether it's time to work on #48058 too to complete the picture. It seems like the allow expensive queries flag is no longer enough.
Thanks for working on this!
@@ -218,6 +218,8 @@ tasks.named("yamlRestTestV7CompatTransform").configure { task -> | |||
// sync_id is no longer available in SegmentInfos.userData // "indices.flush/10_basic/Index synced flush rest test" | |||
task.replaceIsTrue("indices.testing.shards.0.0.commit.user_data.sync_id", "indices.testing.shards.0.0.commit.user_data") | |||
|
|||
// we can now search using doc values only | |||
task.replaceValueInMatch("fields.object\\.nested1.long.searchable", true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remind me what this does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This task runs the yaml test suite from ES 7.x against ES 8. By supporting searches on doc-values-only fields, we now have a "breaking" change as the test from 7.x returns different results, namely this field now appears as "searchable" in 8. This is ok though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, thanks for explaining.
x-pack/plugin/rollup/src/main/java/org/elasticsearch/xpack/rollup/v2/RollupShardIndexer.java
Outdated
Show resolved
Hide resolved
@@ -446,6 +456,22 @@ protected final void failIfNotIndexed() { | |||
} | |||
} | |||
|
|||
protected final void failIfNotIndexedNorDocValuesFallback(SearchExecutionContext context) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one thing throws me off a bit here: the fact that a field is searchable or not effectively depends on whether expensive queries are allowed if only doc_values are present. For doc_value only fields though field_caps will always return true, yet may return an error when the field is searched?
On the other hand this is how expensive queries worked until now, even if the field is searchable they may throw error, but I am not sure we ever did this for basic queries like a term query. I believe this could be the first time that a field is effectively not searchable if expensive queries are disallowed, because no query can be performed against it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same applies to term/range/etc. queries on runtime fields while those are searchable as well, so this isn't something new. I also think that that's ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
true, how could I forget about runtime fields? :) I also don't see field_caps output being based on whether expensive queries are allowed or not, though this is something to keep in mind as we introduce more expensive queries/fields.
/** | ||
* Returns true if the field is indexed. | ||
*/ | ||
public final boolean isIndexed() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, the distinction between searchable and indexed is something that we removed a while ago, if I remember correctly, but we ended up using isSearchable in many places where we really meant isIndexed instead and I found it confusing (even if the two meant the same until now). I find it much clearer now with this distinction, and I double checked that you fixed all the places that need to call isIndexed. Great!
@@ -266,9 +266,13 @@ public Float parse(XContentParser parser, boolean coerce) throws IOException { | |||
} | |||
|
|||
@Override | |||
public Query termQuery(String field, Object value) { | |||
public Query termQuery(String field, Object value, boolean forceDocValues) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I am not entirely happy with the naming of the new parameter. I think forceDocValues
fits better for range queries where we use IndexOrDocValuesQuery, but here it's more about either one or the other. I also struggle slightly with always providing isIndexed == false
to it. Shall we rather call it isIndexed for termQuery
? I find that that would express better the fact that the index has the precedence if present.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, fixed in 4b273fd
@@ -289,7 +293,8 @@ public Query rangeQuery( | |||
boolean includeLower, | |||
boolean includeUpper, | |||
boolean hasDocValues, | |||
SearchExecutionContext context | |||
SearchExecutionContext context, | |||
boolean forceDocValues |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I can see why the additional argument is called forceDocValues here, because we explicitly want to query only doc_values, but here too maybe isIndexed would also express that if the field is indexed we query that or doc_values, while if it's not indexed we can only query doc_values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, fixed in 4b273fd
Not as part of this PR yet. For the core feature, I don't think this needs extra documentation. In the context of archive data (#81210), I want to add some docs once these pieces are integrated (i.e. mappings are automatically transferred over).
As noted on the comment, this PR keeps things in line with how we have handled that flag previously. I had a look at #48058 and it seems to the UX/UI part isn't figured out (which requires cross-team collab), so it feels too early to kick off actual development work on our side. |
@elasticmachine run elasticsearch-ci/part-2 (known test failure: #82420) |
On docs, I think there are bits that need to be updated besides what you mentioned around archive data. See https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-index.html where we say that fields that are not indexed are not queryable. That is still true in general but not for numeric types. Also on https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html#number-params we say that whether a number is searchable depends on whether it's indexed or not. I would probably also update the bit on doc_values for numeric fields to mention that queries can exercise them in some cases, when they are not indexed, or when range queries are executed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the docs changes, they look good to me. Last bit: should we also add some more info on the fact that doc_value queries are considered expensive where they are mentioned, and in query-dsl where the flag is mentioned? I see we have a (probably outdated) list of things that are considered expensive, I wonder if it's worth updating that.
@@ -17,6 +17,12 @@ makes this data access pattern possible. They store the same values as the | |||
sorting and aggregations. Doc values are supported on almost all field types, | |||
with the __notable exception of `text` and `annotated_text` fields__. | |||
|
|||
Numeric types, such as `long` and `double`, can also be queried when they are |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may I ask why you mention only long and double?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wasn't meant to be an exhaustive list. I've copied this phrasing from a different part of the docs. I've added a link on "Numeric types" now so that someone interested can look up all the types this applies to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
I've pushed some more doc changes in f9efb59.
Not sure what exactly what you want to see added. I mentioned everywhere that these queries are slow, and explained about the tradeoffs. |
What I meant is clarifying that the expensive queries flag affect these queries which may not be obvious. For instance for runtime fields we mention that queries on them are deemed expensive and can be disabled with such flag. I am fine with adding doc_value queries to the list of things that are blocked with the flag though, like you did. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks @romseygeek and @javanna! |
Similar to #82409, but for date fields. Allows searching on date field types (date, date_nanos) when those fields are not indexed (index: false) but just doc values are enabled. This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents. Relates #81210 and #52728
I added the |
Similar to #82409, but for geo_point fields. Allows searching on geo_point fields when those fields are not indexed (index: false) but just doc values are enabled. Also adds distance feature query support for date fields (bringing date field to feature parity with runtime fields) This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents. Relates #81210 and #52728
Marks PR #82409 as a notable release highlight in its changelog YAML file. This ensures the highlight displays in the [Elastic Upgrade Guide](https://www.elastic.co/guide/en/elastic-stack/8.1/elasticsearch-highlights.html). Relates to #84041
if (isIndexed()) { | ||
return type.termsQuery(name(), values); | ||
} else { | ||
return super.termsQuery(values, context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would DocValuesNumbersQuery be a better solution if there are so many terms?
Allows searching on number field types (long, short, int, float, double, byte, scaled_float, half_float) when those fields are not indexed (index: false) but just doc values are enabled.
This enables searches on archive data, which has access to doc values but not index structures. When combined with searchable snapshots, it allows downloading only data for a given (doc value) field to quickly filter down to a select set of documents.
Note to reviewers:
I have split
isSearchable
into two separate methodsisIndexed
andisSearchable
onMappedFieldType
. The former one is about whether actual indexing data structures have been used (postings or points), and the latter one on whether you can run queries on the given field (e.g. used by field caps). For number field types, queries are now allowed whenever points are available or when doc values are available (i.e. searchability is expanded).Relates #81210 and #52728