[Approximation Framework] Extending approximation to top-level term queries on numeric fields #18679
Conversation
I'm waiting on #18530 to get merged before rebasing and refactoring my tests to integrate with the parameterized versions in that PR.
❌ Gradle check result for e8e55c4: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for ce6db1c: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Codecov Report ❌ Patch coverage is …

Additional details and impacted files:

```
@@             Coverage Diff              @@
##               main   #18679      +/-   ##
============================================
- Coverage     72.86%   72.72%    -0.15%
+ Complexity    68971    68820      -151
============================================
  Files          5601     5601
  Lines        316435   316465       +30
  Branches      45880    45887        +7
============================================
- Hits         230586   230155      -431
- Misses        67218    67656      +438
- Partials     18631    18654       +23
```

☔ View full report in Codecov by Sentry.
`ApproximateTermQuery` wrapping looks clean. Two concerns:
- Does this maintain correct scoring semantics for numeric term queries?
- How does approximation error propagate in compound queries containing these terms?
So I'm not actually implementing `ApproximateTermQuery` (it would introduce unnecessary overhead). Instead, I'm simply replacing Lucene's `newExactQuery` (internally a `PointRangeQuery` whose upper and lower bounds encode the same value) with an equivalent `ApproximatePointRangeQuery` in `NumberFieldMapper`; a sketch follows below.
- Scoring should not be impacted for term queries on numeric fields, since they are constant-score queries. Also, a `PointRangeQuery` and an `ApproximatePointRangeQuery` are identical scoring-wise.
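As a rough sketch (not the PR's literal code), the substitution looks something like this. The seven-argument constructor shape is copied from the diff later in this thread; the 10,000 default size and the `ApproximateScoreQuery` wrapper (visible in the `ScaledFloatFieldMapper` test snippet below) are assumptions about the surrounding plumbing:

```java
// `field`, `value`, and `format` come from the enclosing mapper context.
byte[] point = new byte[Long.BYTES];
LongPoint.encodeDimension(value, point, 0);                   // encode the single term value
Query exact = LongPoint.newExactQuery(field, value);          // a PointRangeQuery with lower == upper
Query approx = new ApproximatePointRangeQuery(field, point, point, 1, 10_000, null, format);
Query termQuery = new ApproximateScoreQuery(exact, approx);   // falls back to `exact` when approximation can't apply
```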
Makes sense.
- Approximation is not currently applied to any clause in a boolean query, or in any case where it isn't a top-level query. Result inaccuracy should not be an issue here.
This is an interesting point. Is there an architectural limitation due to which this is not possible? If `NumberFieldMapper.termQuery()` returns an `ApproximatePointRangeQuery`, why wouldn't it work inside a bool query? The query rewriting happens at the leaf level.
IMO this is a limitation that needs to be fixed for wider real-world applicability.
Parameterized tests are good, but here are some edge cases to consider:
- What happens with numeric overflow scenarios?
- Are negative numbers handled correctly in the approximation?
- Need explicit test for approximation vs exact result validation.
I believe existing tests already validate approximate vs exact results; see `testApproximateVsExactQuery` in `ApproximatePointRangeQueryTests`. Edge cases like numeric overflows and negative numbers are handled appropriately in `ApproximatePointRangeQuery`.

Nice integration with the existing approximation framework. Consider:
- Should this be configurable per query? Some use cases may require exact numeric matching.
- Memory overhead of maintaining both approximate and exact structures?
Approximation is more like a drop-in replacement for regular queries, since it doesn't impact the correctness of queries. In my opinion, if we don't see a regression in query performance, it makes sense to make this the default. There is a very minimal memory overhead for approximate queries, since both the original query and the approximate version are stored until execution, but this is far outweighed by the performance benefit.
Approximation has correctness bounds: users are trading off accuracy for speed. This should be made clear and established by benchmarks.
Benchmark methodology is unclear: what dataset size/cardinality? We need to validate that performance doesn't regress for small result sets, where the approximation overhead might dominate.
Yeah, that's my bad; I wasn't super clear in my comment. I ran this benchmark on the http_logs dataset, which is relatively large (~31 GB). That's a valid concern, though: `ApproximatePointRangeQuery` actually does contain a small optimization where it falls back to `pointRangeQueryWeight.scorerSupplier(context)` if `size > values.size()`, meaning that for small datasets (fewer than 10k points), approximation is not utilized.
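A sketch of that fallback, using the names quoted above (the shape is assumed, not the literal code):

```java
// Inside the scorer supplier: if the requested hit count exceeds the total
// number of indexed points, approximation cannot save any work, so delegate
// to the exact PointRangeQuery's scorer.
if (size > values.size()) {
    return pointRangeQueryWeight.scorerSupplier(context);
}
```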
http_logs is timestamp-dependent. Consider benchmarking on datasets with other patterns:
- Random integer distributions
- Zipfian access patterns
- Actually measure the approximation error rates
```java
    lastValue = randomValue;
}
// Add DocValues field using the last value (required for sorting)
numericType.addDocValuesField(doc, numericType.fieldName, lastValue);
```
DocValues field is random and unrelated to the point values. If sorting is used anywhere in the test chain, this creates inconsistent behavior. Is this intended?
Hmm interesting, not sure why I was using DocValues there. Yeah, we should just be dealing with regular point values.
The tests in ApproximatePointRangeQueryTests index to both BKD and doc_values:

```java
numericType.addField(doc, numericType.fieldName, i);
numericType.addDocValuesField(doc, numericType.fieldName, i);
```

This is because in `Sort sort = new Sort(new SortField(numericType.getSortFieldName(), numericType.getSortFieldType()));`, the SortField type does not directly support half_float and unsigned_long. So to validate the results (i.e., that the same docs are returned) for half_float and unsigned_long, we compare against the sort run on doc_values (instead of BKD). If we run sorts on the same field indexed in BKD, we get this error:

```
java.lang.IllegalArgumentException: Field half_float_field is indexed with 2 bytes per dimension, but org.apache.lucene.search.comparators.LongComparator@505e3f7e expected 8
```
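Put together, the validation looks roughly like this (a sketch assembled from the snippets above; `searcher`, `exactQuery`, and `approxQuery` are assumed test fixtures):

```java
// Sort on the doc_values field (the BKD comparators don't support
// half_float / unsigned_long), then check that both queries return
// the same docs in the same order.
Sort sort = new Sort(new SortField(numericType.getSortFieldName(), numericType.getSortFieldType()));
TopDocs exactDocs = searcher.search(exactQuery, 10, sort);
TopDocs approxDocs = searcher.search(approxQuery, 10, sort);
for (int i = 0; i < exactDocs.scoreDocs.length; i++) {
    assertEquals(exactDocs.scoreDocs[i].doc, approxDocs.scoreDocs[i].doc);
}
```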
Yeah, I agree with you: approximation within bool queries is definitely a big gap in our current framework. Here are the reasons we haven't done so already:
So "approximation", in terms of our current implementation, is actually a bit of a misnomer: we don't make any tradeoffs in accuracy for speedup. See this issue and PR.
Will do, thanks for the suggestions.
As explained above, there is no error rate.
```java
    );
}

// If there are no matching documents, skip the sort tests
```
Do we need to test this? If we index the right range of docs, do we ever end up in this scenario?
Yeah, I think that scenario never occurs in the tests.
@sawansri can you please check if we have this query shape in our workloads (https://github.com/opensearch-project/opensearch-benchmark-workloads/)? Then we can run the benchmarks against this PR.
```java
Query dvQuery = SortedNumericDocValuesField.newSlowExactQuery("scaled_float", scaledValue);
Query query = new IndexOrDocValuesQuery(LongPoint.newExactQuery("scaled_float", scaledValue), dvQuery);
assertEquals(query, ft.termQuery(value, null));
Query expectedQuery = new ApproximateScoreQuery(
```
So we don't change anything in `ScaledFloatFieldMapper`, because its termQuery uses `NumberFieldMapper.NumberType.LONG.termQuery`?
Yup, `ScaledFloatFieldMapper` implements a termQuery that uses `NumberFieldMapper.NumberType.LONG.termQuery()`, which would fall under approximation.
I don't see numeric term queries in any of the workloads. I'll probably have to run some benchmarks locally.
Thanks, and this is not a blocker for this PR; maybe we could consider adding it. Here is a sample PR, opensearch-project/opensearch-benchmark-workloads#674, which I created in the past to onboard new queries that are improved with approximation.
Ran some more benchmarks on Big5 (a more balanced dataset):

[Benchmark table: Query, Before Approximation, After Approximation]

Looks like we are seeing a slight latency drop. This makes sense, since we expect to see the most benefit from approximation on datasets where 10k hits can be reached early in the BKD traversal (very common for range queries).
```java
public Scorer get(long leadCost) throws IOException {
    intersectRight(values.getPointTree(), visitor, docCount);
    // Force term queries with desc sort through intersectLeft
    if (Arrays.equals(pointRangeQuery.getLowerPoint(), pointRangeQuery.getUpperPoint())) {
```
How does it work with this?

```sh
curl -X POST "http://localhost:9200/logs-*/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "term": {
      "status": {
        "value": 200
      }
    }
  },
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}
' | jq '.'
```
Today we can run a term query with a sort; I want to make sure the results do not break.
This is not a good change
> How does it work with this? (the curl example above)
This change doesn't affect term queries where the primary sort field is different from the term query's field. In ApproximatePointRangeQuery's canApproximate, this statement gets executed:

```java
if (!primarySortField.fieldName().equals(pointRangeQuery.getField())) {
    return false;
}
```
Since there is a single unique point value in a numeric term query and the tiebreaker ends up being the docID, using intersectLeft yields the same top-k results as intersectRight.

I'm also fine with disabling approximation for this specific case (term query with descending sort), due to performance concerns (the queue has more comparisons to make), until the intersectRight logic can be modified to return accurate hits for these kinds of queries; a sketch of that guard follows below.
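That guard would be a small addition to `canApproximate`, roughly (a hypothetical sketch building on the check quoted earlier, not committed code):

```java
// Hypothetical: skip approximation when a single-point (term) query is
// combined with a descending primary sort, until intersectRight handles it.
boolean singlePoint = Arrays.equals(pointRangeQuery.getLowerPoint(), pointRangeQuery.getUpperPoint());
if (singlePoint && primarySortField.getReverse()) {
    return false; // take the exact PointRangeQuery path instead
}
```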
If this is risky, can we leave out this change and create an issue for it? I'm happy to merge in incremental changes, given there's enough test coverage.
Yeah, that's what I was thinking as well. For the time being, we could disable approximation and send desc sorts on top level term queries through Lucene while this edge case is being sorted out.
```diff
- ApproximatePointRangeQuery approxQuery = new ApproximatePointRangeQuery(field, upperBytes, lowerBytes, dims, size, null, format);
+ ApproximatePointRangeQuery approxQuery = new ApproximatePointRangeQuery(field, lowerBytes, upperBytes, dims, size, null, format);
```
Good find, but now I'm worried: why didn't the tests fail previously?
Hmmm, that's a good point. If the bounds are swapped, the query should be rewritten into a match no docs query. Would need to look into why the tests weren't failing previously.
```java
try (RandomIndexWriter iw = new RandomIndexWriter(random(), directory, new WhitespaceAnalyzer())) {
    int dims = 1;
    int numDocs = 15000;
    int targetValue = 1520;
```
Should we use random values instead, to get coverage for both positive and negative values? Or a selected set of targets, e.g. negative, 0, positive, max, min? Maintainers, what's the general guideline?
Good idea. I can probably expand the test to change numDocs and targetValue to use random values and include negative values, 0, max, min, etc.; see the sketch below.
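For instance, the hard-coded values could be drawn from a boundary-heavy candidate set; a hypothetical sketch using the OpenSearch test helpers `randomFrom` and `randomIntBetween`:

```java
// Cover negative, zero, positive, and boundary targets instead of one fixed value.
int targetValue = randomFrom(Integer.MIN_VALUE, -1520, -1, 0, 1, 1520, Integer.MAX_VALUE);
// Keep the doc count above the 10k approximation threshold mentioned earlier.
int numDocs = randomIntBetween(12_000, 20_000);
```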
Just a follow-up on this PR, @sawansri: can you please check if you can limit the changes to ApproximatePointRangeQuery and have the top-level term queries use the approximation path?
Hey @prudhvigodithi, just to confirm: this would mean disabling approximation on top-level term queries with descending sort, while letting term queries without a sort or with an asc sort go through approximation. If this is the case, we would no longer need to force desc sort through intersectLeft.
I see. Can the existing intersectRight work for term queries when desc sort is used? Did I miss any thread?
This PR is stalled because it has been open for 30 days with no activity. |
Description
Enables approximation in top-level term queries on numeric fields by wrapping the `termQuery` implementations in `NumberFieldMapper` with `ApproximatePointRangeQuery` instead of Lucene's `newExactQuery`.

Related Issues
Resolves #18620
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.