
Conversation

Contributor

@sawansri sawansri commented Jul 3, 2025

Description

Enables approximation in top-level term queries on numeric fields by having the termQuery implementations in NumberFieldMapper build an ApproximatePointRangeQuery instead of Lucene's newExactQuery.

Related Issues

Resolves #18620

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added the enhancement and Search:Performance labels Jul 3, 2025
Contributor Author

sawansri commented Jul 3, 2025

I'm waiting on #18530 to get merged before rebasing and refactoring my tests to integrate with the parameterized versions in ApproximatePointRangeQueryTests.

@sawansri sawansri changed the title [Approximation Framework] Enabling approximation on top-level term queries on numeric fields [Approximation Framework] Extending approximation to top-level term queries on numeric fields Jul 3, 2025
Contributor

github-actions bot commented Jul 3, 2025

❌ Gradle check result for e8e55c4: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Contributor

github-actions bot commented Jul 3, 2025

❌ Gradle check result for ce6db1c: FAILURE


Contributor

github-actions bot commented Jul 4, 2025

✅ Gradle check result for 6582d5e: SUCCESS


codecov bot commented Jul 4, 2025

Codecov Report

❌ Patch coverage is 64.17910% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.72%. Comparing base (9b79f19) to head (5fce0a8).
⚠️ Report is 245 commits behind head on main.

Files with missing lines Patch % Lines
...org/opensearch/index/mapper/NumberFieldMapper.java 62.50% 12 Missing and 12 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #18679      +/-   ##
============================================
- Coverage     72.86%   72.72%   -0.15%     
+ Complexity    68971    68820     -151     
============================================
  Files          5601     5601              
  Lines        316435   316465      +30     
  Branches      45880    45887       +7     
============================================
- Hits         230586   230155     -431     
- Misses        67218    67656     +438     
- Partials      18631    18654      +23     

Contributor

github-actions bot commented Jul 6, 2025

✅ Gradle check result for 067933a: SUCCESS

Contributor

@atris atris left a comment

ApproximateTermQuery wrapping looks clean. Two concerns:

  1. Does this maintain correct scoring semantics for numeric term queries?
  2. How does approximation error propagate in compound queries containing these terms?

So I'm not actually implementing ApproximateTermQuery (it would introduce unnecessary overhead). Instead, I'm simply replacing Lucene's newExactQuery (internally a PointRangeQuery whose lower and upper bounds are the same value) with an equivalent ApproximatePointRangeQuery in NumberFieldMapper.

  1. Scoring should not be impacted for term queries on numeric fields, since they are constant-score queries. Also, a PointRangeQuery and an ApproximatePointRangeQuery are identical scoring-wise.
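To make the equivalence concrete, here is a toy sketch (plain Python, illustrative only, not OpenSearch code): a numeric term query is just a degenerate range whose lower and upper bounds are the same value, so an early-terminating range scan collects the same matches as an exact scan for that single point, as long as the hit budget isn't exceeded.

```python
# Toy model: documents are (doc_id, value) pairs; not the Lucene/BKD data structures.

def exact_term(points, value):
    # Exact term match on a numeric field.
    return [doc for doc, v in points if v == value]

def approximate_range(points, lower, upper, size):
    # Early-terminating range scan: visit points in value order and stop
    # once `size` hits are collected (the approximation idea in miniature).
    hits = []
    for doc, v in sorted(points, key=lambda p: (p[1], p[0])):
        if lower <= v <= upper:
            hits.append(doc)
            if len(hits) >= size:
                break
    return hits

points = [(0, 5), (1, 7), (2, 5), (3, 9), (4, 5)]
# A term query for 5 is the degenerate range [5, 5]; with a large enough
# hit budget the approximate scan returns exactly the exact-match set.
assert sorted(approximate_range(points, 5, 5, 10)) == sorted(exact_term(points, 5))
```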

Makes sense.

  1. Approximation is not currently applied for any clause in a boolean query or any case where it isn't a top level query. Result inaccuracy should not be an issue here.

This is an interesting point. Is there an architectural limitation due to which this is not possible? If NumberFieldMapper.termQuery() returns ApproximatePointRangeQuery, why wouldn't it work inside a bool query? The query rewriting happens at the leaf level.

IMO this is a limitation that needs to be fixed for wider real-world applicability.

Parameterized tests are good but some edge cases to consider:

  • What happens with numeric overflow scenarios?
  • Are negative numbers handled correctly in the approximation?
  • Need explicit test for approximation vs exact result validation.

I believe existing tests already validate approximate vs exact results. See testApproximateVsExactQuery in ApproximatePointRangeQueryTests. Edge cases like numeric overflows and negative numbers are handled appropriately in ApproximatePointRangeQuery.

Nice integration with existing approximation framework. Consider:

  • Should this be configurable per query? Some use cases may require exact numeric matching.
  • Memory overhead of maintaining both approximate and exact structures?

Approximation is more of a drop-in replacement for regular queries, since it doesn't impact the correctness of queries. In my opinion, if we don't see regression in query performance, it makes sense to make this the default. There is very minimal memory overhead for approximate queries, since both the original query and the approximate version are stored until execution, but this is far outweighed by the performance benefit.

Approximation has correctness bounds: users are trading off accuracy for speed. This should be made clear and should be established by benchmarks.

Benchmark methodology is unclear: what dataset size/cardinality? Need to validate that performance doesn't regress for small result sets, where approximation overhead might dominate.

Yeah, that's my bad; I wasn't super clear in my comment. I ran this benchmark on the http_logs dataset, which is relatively large (~31 GB). That's a valid concern. ApproximatePointRangeQuery actually does contain a small optimization where it falls back to pointRangeQueryWeight.scorerSupplier(context) if size > values.size(), meaning that for small datasets (fewer than 10k points), approximation is not utilized.
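As a sketch of that fallback (toy Python, not the actual Lucene/OpenSearch code; the function name and return values are made up for illustration), the decision boils down to a size comparison:

```python
def choose_scorer(requested_size, total_points):
    # Mirrors the check described above (`size > values.size()`): if the
    # query needs more hits than the segment has points, early termination
    # cannot save any work, so take the exact PointRangeQuery path.
    if requested_size > total_points:
        return "exact"
    return "approximate"

# A segment with fewer points than the 10k hit budget is scanned exactly.
assert choose_scorer(10_000, 5_000) == "exact"
# A large segment can terminate early, so approximation kicks in.
assert choose_scorer(10_000, 1_000_000) == "approximate"
```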

http_logs is timestamp dependent. Consider benchmarking on datasets with patterns:

  • Random integer distributions
  • Zipfian access patterns
  • Actually measure the approximation error rates

lastValue = randomValue;
}
// Add DocValues field using the last value (required for sorting)
numericType.addDocValuesField(doc, numericType.fieldName, lastValue);
Contributor

DocValues field is random and unrelated to the point values. If sorting is used anywhere in the test chain, this creates inconsistent behavior. Is this intended?

Contributor Author

Hmm interesting, not sure why I was using DocValues there. Yeah, we should just be dealing with regular point values.

Member

The tests in ApproximatePointRangeQueryTests index to both BKD and doc_values.

numericType.addField(doc, numericType.fieldName, i);
numericType.addDocValuesField(doc, numericType.fieldName, i);

This is because of the sort definition Sort sort = new Sort(new SortField(numericType.getSortFieldName(), numericType.getSortFieldType())); the SortField type does not directly support half_float and unsigned_long.

So to validate the results (that the same docs are returned) for half_float and unsigned_long, we compare against the sort run on doc_values (instead of BKD). If we run sorts on the same field as indexed in BKD, we get this error:

java.lang.IllegalArgumentException: Field half_float_field is indexed with 2 bytes per dimension, but org.apache.lucene.search.comparators.LongComparator@505e3f7e expected 8

Contributor Author

sawansri commented Aug 4, 2025

This is an interesting point. Is there an architectural limitation due to which this is not possible? If NumberFieldMapper.termQuery() returns ApproximatePointRangeQuery, why wouldn't it work inside a bool query? The query rewriting happens at the leaf level.

IMO this is a limitation that needs to be fixed for wider real-world applicability.

Yeah I agree with you, approximation within bool queries is definitely a big gap in our current framework. Here are the reasons we haven't done so already:

  1. Conjunction Queries (FILTER and MUST) - All clauses must be constant-score for us to even consider applying approximation (if we terminate at 10k hits, there is no guarantee that a later hit in the conjunction would score higher). Additionally, even if all clauses are constant-scoring, there is no guarantee that their conjunction will return 10k hits, which is why I proposed this RFC.

  2. Disjunction Queries (SHOULD and MUST_NOT) - Even if clauses are constant-scoring, their scores are summed, essentially making the whole query a scoring one. We run into the same problem approximating here as with conjunction queries that have scoring clauses.
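A toy demonstration of the conjunction problem in point 1 (illustrative Python only, not OpenSearch code): truncating one clause at its first k matches can drop documents that the full clause would have contributed to the intersection.

```python
def clause_a(docs):
    # First clause: matches even doc ids.
    return [d for d in docs if d % 2 == 0]

def clause_b(docs):
    # Second clause: matches doc ids >= 90.
    return [d for d in docs if d >= 90]

docs = list(range(100))

# Exact conjunction: intersect the full match sets of both clauses.
exact = set(clause_a(docs)) & set(clause_b(docs))

# "Approximated" conjunction: clause A terminated after its first 10 hits
# (doc ids 0..18), none of which survive the intersection with clause B.
truncated = set(clause_a(docs)[:10]) & set(clause_b(docs))

# Early termination of a single clause changed the conjunction's results.
assert truncated != exact
```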

Approximation has correctness bounds - users are trading off accuracy for speed. It should be made clear and should be established by benchmarks.

So "approximation" in terms of our current implementation is actually a bit of a misnomer. We don't make any tradeoffs in accuracy for speedup. See this issue and PR.

http_logs is timestamp dependent. Consider benchmarking on datasets with patterns:

  • Random integer distributions
  • Zipfian access patterns

Will do, thanks for the suggestions.

  • Actually measure the approximation error rates

As explained above, there is no error rate.

);
}

// If there are no matching documents, skip the sort tests
Member

Do we need to test this? If we index the right range of docs, do we ever end up in this scenario?

Contributor Author

Yeah, I think that scenario never occurs in the tests.

Member

prudhvigodithi commented Aug 4, 2025

@sawansri can you please check if we have this query shape in our workloads https://github.com/opensearch-project/opensearch-benchmark-workloads/? We can run the benchmarks against this PR.

Query dvQuery = SortedNumericDocValuesField.newSlowExactQuery("scaled_float", scaledValue);
Query query = new IndexOrDocValuesQuery(LongPoint.newExactQuery("scaled_float", scaledValue), dvQuery);
assertEquals(query, ft.termQuery(value, null));
Query expectedQuery = new ApproximateScoreQuery(
Member

So we don't change anything in ScaledFloatFieldMapper, because its termQuery uses NumberFieldMapper.NumberType.LONG.termQuery?

Contributor Author

Yup, ScaledFloatFieldMapper implements a termQuery that uses NumberFieldMapper.NumberType.LONG.termQuery() which would fall under approximation.

Contributor Author

sawansri commented Aug 4, 2025

@sawansri can you please check if we have this query shape in our workloads opensearch-project/opensearch-benchmark-workloads? We can run the benchmarks against this PR.

I don't see numeric term queries in any of the workloads. I'll probably have to run some benchmarks locally.

Member

prudhvigodithi commented Aug 4, 2025

I don't see numeric term queries in any of the workloads. I'll probably have to run some benchmarks locally.

Thanks, and this is not a blocker for this PR; maybe we could consider adding it. Here is a sample PR, opensearch-project/opensearch-benchmark-workloads#674, which I created in the past to onboard new queries that are improved with approximation.

Contributor Author

sawansri commented Aug 4, 2025

Ran some more benchmarks on Big5 (a more balanced dataset):

Before Approximation

| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | term-numeric | 1 | ops/s |
| Mean Throughput | term-numeric | 1.01 | ops/s |
| Median Throughput | term-numeric | 1.01 | ops/s |
| Max Throughput | term-numeric | 1.01 | ops/s |
| 50th percentile latency | term-numeric | 8.66894 | ms |
| 90th percentile latency | term-numeric | 9.52163 | ms |
| 99th percentile latency | term-numeric | 10.4394 | ms |
| 100th percentile latency | term-numeric | 11.7775 | ms |
| 50th percentile service time | term-numeric | 6.93358 | ms |
| 90th percentile service time | term-numeric | 7.49459 | ms |
| 99th percentile service time | term-numeric | 8.30141 | ms |
| 100th percentile service time | term-numeric | 9.75295 | ms |
| error rate | term-numeric | 0 | % |

After Approximation

| Metric | Task | Value | Unit |
|---|---|---|---|
| Min Throughput | term-numeric | 1 | ops/s |
| Mean Throughput | term-numeric | 1.01 | ops/s |
| Median Throughput | term-numeric | 1.01 | ops/s |
| Max Throughput | term-numeric | 1.01 | ops/s |
| 50th percentile latency | term-numeric | 8.53285 | ms |
| 90th percentile latency | term-numeric | 9.39101 | ms |
| 99th percentile latency | term-numeric | 10.4966 | ms |
| 100th percentile latency | term-numeric | 10.4971 | ms |
| 50th percentile service time | term-numeric | 6.77809 | ms |
| 90th percentile service time | term-numeric | 7.36977 | ms |
| 99th percentile service time | term-numeric | 8.40853 | ms |
| 100th percentile service time | term-numeric | 8.61667 | ms |
| error rate | term-numeric | 0 | % |

Query:

curl -X POST "http://localhost:9200/big5/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "term": {
      "metrics.size": {
        "value": 1500
      }
    }
  }
}
' | jq '.'

Looks like we are seeing a slight latency drop. This makes sense since we expect to see the most benefit out of approximation on datasets where 10k hits can be reached early on in the BKD traversal (very common for range queries).

sawansri added 10 commits August 5, 2025 17:53
Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
…RangeQuery

Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
…on structure

Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
…lderTests

Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
Signed-off-by: Sawan Srivastava <sawan1210@gmail.com>
Member

Coming from issue #18334 ("Consider extending ApproximatePointRangeQuery to term queries. For low-cardinality fields, this could yield better optimization results."), there was a request to extend this to term queries; adding @kkewwei for some feedback.

public Scorer get(long leadCost) throws IOException {
intersectRight(values.getPointTree(), visitor, docCount);
// Force term queries with desc sort through intersectLeft
if (Arrays.equals(pointRangeQuery.getLowerPoint(), pointRangeQuery.getUpperPoint())) {
Member

How does it work with the following?

curl -X POST "http://localhost:9200/logs-*/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "term": {
      "status": {
        "value": 200
      }
    }
  },
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}
' | jq '.'

Member

@prudhvigodithi prudhvigodithi Aug 5, 2025

Today we can run term with sort; I want to make sure the results do not break.

Contributor

This is not a good change

Contributor Author

How does it work with the following?

curl -X POST "http://localhost:9200/logs-*/_search" -H "Content-Type: application/json" -d '
{
  "query": {
    "term": {
      "status": {
        "value": 200
      }
    }
  },
  "sort": [
    {
      "@timestamp": {
        "order": "desc"
      }
    }
  ]
}
' | jq '.'

This change doesn't affect term queries where the primary sort field is different from the term query's field. In ApproximatePointRangeQuery's canApproximate, this statement gets executed.

if (!primarySortField.fieldName().equals(pointRangeQuery.getField())) {
    return false;
}

Contributor Author

Since there is a singular unique point value in a numeric term query and the tiebreaker ends up being the docID, using intersectLeft results in the same top-k results as intersectRight.

I'm also fine with disabling approximation for this specific case (term query with descending sort) due to performance concerns (the queue has more comparisons to make) until the intersectRight logic can be modified to return accurate hits for these kinds of queries.
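A toy check of the single-value claim above (plain Python, illustrative only, not Lucene code): when every matching document carries the same point value, ascending and descending traversal orders collapse to the doc-id tiebreak and yield the same top-k.

```python
# All matches share one point value, as in a numeric term query.
matches = [(doc_id, 42) for doc_id in range(20)]

# Ascending-value order (intersectLeft-style) vs descending-value order
# (intersectRight-style); doc id breaks ties in both cases.
top_left = sorted(matches, key=lambda m: (m[1], m[0]))[:10]
top_right = sorted(matches, key=lambda m: (-m[1], m[0]))[:10]

# With a single unique value, both traversals return identical top-k hits.
assert top_left == top_right
```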

Contributor

If this is risky, can we leave out this change and create an issue for it? I'm fine with merging incremental changes, given there's enough test coverage.

Contributor Author

Yeah, that's what I was thinking as well. For the time being, we could disable approximation and send desc sorts on top level term queries through Lucene while this edge case is being sorted out.

Contributor

github-actions bot commented Aug 5, 2025

✅ Gradle check result for 5fce0a8: SUCCESS

}

ApproximatePointRangeQuery approxQuery = new ApproximatePointRangeQuery(field, upperBytes, lowerBytes, dims, size, null, format);
ApproximatePointRangeQuery approxQuery = new ApproximatePointRangeQuery(field, lowerBytes, upperBytes, dims, size, null, format);
Contributor

Good find, but now I'm worried: why didn't tests fail previously?

Contributor Author

Hmmm, that's a good point. If the bounds are swapped, the query should be rewritten into a match-no-docs query. I'd need to look into why the tests weren't failing previously.
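As a sanity check on that expectation (toy Python, not Lucene code): a range whose lower bound exceeds its upper bound can never match anything, which is why a correct rewrite should produce a match-no-docs query.

```python
def range_matches(points, lower, upper):
    # Inclusive numeric range filter over (doc_id, value) pairs.
    return [doc for doc, v in points if lower <= v <= upper]

points = [(0, 1), (1, 5), (2, 9)]
assert range_matches(points, 1, 9) == [0, 1, 2]  # normal bounds
assert range_matches(points, 9, 1) == []         # swapped bounds match nothing
```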


try (RandomIndexWriter iw = new RandomIndexWriter(random(), directory, new WhitespaceAnalyzer())) {
int dims = 1;
int numDocs = 15000;
int targetValue = 1520;
Contributor

Should we use random values instead, to get coverage for both positive and negative values? Or a selected set of targets, e.g. negative, 0, positive, max, min? Maintainers, what's the general guideline?

Contributor Author

Good idea, I can probably expand the test to change numDocs and targetValue to utilize random and include negative values, 0, max, min, etc.

Member

Just a follow-up on this PR @sawansri: can you please check if you can limit the changes to ApproximatePointRangeQuery and have the top-level term queries use the approximation path?
Thanks

Contributor Author

sawansri commented Sep 5, 2025

Just a follow-up on this PR @sawansri: can you please check if you can limit the changes to ApproximatePointRangeQuery and have the top-level term queries use the approximation path? Thanks

Hey @prudhvigodithi, just to confirm: this would mean disabling approximation on top-level term queries with descending sort, while letting term queries without sort or with asc sort go through approximation. If this is the case, we would no longer need to force desc sort through intersectLeft.

Member

Hey @prudhvigodithi, just to confirm: this would mean disabling approximation on top-level term queries with descending sort, while letting term queries without sort or with asc sort go through approximation.

I see; can the existing intersectRight work for term queries when desc sort is used? Did I miss any thread?

Contributor

This PR is stalled because it has been open for 30 days with no activity.

@opensearch-trigger-bot opensearch-trigger-bot bot added and then removed the stalled label Oct 5, 2025

Labels

enhancement, lucene, Search:Performance

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

[Approximation Framework] Extend approximation to top-level term queries on numeric fields

5 participants