Add search support for handling deleted documents #21840
Signed-off-by: RS146BIJAY <rishavsagar4b1@gmail.com>
2b454d0 to 383db8c
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##               main   #21840     +/-   ##
============================================
+ Coverage     73.33%   73.43%   +0.09%
- Complexity    75329    75420      +91
============================================
  Files          6032     6032
  Lines        342355   342389      +34
  Branches      49229    49234       +5
============================================
+ Hits         251078   251436     +358
+ Misses        71327    70959     -368
- Partials      19950    19994      +44
Description
In a composite format where one of the formats is Lucene (or in a Lucene-only format), deleted documents are tracked by Lucene's `liveDocs` bitmap. When a query's filter contains at least one predicate naturally served by Lucene, that predicate's `Weight.scorer(...).iterator()` already excludes deletes, so deleted docs are filtered out.

When every predicate is Parquet-native (DataFusion drives the scan and no Lucene delegation occurs), there is no Lucene-produced bitset to AND with the row-group survivor mask. Parquet has no concept of soft-delete, so deleted documents leak into query results.
The fix: inject a synthetic Lucene-bound `MATCHALL()` annotation under a top-level AND so that at least one Lucene `Weight.scorer` runs per row group, producing a live-docs bitset that gets ANDed into the survivor mask.
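
To make the masking step concrete, here is a minimal sketch of the idea, not the code in this PR. It uses plain `java.util.BitSet` in place of Lucene and DataFusion types, and every name in it (`LiveDocsMaskSketch`, `Backend`, `needsSyntheticMatchAll`, `applyLiveDocs`) is hypothetical and chosen only for illustration. It assumes one bit per row in a row group, with a set bit in the live-docs bitset meaning "not deleted".

```java
import java.util.BitSet;
import java.util.List;

// Hypothetical illustration only; none of these names come from the PR.
final class LiveDocsMaskSketch {

    // Which engine would naturally serve a given filter predicate.
    enum Backend { LUCENE, PARQUET }

    // True when no predicate is Lucene-bound. This is the case in which a
    // synthetic match-all would be injected so that a live-docs bitset exists.
    static boolean needsSyntheticMatchAll(List<Backend> predicates) {
        return !predicates.contains(Backend.LUCENE);
    }

    // AND the row-group survivor mask with the live-docs bitset.
    // Inputs are left unmodified; a new BitSet is returned.
    static BitSet applyLiveDocs(BitSet survivors, BitSet liveDocs) {
        BitSet result = (BitSet) survivors.clone();
        result.and(liveDocs);
        return result;
    }

    public static void main(String[] args) {
        // Rows 0..5 pass the Parquet-native predicates.
        BitSet survivors = new BitSet(8);
        survivors.set(0, 6);

        // Rows 2 and 5 have been deleted; all other rows are live.
        BitSet liveDocs = new BitSet(8);
        liveDocs.set(0, 8);
        liveDocs.clear(2);
        liveDocs.clear(5);

        List<Backend> predicates = List.of(Backend.PARQUET, Backend.PARQUET);
        if (needsSyntheticMatchAll(predicates)) {
            // Without this AND, deleted rows 2 and 5 would leak into results.
            survivors = applyLiveDocs(survivors, liveDocs);
        }
        System.out.println(survivors); // prints {0, 1, 3, 4}
    }
}
```

In the real change the live-docs bitset would come from the Lucene scorer that the injected match-all triggers, rather than being built by hand as it is here.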