Skip to content

Add search support for handling deleted documents#21840

Open
RS146BIJAY wants to merge 1 commit into
opensearch-project:mainfrom
RS146BIJAY:updates-search
Open

Add search support for handling deleted documents#21840
RS146BIJAY wants to merge 1 commit into
opensearch-project:mainfrom
RS146BIJAY:updates-search

Conversation

@RS146BIJAY
Copy link
Copy Markdown
Contributor

@RS146BIJAY RS146BIJAY commented May 26, 2026

Description

For a Composite format where one of the format is Lucene (or lucene only format), deleted documents are tracked by Lucene's liveDocs bitmap. When a query's filter contains at least one predicate naturally served by Lucene, that predicate's Weight.scorer(...).iterator() already excludes deletes — so deleted docs are filtered out.

When every predicate is Parquet-native (DataFusion drives the scan, no Lucene delegation occurs), there is no Lucene-produced bitset to AND with the row-group survivor mask. Parquet has no concept of soft-delete, so deleted documents leak into query results.

The fix: inject a synthetic Lucene-bound MATCHALL() annotation under a top-level AND so at least one Lucene Weight.scorer runs per row group, producing a live-docs bitset that gets ANDed into the survivor mask.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 26, 2026

PR Reviewer Guide 🔍

(Review updated until commit 383db8c)

Here are some key observations to aid the review process:

🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ Recommended focus areas for review

Incorrect Comparison

Line 238 uses liveDocs.get(docId) == true which is redundant and less idiomatic. While functionally correct, the explicit == true comparison is unnecessary for boolean expressions. More critically, if liveDocs.get(docId) ever returns a boxed Boolean that is null (though unlikely with Bits interface), this could throw a NullPointerException, whereas a direct boolean check would not.

if (liveDocs == null || liveDocs.get(docId) == true) {
Logic Error

Line 180 checks if any leaf has viable.size() == 1 && LUCENE_BACKEND_NAME.equals(viable.get(0)) and returns early if true. This appears inverted: the method should inject MATCHALL when NO leaf is Lucene-eligible, but this guard returns early when a leaf IS Lucene-only. The correct logic should check if all leaves are DataFusion-only (not Lucene), but this condition exits when finding a Lucene-only leaf, which contradicts the stated goal in Guard 2's comment.

for (AnnotatedPredicate leaf : leaves) {
    List<String> viable = leaf.getViableBackends();
    if (viable.size() == 1 && LUCENE_BACKEND_NAME.equals(viable.get(0))) {
        return annotatedCondition;
    }
}

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 26, 2026

PR Code Suggestions ✨

Latest suggestions up to 383db8c
Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
General
Remove redundant boolean comparison

Remove the explicit == true comparison when checking boolean values. In Java,
liveDocs.get(docId) already returns a boolean, making the comparison redundant and
less idiomatic.

sandbox/plugins/analytics-backend-lucene/src/main/java/org/opensearch/be/lucene/LuceneFilterDelegationHandle.java [238-240]

-if (liveDocs == null || liveDocs.get(docId) == true) {
+if (liveDocs == null || liveDocs.get(docId)) {
     bits.set(docId - minDoc);
 }
Suggestion importance[1-10]: 5

__

Why: The suggestion correctly identifies redundant == true comparison. While the code is functionally correct, removing the explicit comparison improves code style and readability, making it more idiomatic Java.

Low

Previous suggestions

Suggestions up to commit 2b454d0
CategorySuggestion                                                                                                                                    Impact
Possible issue
Fix inverted guard logic condition

The logic incorrectly returns early when finding a Lucene-eligible leaf. The guard
should verify that ALL leaves are DataFusion-only before injecting the MATCHALL
clause. If any leaf is Lucene-eligible, skip injection. The current condition
inverts this logic.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/planner/rules/OpenSearchFilterRule.java [178-183]

 for (AnnotatedPredicate leaf : leaves) {
     List<String> viable = leaf.getViableBackends();
-    if (viable.size() == 1 && LUCENE_BACKEND_NAME.equals(viable.get(0))) {
+    if (!(viable.size() == 1 && "datafusion".equals(viable.get(0)))) {
         return annotatedCondition;
     }
 }
Suggestion importance[1-10]: 10

__

Why: The current logic is critically flawed. The guard should skip injection if ANY leaf is Lucene-eligible, but instead returns early when finding a Lucene-eligible leaf. The comment states "every annotated leaf...has viableBackends == [datafusion]" but the code checks for LUCENE_BACKEND_NAME. This is a major bug that breaks the intended live-docs filtering logic.

High
Remove incorrect empty leaves guard

When leaves is empty (no predicates), the method should still inject the MATCHALL
clause if Lucene is configured, since there's no natural Lucene predicate to filter
deleted documents. Returning early here bypasses necessary live-docs filtering for
queries without user predicates.

sandbox/plugins/analytics-engine/src/main/java/org/opensearch/analytics/planner/rules/OpenSearchFilterRule.java [173-177]

 List<AnnotatedPredicate> leaves = new ArrayList<>();
 collectAnnotatedPredicates(annotatedCondition, leaves);
-if (leaves.isEmpty()) {
-    return annotatedCondition;
-}
Suggestion importance[1-10]: 9

__

Why: When leaves is empty (no user predicates), the MATCHALL clause should still be injected for Lucene-configured shards to ensure live-docs filtering. The early return bypasses this critical filtering mechanism for queries without predicates, which is a significant correctness issue.

High

Signed-off-by: RS146BIJAY <rishavsagar4b1@gmail.com>
@github-actions
Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit 383db8c

@RS146BIJAY RS146BIJAY marked this pull request as ready for review May 26, 2026 23:04
@RS146BIJAY RS146BIJAY requested a review from a team as a code owner May 26, 2026 23:04
@github-actions
Copy link
Copy Markdown
Contributor

✅ Gradle check result for 383db8c: SUCCESS

@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.43%. Comparing base (b944005) to head (383db8c).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #21840      +/-   ##
============================================
+ Coverage     73.33%   73.43%   +0.09%     
- Complexity    75329    75420      +91     
============================================
  Files          6032     6032              
  Lines        342355   342389      +34     
  Branches      49229    49234       +5     
============================================
+ Hits         251078   251436     +358     
+ Misses        71327    70959     -368     
- Partials      19950    19994      +44     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant