- 
                Notifications
    You must be signed in to change notification settings 
- Fork 2.3k
Add HashSet based filtering optimization to XContentMapValues #17160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add HashSet based filtering optimization to XContentMapValues #17160
Conversation
This optimization enhances document filtering when field names are simple (no dots or wildcards in field names and no dots in document keys). In such cases, it uses a HashSet-based implementation instead of automaton matching to prevent TooComplexToDeterminizeException when processing documents with numerous long field names. Changes: - Add HashSet optimization for simple field names - Split filter implementation into set-based and automaton-based - Add helper methods to check field name patterns Signed-off-by: hye-on <ain0103@naver.com>
| Codecov ReportAttention: Patch coverage is  
 
 Additional details and impacted files@@             Coverage Diff              @@
##               main   #17160      +/-   ##
============================================
+ Coverage     72.40%   72.47%   +0.06%     
- Complexity    65554    65585      +31     
============================================
  Files          5292     5292              
  Lines        304493   304517      +24     
  Branches      44218    44224       +6     
============================================
+ Hits         220463   220689     +226     
+ Misses        65975    65760     -215     
- Partials      18055    18068      +13     ☔ View full report in Codecov by Sentry. | 
        
          
                server/src/main/java/org/opensearch/common/xcontent/support/XContentMapValues.java
          
            Show resolved
            Hide resolved
        
      …dots Signed-off-by: hye-on <ain0103@naver.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot @hye-on ! This looks great.
I checked the code coverage report and the new code is well-covered the existing tests, which is nice.
We've missed the code freeze date for 2.19, so I think this will ship with 3.0, which we're currently planning as the next release.
Can you please add an entry to the CHANGELOG-3.0.md file with an end-user-friendly description of the change? Maybe something like "Use simpler matching logic for source fields when explicit field names (no wildcards or dot-paths) are specified".
Signed-off-by: hye-on <ain0103@naver.com>
| @msfroh Sorry for the late response! Thanks for the suggested changelog entry—I’ve added it! | 
| ❌ Gradle check result for dd9749e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? | 
Signed-off-by: Michael Froh <froh@amazon.com>
Signed-off-by: Michael Froh <froh@amazon.com>
| ❌ Gradle check result for 833835e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? | 
| Thanks a lot for fixing this @hye-on ! | 
Description
This optimization enhances document filtering when field names are simple (no dots or wildcards in field names and no dots in document keys). In such cases, it uses a HashSet-based implementation instead of automaton matching to prevent TooComplexToDeterminizeException when processing documents with numerous long field names.
Changes:
Reason for Checking Dots in Document Keys
Since dots in field names are treated as sub-objects (e.g., if a document contains
a.bas a property andais an include, thena.bwill be kept in the filtered map as per existing JavaDoc below), we check document keys for dots to determine whether to use HashSet-based filtering or fall back to automaton matching.JavaDoc
Note on Testing
Thank you for taking the time to review this PR.
This change focuses on internal optimization, and I believe the existing tests already sufficiently cover the functional scenarios, so I did not add new tests. I considered including tests to verify the implementation selection (HashSet vs. automaton), but I felt such tests might be too closely tied to implementation details, potentially making them fragile. If you believe additional tests are necessary, I would greatly appreciate your feedback.
Comment
I made every effort to ensure the changes align with the style and intent of the existing codebase. If you notice any areas, even minor ones, that could benefit from revision, I would be happy to incorporate your suggestions.
Thank you again for your time and guidance! 😀
Related Issues
Resolves #17114
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.