Skip to content

Add segment end criteria check for SVForwardIndex and Dictionary#15120

Merged
Jackie-Jiang merged 7 commits intoapache:masterfrom
lnbest0707:upstream_fork_4G
May 5, 2025
Merged

Add segment end criteria check for SVForwardIndex and Dictionary#15120
Jackie-Jiang merged 7 commits intoapache:masterfrom
lnbest0707:upstream_fork_4G

Conversation

@lnbest0707
Copy link
Contributor

enhancement ingestion
Inspired by #14479
Apart from the MVForwardIndex check added in that PR, this PR adds similar check for SVForwardIndex and Dictionary. The PR only adds the interface and UTs.
The actual check policy is not within this PR and could be an open question to discuss. The end criteria needs to consider

  • ForwardIndex would end in a 4GB limit once converted into immutable segment.
  • Dictionary has its cardinality limit

While it is tricky to make it correct as:

  • 4GB limit is a compressed size, during ingestion into mutable segment, it cannot predict the correct compression ratio
  • With optimizeDictionary enabled, Dictionary encoding could also end in immutable forward index. It would be even harder to guess
    - If Dictionary would be converted to ForwardIndex
    - If converted, what would be the final compressed size

Some proposed policies:

  • Make the uncompressed size limit configurable, relying on that value to set the mutable segment's threshold.
  • Use the last segment's compression ratio as reference to predict current segment's.
  • For dictionary, use the same policy (but maybe a larger number by config) as the forward index.

@codecov-commenter
Copy link

codecov-commenter commented Feb 24, 2025

Codecov Report

Attention: Patch coverage is 86.66667% with 6 lines in your changes missing coverage. Please review.

Project coverage is 62.86%. Comparing base (f99e9fb) to head (2198391).
Report is 21 commits behind head on master.

Files with missing lines Patch % Lines
...local/indexsegment/mutable/MutableSegmentImpl.java 88.63% 1 Missing and 4 partials ⚠️
...t/segment/spi/index/mutable/MutableDictionary.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #15120      +/-   ##
============================================
+ Coverage     62.78%   62.86%   +0.07%     
- Complexity     1382     1384       +2     
============================================
  Files          2863     2866       +3     
  Lines        162780   163287     +507     
  Branches      24915    24936      +21     
============================================
+ Hits         102203   102648     +445     
- Misses        52871    52922      +51     
- Partials       7706     7717      +11     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 62.81% <86.66%> (+0.06%) ⬆️
java-21 62.84% <86.66%> (+0.07%) ⬆️
skip-bytebuffers-false 62.84% <86.66%> (+0.06%) ⬆️
skip-bytebuffers-true 62.82% <86.66%> (+0.09%) ⬆️
temurin 62.86% <86.66%> (+0.07%) ⬆️
unittests 62.85% <86.66%> (+0.07%) ⬆️
unittests1 55.87% <20.00%> (+0.13%) ⬆️
unittests2 33.51% <86.66%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@Jackie-Jiang
Copy link
Contributor

@noob-se7en @KKcorps Can you help take a look?

Copy link
Contributor

@noob-se7en noob-se7en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

lnbest0707 and others added 3 commits April 24, 2025 13:37
Co-authored-by: NOOB <43700604+noob-se7en@users.noreply.github.com>
@lnbest0707
Copy link
Contributor Author

@noob-se7en resolved you comments, could you pls review again? Thanks

// the valid doc ids won't be updated.
private final ThreadSafeMutableRoaringBitmap _validDocIds;
private final ThreadSafeMutableRoaringBitmap _queryableDocIds;
private boolean _indexCapacityThresholdBreached;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this threshold check for all indexes/dictionaries in one segment?

Suggest to rename to _indexThresholdReached and also add javadoc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is one for all check. The PR is only extracting the variable to an instance field. The name was unchanged. But I am open to rename.

@chenboat
Copy link
Contributor

Can you comment or add documentation on this PR what is the current behavior if the size threshold is reached?

Copy link
Contributor

@noob-se7en noob-se7en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Thank you!
@Jackie-Jiang @KKcorps please help with merging the PR.

@Jackie-Jiang Jackie-Jiang merged commit 85c3d04 into apache:master May 5, 2025
21 of 22 checks passed
@lnbest0707 lnbest0707 deleted the upstream_fork_4G branch May 5, 2025 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants