Skip to content

Decrease minimum deletes percentage in TMP #14893

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

stefanvodita
Copy link
Contributor

@stefanvodita stefanvodita commented Jul 2, 2025

Description

This will allow users to set lower deletes percentages on their merge policies, which can be used to create indexes that are more efficient to search.

Previous discussion: #11761

@@ -139,9 +139,9 @@ public double getMaxMergedSegmentMB() {
* amplification factor will lead to higher CPU and I/O activity as indicated above.
*/
public TieredMergePolicy setDeletesPctAllowed(double v) {
if (v < 5 || v > 50) {
if (v <= 0 || v > 50) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like that we still enforce > 0. It makes it more apparent that you are approaching a singular pathological case as you go to 0.001 or whatever

Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's OK to take the training wheels off here, but, we should at least add a stern warning. Values below 5% are really quite insane -- O(N^2) merging cost, where the future cost of merging is O(N^2) when the initial indexing cost was N.

@@ -130,7 +130,7 @@ public double getMaxMergedSegmentMB() {
/**
* Sets the maximum percentage of doc id space taken by deleted docs. The denominator includes
* both active and deleted documents. Lower values make the index more space efficient at the
* expense of increased CPU and I/O activity. Values must be between 5 and 50. Default value is
* expense of increased CPU and I/O activity. Values must be between 0 and 50. Default value is
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we enhance this javadoc to note the dangers of very low (< 5%) target deletions? Something like:

Values below 5% can lead to exceptionally high merge cost where indexing will continuously
merge nearly all segments, and select newly merged segments immediately for merging again,
often forcing degenerate merge selection like singleton merges.  If you venture into this dark
forest, consider limiting the maximum number of concurrent merges and threads (link to
ConcurrentMergeScheduler's setMaxMergesAndThreads) as a coarse attempt to bound the
otherwise pathological indexing behavior.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the suggestion @mikemccand! I added the paragraph almost exactly as you wrote it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants