Skip to content

Swap out some simple PriorityQueue subclasses for one using a Comparator #14705

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Jun 5, 2025

Conversation

thecoop
Copy link
Contributor

@thecoop thecoop commented May 23, 2025

Related to #11338 (comment), this replaces some simple uses of PriorityQueue subclasses with ones using a Comparator. This reduces the boilerplate required, the number of classes defined, and moves the definition of the PriorityQueue close to where it is used, so the relevant code is more self-contained and easier to understand.

private void writeCompoundFile(
IndexOutput entries, IndexOutput data, Directory dir, SegmentInfo si) throws IOException {
// write number of files
int numFiles = si.files().size();
entries.writeVInt(numFiles);
// first put files in ascending size order so small files fit more likely into one page
SizedFileQueue pq = new SizedFileQueue(numFiles);
List<SizedFile> files = new ArrayList<>(numFiles);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one doesn't need to use a PriorityQueue at all

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorting should also be faster here. The Lucene PQ is only needed when items in queue should fall out at botton when its full. In other cases the ln-overhaed is larger than a simple sorting at end.

Copy link

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

1 similar comment
Copy link

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

Missed a comparator
@thecoop thecoop force-pushed the priority-queue-comparators branch from 1f07b03 to f53ae10 Compare May 23, 2025 16:43
Copy link

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

@thecoop
Copy link
Contributor Author

thecoop commented May 23, 2025

Next stage is to run some perf tests to check this doesn't introduce a slowdown due to changes to the virtual method calls in key hotspots.

Copy link
Contributor

@uschindler uschindler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. Why not make PQ final and pass a comparator or maybe a simpler functional interface directly in ctor. This would reduce the number of subclasses more and we only have a single PQ which is final on top.

@uschindler
Copy link
Contributor

This reduces the boilerplate required, the number of classes defined,...

But it does not reduce the number of loaded classes. The lambdas are still hidden classes.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like it, it looks more like idiomatic Java. Can you check if OrdinalMapBenchmark reports any slowdown? This is the only benchmark that could be impacted by this change that I can think of.

It's slightly awkward that a PriorityQueue can be defined either using a comparator or overriding lessThan, let's only support providing a comparator as Uwe suggested?

Copy link

github-actions bot commented Jun 3, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog-check label to it and you will stop receiving this reminder on future updates to the PR.

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I feel tempted to make PriorityQueue final and then move the comparator to a final field. Why bother having both ways of defining it?

@thecoop
Copy link
Contributor Author

thecoop commented Jun 3, 2025

This is only the first pass, covering single-line comparators. There are several non-trivial implementations that will require more detailed refactoring (if that is indeed desirable) - using a Comparator requires differentiating equality, which lessThan does not require. There's also some that are part of the public API. I have a subsequent PR covering multi-line comparators, and then several to follow for non-trivial implementations.

@dweiss
Copy link
Contributor

dweiss commented Jun 3, 2025

Good point, thanks.

@dweiss
Copy link
Contributor

dweiss commented Jun 3, 2025

We could also declare a custom interface with just lessThan and then provide an adapter in usingComparator... this would dodge the need for implementing full equality checks across the board and you could still use jdk's Comparator utilities, where things are simple.

@thecoop
Copy link
Contributor Author

thecoop commented Jun 3, 2025

That's a good idea - I'll look at that when I'm moving onto the more complex implementations

@uschindler
Copy link
Contributor

We should really only have one final PQ implementation to make sure most of the code can be inlined.

@uschindler
Copy link
Contributor

Basically the lessThan interface should be marked @FunctionalInterface and wherever possible the lessTahn check should be a lambda or method ref.

@thecoop
Copy link
Contributor Author

thecoop commented Jun 3, 2025

That's the desired end state, yes. There's still plenty of implementations to get through before we can make it final.

@thecoop thecoop requested a review from dweiss June 5, 2025 10:07
Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to see what the benchmarks show after this change. These pq's are sort of crucial to performance. I wonder if we can observe any change after subclasses are replaced by delegation.

In general, I'm fine to making baby steps although it could also be developed into a larger patch that adds this additional "smaller-than" interface and makes the PQ final. I wonder what others think.

@thecoop
Copy link
Contributor Author

thecoop commented Jun 5, 2025

I wasn't able to run OrdinalMapBenchmark, but a few lucene-bench runs show no large changes - several percent +/- either way

@jpountz
Copy link
Contributor

jpountz commented Jun 5, 2025

For reference, there is an interesting PR at #14714 that replaces PriorityQueue with a LongHeap when computing top-k hits by score. Results suggest that the PriorityQueue is not a bottleneck at all when k=100, but is a bottleneck for our fastest queries when k=1000. This PR doesn't change computing top-k hits by field, so tasks TermMonthSort, TermTitleSort, TermDayOfYearSort or TermDTSort would still use a PriorityQueue to track top hits.

@jpountz
Copy link
Contributor

jpountz commented Jun 5, 2025

I wasn't able to run OrdinalMapBenchmark

I ran it on my machine. Got the following results on main (2 runs):

id: 651.43124 msec
name: 955.36439 msec
country_code: 0.37753 msec
time_zone: 1.03004 msec
id: 605.82902 msec
name: 923.61305 msec
country_code: 0.30129 msec
time_zone: 0.87078 msec

And the following results on your branch (2 runs as well):

id: 641.80320 msec
name: 945.66611 msec
country_code: 0.31938 msec
time_zone: 0.89374 msec
id: 631.90609 msec
name: 970.12461 msec
country_code: 0.32877 msec
time_zone: 0.91463 msec

So it looks fine to me as far as this benchmark is concerned.

@jpountz
Copy link
Contributor

jpountz commented Jun 5, 2025

In general, I'm fine to making baby steps although it could also be developed into a larger patch that adds this additional "smaller-than" interface and makes the PQ final. I wonder what others think.

My intuition was that this "smaller-than" interface idea you brought up would be relatively easy to introduce to all our PriorityQueue impls, so it would be nice if we could skip the intermediate state where there are two ways to define how heap entries are ordered. But maybe I'm underestimating the effort.

@thecoop
Copy link
Contributor Author

thecoop commented Jun 5, 2025

There are some non-trivial subclasses (TopDocs.MergeSortQueue, MultiTermsEnum.TermMergeQueue, NearSpansUnordered.SpanTotalLengthEndPositionWindow), some queues that themselves are subclassed (FieldValueHitQueue, TopOrdAndNumberQueue), and some part of the public API (HitQueue, FieldValueHitQueue, SuggestWordQueue). This PR handles the trivial cases, I was going to create subsequent PRs to cover the more complex refactorings so they can be considered separately, especially those that change the public API.

Copy link
Contributor

@dweiss dweiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fine with merging, given the explanation. Please add a note to changes and migration? Seems like something worth mentioning.

@jpountz
Copy link
Contributor

jpountz commented Jun 5, 2025

I'm fine with merging too.

@github-actions github-actions bot added this to the 11.0.0 milestone Jun 5, 2025
@dweiss dweiss merged commit 9afcfdb into apache:main Jun 5, 2025
7 checks passed
@dweiss
Copy link
Contributor

dweiss commented Jun 5, 2025

Merged. Thank you, @thecoop

@thecoop thecoop deleted the priority-queue-comparators branch June 5, 2025 15:35
@uschindler
Copy link
Contributor

Thanks! Cool first step!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants