Skip to content

Refactor doc values to expose a DocIdSetIterator instead of extending DocIdSetIterator. #14475

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Apr 10, 2025

This should help reduce polymorphism of DocIdSetIterator as doc values would then only expose 2 classes as DocIdSetIterators: dense iterators and IndexedDISI for the sparse case.

This change almost only touches implementations of doc values, a separate change will be needed to fix call sites of docValues#nextDoc() and similar DocIdSetIterator APIs to go through docValues#iterator() instead.

Relates #14450

…ng `DocIdSetIterator`.

This should help reduce polymorphism of `DocIdSetIterator` as doc values would
then only expose 2 classes as `DocIdSetIterator`s: dense iterators and
`IndexedDISI` for the sparse case.

This change almost only touches implementations of doc values, a separate
change would be needed to fix call sites of `docValues#nextDoc()` and similar
`DocIdSetIterator` APIs to go throu `docValues#iterator()` instead.

Relates apache#14450
@gf2121
Copy link
Contributor

gf2121 commented Apr 11, 2025

I like the idea!

Think out loud: advanceExact is a rather frequently used method in past experience of using docvalues. I wonder if we can make it also included in the iterator by exposing a sub interface of DocIdSetIterator? It also helps cluster all operations of docs in the exposed iterator.

@jpountz
Copy link
Contributor Author

jpountz commented Apr 11, 2025

Indeed advanceExact is ofter used from within collectors. It proved often faster than calling advance and checking if it returns the target doc ID.

I agree that it is not great the way it is. I was actually thinking of going one step further compared with your suggestion and adding advanceExact directly to DocIdSetIterator, changing its contract to no longer guarantee that docID() returns the target if advanceExact returns false, and adding a default implementation that calls advance.

The benefit of adding advanceExact directly to DocIdSetIterator is that then we no longer need to duplicate (and increase polymorphism) impls such as the one returned by DocIdSetIterator#all for both the case when it supports advanceExact and the case when it doesn't.

@gf2121
Copy link
Contributor

gf2121 commented Apr 14, 2025

The benefit of adding advanceExact directly to DocIdSetIterator is that then we no longer need to duplicate (and increase polymorphism) impls such as the one returned by DocIdSetIterator#all for both the case when it supports advanceExact and the case when it doesn't.

+1 for this approach, thanks for sharing and explanation!

@gf2121
Copy link
Contributor

gf2121 commented Apr 18, 2025

It occurs to me that logic like:

DocValuesIterator iter = DocValuesIterator.of(1, 2, 4);
boolean exist = iter.advanceExact(3);
assert exist == false;
iter.intoBitSet(upto, bitset, offset);

now seems problematic, because the default implementation of intoBitSet starts with a docID() that doesn't actually exist?

+1 again to change the contract, let's make docID() always return a existed doc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants