Skip to content

Updating Dense#intoBitSet to properly set upTo if it exceeds bitset size #14922

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 19 commits into
base: main
Choose a base branch
from

Conversation

pmpailis
Copy link
Contributor

@pmpailis pmpailis commented Jul 8, 2025

Another take of #14882 that was reverted.

The issue seems to be that setting

int disiTo = Math.min(upTo, bitSet.length())

ignores offset, so it's possible that we could end up operating on less than the requested size (which caused test failures after the PR was merged).

cc @gf2121 @benwtrent

int destFrom = disi.doc - offset;
int disiTo = upTo - offset > bitSet.length() ? bitSet.length() : upTo;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an alternative to this could also be the original approach, i.e. update upper bound iff upTo == NO_MORE_DOCS

int disiTo = upTo == DocIdSetIterator.NO_MORE_DOCS ? bitSet.length() : upTo;

@pmpailis pmpailis marked this pull request as ready for review July 8, 2025 14:20
Copy link
Contributor

@gf2121 gf2121 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@benwtrent
Copy link
Member

OK, I ran all the tests I saw fail when the other fix was merged. This one didn't exhibit the same flakiness after 100s of runs. I will merge it tomorrow unless there are prevailing feelings otherwise.

@pmpailis
Copy link
Contributor Author

pmpailis commented Jul 9, 2025

I'm thinking a bit more about this and whether autocorrecting this might hide other potential issues. In practice if a caller provides an incorrect upTo or offset, we would try to autocorrect here and provide a best effort (i.e. not actually accounting for upTo bits, but instead just for bitSize). So even though we wouldn't be met with IOOB exceptions, there would most likely be other issues down the line. Maybe addressing this only when upTo == NO_MORE_DOCS , or directly to the source of the "problem" in AbstractKnnVectorQuery might be a more safe approach?

@gf2121
Copy link
Contributor

gf2121 commented Jul 10, 2025

i.e. not actually accounting for upTo bits, but instead just for bitSize

Good point, I checked BitsetIterator#intoBitset and we had similar logic there.

int actualUpto = Math.min(upTo, length);
// The destination bit set may be shorter than this bit set. This is only legal if all bits
// beyond offset + bitSet.length() are clear. If not, the below call to `super.intoBitSet`
// will throw an exception.
actualUpto = MathUtil.unsignedMin(actualUpto, offset + bitSet.length());
FixedBitSet.orRange(fixedBits, doc, bitSet, doc - offset, actualUpto - doc);
advance(actualUpto); // set the current doc

But we did not have the check when it comes to the default impl.

public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
assert offset <= docID() : "offset=" + offset + " docID()=" + docID() + " upTo=" + upTo;
for (int doc = docID(); doc < upTo; doc = nextDoc()) {
bitSet.set(doc - offset);
}
}

To me the contract here is that caller should guarantee there is no doc between offset + bitset.length() and upTo if offset + bitset.length() < upTo. Maybe we should clarify it in java doc, then there is no difference between current impl and check upTo == NO_MORE_DOCS only.

It would be great if @jpountz can double check this :)

@jpountz
Copy link
Contributor

jpountz commented Jul 10, 2025

To me the contract here is that caller should guarantee there is no doc between offset + bitset.length() and upTo if offset + bitset.length() < upTo. Maybe we should clarify it in java doc, then there is no difference between current impl and check upTo == NO_MORE_DOCS only.

The javadoc says that this method should behave like the default impl. So if there's a matching doc between offset+bitSet.length() and upTo, it should call bitSet#set on this doc, which should in-turn fail since the index is out of bounds. This is why BitSetIterator#intoBitSet falls back to super.intoBitSet for any matching doc between offset+bitSet.length() and upTo.

Maybe we can do something similar here? Use an optimized impl only up to min(upTo, bitSet.length() + offset) and then delegate to super.intoBitSet(upTo, bitSet, offset) for any remaining doc?

@benwtrent
Copy link
Member

So, this diff from main passes the new test added here and all others. Is this what you had in mind @gf2121 @jpountz

diff --cc lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
index 1d64be1a928,4850092ff01..00000000000
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
@@@ -473,8 -473,9 +473,10 @@@ public final class IndexedDISI extends

    @Override
    public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
-     assert doc >= offset;
-     while (doc < upTo && method.intoBitSetWithinBlock(this, upTo, bitSet, offset) == false) {
+     assert doc >= offset : "offset=" + offset + " doc=" + doc;
+     int disiUpto = Math.min(upTo, bitSet.length() + offset);
 -    while (doc < disiUpto && method.intoBitSetWithinBlock(this, disiUpto, bitSet, offset) == false) {
++    while (doc < disiUpto
++        && method.intoBitSetWithinBlock(this, disiUpto, bitSet, offset) == false) {
        readBlockHeader();
        boolean found = method.advanceWithinBlock(this, block);
        assert found;

@benwtrent benwtrent self-assigned this Jul 11, 2025
@jpountz
Copy link
Contributor

jpountz commented Jul 12, 2025

@benwtrent A problem with this patch is that it hides bugs on the caller side if they pass a upTo that is greater than offset + bitSet.length() and this iterator has matches in this range of doc IDs.

Looking at the stack trace of the failure, I think that I understand the problem now:

  • DocIdSetIterator#intoBitSet only operates on docs that match this iterator, so it's fine if upTo > offset + bitSet.length() as long as this iterator doesn't return docs in this range.
  • On dense blocks, DocIdSetIterator#intoBitSet delegates to FixedBitSet#orRange. This other API looks similar, except that it OR's all the bits, whether they are set or not. So it's a bug when the source range of bits to OR is greater than the destination range.

So we probably need to fix DENSE#intoBitSetWithinBlock to adjust the range of bits to copy based on the destination bit set, and then ensure that the source bit set has no set bit between the last OR'ed bit and upTo.

@pmpailis
Copy link
Contributor Author

So we probably need to fix DENSE#intoBitSetWithinBlock to adjust the range of bits to copy based on the destination bit set, and then ensure that the source bit set has no set bit between the last OR'ed bit and upTo.

So that would translate to properly setting upTo as in BitSetIterator, and also throw an exception if we (i) didn't exhaust destination bit set, and (ii) there are still any bits set in the (destTo, upTo] range in the source bitset?

Would something like the following (conceptually) capture what we need?

diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
index 1d64be1a92..78604322ff 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
@@ -473,7 +473,7 @@ public final class IndexedDISI extends AbstractDocIdSetIterator {
 
   @Override
   public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
-    assert doc >= offset;
+    assert doc >= offset : "offset=" + offset + " doc=" + doc;
     while (doc < upTo && method.intoBitSetWithinBlock(this, upTo, bitSet, offset) == false) {
       readBlockHeader();
       boolean found = method.advanceWithinBlock(this, block);
@@ -719,10 +719,10 @@ public final class IndexedDISI extends AbstractDocIdSetIterator {
         if (disi.bitSet == null) {
           disi.bitSet = new FixedBitSet(BLOCK_SIZE);
         }
-
-        int sourceFrom = disi.doc & 0xFFFF;
-        int sourceTo = Math.min(upTo - disi.block, BLOCK_SIZE);
         int destFrom = disi.doc - offset;
+        int destTo =  Math.min(upTo, offset + bitSet.length());
+        int sourceFrom = disi.doc & 0xFFFF;
+        int sourceTo = Math.min(destTo - disi.block, BLOCK_SIZE);
 
         long fp = disi.slice.getFilePointer();
         disi.slice.seek(fp - Long.BYTES); // seek back a long to include current word (disi.word).
@@ -730,14 +730,32 @@ public final class IndexedDISI extends AbstractDocIdSetIterator {
         disi.slice.readLongs(disi.bitSet.getBits(), disi.wordIndex, numWords);
         FixedBitSet.orRange(disi.bitSet, sourceFrom, bitSet, destFrom, sourceTo - sourceFrom);
 
+        int nextSourceBit = sourceTo + 1;
+        int sourceUpTo = Math.min(disi.bitSet.length(), upTo);
+        if(destTo < upTo && nextSourceBit < sourceUpTo && disi.bitSet.nextSetBit(nextSourceBit, sourceUpTo) != NO_MORE_DOCS) {
+          throw new IllegalStateException("There are bits set in the source bitset that are not accounted for."
+              + " sourceFrom="
+              + sourceFrom
+              + ", sourceTo="
+              + sourceTo
+              + ", destFrom="
+              + destFrom
+              + ", destTo="
+              + destTo
+              + ", offset="
+              + offset
+              + ", disi.bitSet.nextSetBit()="
+              + disi.bitSet.nextSetBit(nextSourceBit, Math.min(disi.bitSet.length(), upTo)));
+        }
+
         int blockEnd = disi.block | 0xFFFF;
-        if (upTo > blockEnd) {
+        if (destTo > blockEnd) {
           disi.slice.seek(disi.blockEnd);
           disi.index += disi.bitSet.cardinality(sourceFrom, sourceTo);
           return false;
         } else {
           disi.slice.seek(fp);
-          return advanceWithinBlock(disi, upTo);
+          return advanceWithinBlock(disi, destTo);
         }
       }

@jpountz
Copy link
Contributor

jpountz commented Jul 14, 2025

Yes, something like that. To properly meet the contract, I'd rather do advanceWithinBlock(destTo) first, and then fail if the current doc ID is less than upTo. This way, the current doc ID would be in [destTo, upTo) after the exception is thrown, as would be the case with the default implementation whose behavior we need to mimic.

@pmpailis
Copy link
Contributor Author

pmpailis commented Jul 15, 2025

Thanks @jpountz ! Added a take on this as per the suggestions. Let me know if the changes look ok now or if i should revisit this.

@github-actions github-actions bot added this to the 10.3.0 milestone Jul 15, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants