-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Updating Dense#intoBitSet to properly set upTo if it exceeds bitset size #14922
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…computation for disiTo to be more robust
int destFrom = disi.doc - offset; | ||
int disiTo = upTo - offset > bitSet.length() ? bitSet.length() : upTo; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an alternative to this could also be the original approach, i.e. update upper bound iff upTo == NO_MORE_DOCS
int disiTo = upTo == DocIdSetIterator.NO_MORE_DOCS ? bitSet.length() : upTo;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
OK, I ran all the tests I saw fail when the other fix was merged. This one didn't exhibit the same flakiness after 100s of runs. I will merge it tomorrow unless there are prevailing feelings otherwise. |
I'm thinking a bit more about this and whether autocorrecting this might hide other potential issues. In practice if a caller provides an incorrect |
Good point, I checked lucene/lucene/core/src/java/org/apache/lucene/util/BitSetIterator.java Lines 99 to 105 in 633bb1c
But we did not have the check when it comes to the default impl. lucene/lucene/core/src/java/org/apache/lucene/search/DocIdSetIterator.java Lines 205 to 210 in 633bb1c
To me the contract here is that caller should guarantee there is no doc between It would be great if @jpountz can double check this :) |
The javadoc says that this method should behave like the default impl. So if there's a matching doc between Maybe we can do something similar here? Use an optimized impl only up to |
So, this diff from main passes the new test added here and all others. Is this what you had in mind @gf2121 @jpountz diff --cc lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
index 1d64be1a928,4850092ff01..00000000000
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
@@@ -473,8 -473,9 +473,10 @@@ public final class IndexedDISI extends
@Override
public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
- assert doc >= offset;
- while (doc < upTo && method.intoBitSetWithinBlock(this, upTo, bitSet, offset) == false) {
+ assert doc >= offset : "offset=" + offset + " doc=" + doc;
+ int disiUpto = Math.min(upTo, bitSet.length() + offset);
- while (doc < disiUpto && method.intoBitSetWithinBlock(this, disiUpto, bitSet, offset) == false) {
++ while (doc < disiUpto
++ && method.intoBitSetWithinBlock(this, disiUpto, bitSet, offset) == false) {
readBlockHeader();
boolean found = method.advanceWithinBlock(this, block);
assert found; |
@benwtrent A problem with this patch is that it hides bugs on the caller side if they pass a Looking at the stack trace of the failure, I think that I understand the problem now:
So we probably need to fix |
So that would translate to properly setting Would something like the following (conceptually) capture what we need? diff --git a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
index 1d64be1a92..78604322ff 100644
--- a/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
+++ b/lucene/core/src/java/org/apache/lucene/codecs/lucene90/IndexedDISI.java
@@ -473,7 +473,7 @@ public final class IndexedDISI extends AbstractDocIdSetIterator {
@Override
public void intoBitSet(int upTo, FixedBitSet bitSet, int offset) throws IOException {
- assert doc >= offset;
+ assert doc >= offset : "offset=" + offset + " doc=" + doc;
while (doc < upTo && method.intoBitSetWithinBlock(this, upTo, bitSet, offset) == false) {
readBlockHeader();
boolean found = method.advanceWithinBlock(this, block);
@@ -719,10 +719,10 @@ public final class IndexedDISI extends AbstractDocIdSetIterator {
if (disi.bitSet == null) {
disi.bitSet = new FixedBitSet(BLOCK_SIZE);
}
-
- int sourceFrom = disi.doc & 0xFFFF;
- int sourceTo = Math.min(upTo - disi.block, BLOCK_SIZE);
int destFrom = disi.doc - offset;
+ int destTo = Math.min(upTo, offset + bitSet.length());
+ int sourceFrom = disi.doc & 0xFFFF;
+ int sourceTo = Math.min(destTo - disi.block, BLOCK_SIZE);
long fp = disi.slice.getFilePointer();
disi.slice.seek(fp - Long.BYTES); // seek back a long to include current word (disi.word).
@@ -730,14 +730,32 @@ public final class IndexedDISI extends AbstractDocIdSetIterator {
disi.slice.readLongs(disi.bitSet.getBits(), disi.wordIndex, numWords);
FixedBitSet.orRange(disi.bitSet, sourceFrom, bitSet, destFrom, sourceTo - sourceFrom);
+ int nextSourceBit = sourceTo + 1;
+ int sourceUpTo = Math.min(disi.bitSet.length(), upTo);
+ if(destTo < upTo && nextSourceBit < sourceUpTo && disi.bitSet.nextSetBit(nextSourceBit, sourceUpTo) != NO_MORE_DOCS) {
+ throw new IllegalStateException("There are bits set in the source bitset that are not accounted for."
+ + " sourceFrom="
+ + sourceFrom
+ + ", sourceTo="
+ + sourceTo
+ + ", destFrom="
+ + destFrom
+ + ", destTo="
+ + destTo
+ + ", offset="
+ + offset
+ + ", disi.bitSet.nextSetBit()="
+ + disi.bitSet.nextSetBit(nextSourceBit, Math.min(disi.bitSet.length(), upTo)));
+ }
+
int blockEnd = disi.block | 0xFFFF;
- if (upTo > blockEnd) {
+ if (destTo > blockEnd) {
disi.slice.seek(disi.blockEnd);
disi.index += disi.bitSet.cardinality(sourceFrom, sourceTo);
return false;
} else {
disi.slice.seek(fp);
- return advanceWithinBlock(disi, upTo);
+ return advanceWithinBlock(disi, destTo);
}
} |
Yes, something like that. To properly meet the contract, I'd rather do |
…e that all bits are properly accounted for
…ene into fix_for_dense_disi_or_range
Thanks @jpountz ! Added a take on this as per the suggestions. Let me know if the changes look ok now or if i should revisit this. |
…ene into fix_for_dense_disi_or_range
Another take of #14882 that was reverted.
The issue seems to be that setting
ignores
offset
, so it's possible that we could end up operating on less than the requested size (which caused test failures after the PR was merged).cc @gf2121 @benwtrent