-
Notifications
You must be signed in to change notification settings - Fork 28.5k
[WIP] [SPARK-47547] BloomFilter fpp degradation #50933
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
[WIP] [SPARK-47547] BloomFilter fpp degradation #50933
Conversation
…errors in scala suite
…of the combined hash
} | ||
} | ||
|
||
long mightContainEven = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please rename these 2 in this test case to clarify that these are actually indices of numbers in a randomly generated stream.
optimalNumOfBits / Byte.SIZE / 1024 / 1024 | ||
); | ||
Assumptions.assumeTrue( | ||
2 * optimalNumOfBits / Byte.SIZE < 4 * ONE_GB, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess 4 * ONE_GB
is a reasoable limit, can we extract it to a constant and add some comment to it.
…eckstyle errors, renaming test vars
…ward compatible with previously serialized streams
"mightContainLong must return true for all inserted numbers" | ||
); | ||
|
||
double actualFpp = (double) mightContainOddIndexed / numItems; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ numItems
doesn't seem correct here as you don't test numItems
number of numbers that were surely not added into the filter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
indeed, it should probably be very close to the proper value, but this calculation doesn't account for the odd indexes ignored based on the secondary's result.
let me try to address that somehow.
8edf4dd
to
57298f0
Compare
… in random test + test formatting
57298f0
to
f589e2c
Compare
Can you please post the output of the new |
the tests with the 4GB limit are still running, I'll post a summary from the results tomorrow, and start a new run that can cover all of the 5G element count cases. |
What changes were proposed in this pull request?
This change fixes a performance degradation issue in the current BloomFilter implementation.
The current bit index calculation logic does not use any part of the indexable space above the first 31bits, so when the inserted item count approaches (or exceeds) Integer.MAX_VALUE, it will produce significantly worse collision rates than an (ideal) uniformly distributing hash function.
Why are the changes needed?
This should qualify as a bug.
The upper bound on the bit capacity of the current BloomFilter implementation in spark is approx 137G bits (64 bit longs in an Integer.MAX_VALUE sized array). The current indexing scheme can only address about 2G bits of these.
On the other hand, due to the way the BloomFilters are used, the bug won't cause any logical errors, it will gradually render the BloomFilter instance useless by forcing more-and-more queries on the slow path.
Does this PR introduce any user-facing change?
No
How was this patch tested?
new test
One new java testclass was added to
sketch
to test different combinations of item counts and expected fpp rates.testAccuracyEvenOdd
in N number of iterations inserts N even numbers (2*i), and leaves out N odd numbers (2*i+1) from the BloomFilter.
The test checks the 100% accuracy of
mightContain=true
on all of the even items, and measures themightContain=true
(false positive) rate on the not-inserted odd numbers.testAccuracyRandom
in 2N number of iterations inserts N pseudorandomly generated numbers in two differently seeded (theoretically independent) BloomFilter instances. All the random numbers generated in an even-iteration will be inserted into both filters, all the random numbers generated in an odd-iteration will be left out from both.
The test checks the 100% accuracy of
mightContain=true
for all of the items inserted in an even-loop. It counts the false positives as the number of odd-loop items for which the primary filter reportsmightContain=true
but secondary reportsmightContain=false
. Since we inserted the same elements into both instances, and the secondary reports non-insertion, themightContain=true
from the primary can only be a false positive.patched
One minor (test) issue was fixed in
where the potential repetitions in the randomly generated stream of insertable items resulted in slightly worse fpp measurements than the actual. The problem affected the those testcases more where the cardinality of the tested type is low (the chance of repetition is high), e.g. Byte and Short.
removed from the default runs
Running these test as part of the default build process was turned off with adding
@Disabled
annotation to the new testclass.Was this patch authored or co-authored using generative AI tooling?
No