Bloom refactor #39
Conversation
We are rarely interested in the theoretical optimum value, just the returned value from the suggesting function.
These don't have to be part of the main body of the code, and they contain quite a bit of complexity and interdependency, so I factored them out.
If the eventual aim is to have different bloom filter implementations, then it may be good to first decide what functionality will be common between the filters (if we had a bloom filter 'interface' or trait, what methods would be in it?). Once that is decided, it would probably be easier to make the code readable and extensible.
If we don't want to have different bloom filters, then maybe the maths functions can just stay where they are.
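For illustration, a minimal sketch of what such a shared trait could look like, assuming the common operations are insertion, membership queries, and size reporting; the trait and method names are hypothetical, not taken from the repository:

```rust
/// Hypothetical common interface for different Bloom filter implementations.
/// Names and signatures are illustrative only.
pub trait MembershipFilter {
    /// Record an item in the filter.
    fn insert(&self, item: &[u8]);

    /// Returns true if the item may have been inserted (false positives are
    /// possible), and false if it definitely has not been.
    fn contains(&self, item: &[u8]) -> bool;

    /// Size of the backing storage in bytes.
    fn size_in_bytes(&self) -> usize;
}
```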
@@ -1,56 +1,97 @@
#[cfg(test)]
mod tests {
    use super::super::BloomFilter;
Using `mod tests` seems to be the Rust way of doing unit tests (https://doc.rust-lang.org/rust-by-example/testing/unit_testing.html). Is there a reason you are changing this? With your changes I am not sure that all the functions are covered by `#[cfg(test)]`.
I think that was part of a refactor into a separate file.
Unit tests are not required to be in the same file to be valid, but I forgot about those specific changes.
I think it was because refactoring out into a separate file makes `mod tests` and the import redundant, but I'll have to check.
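For reference, a minimal, self-contained sketch of the same-file convention from the Rust book, where the tests module is only compiled under `cargo test`; the function and values here are placeholders, not the repository's code:

```rust
fn suggest_bytes(n: usize) -> usize {
    // Placeholder for the real sizing logic: round up to a power of two.
    n.next_power_of_two()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn rounds_up_to_a_power_of_two() {
        assert_eq!(suggest_bytes(3_000_000), 4_194_304);
    }
}
```

If the tests are split into their own file instead, the parent module declares `#[cfg(test)] mod tests;`, which keeps the whole file gated behind `cargo test` and makes an inner `mod tests { ... }` wrapper unnecessary.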
    // Add more test cases here as needed
];

for test_case in test_cases {
Doing a loop over test cases means that if the first test case fails, none of the others will run. I think a better approach would be to create a separate method per test, and each such test would create a `TestCase` and pass it to a base method with the test case logic.
I think it was a rough draft to illustrate potential test cases.
If I understand correctly, the more idiomatic approach would be parametrized tests, but it might require importing a test harness such as criterion.
Parametrized tests sound good too.
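As a side note, criterion is primarily a benchmarking harness; for parametrized unit tests a dev-dependency such as `rstest` is a common choice. A sketch of what that could look like here, where the case inputs and parameter types are made-up assumptions and the asserted property (the suggestion being a power of two) is inferred from the existing test:

```rust
#[cfg(test)]
mod tests {
    use super::super::BloomFilter;
    use rstest::rstest; // assumed dev-dependency: rstest

    // Each #[case] expands into its own test, so one failing case
    // does not prevent the others from running.
    #[rstest]
    #[case(1_000_000, 0.01)]
    #[case(10_000_000, 0.001)]
    fn suggested_size_is_a_power_of_two(
        #[case] expected_elements: usize,
        #[case] desired_false_positive_rate: f64,
    ) {
        let suggested =
            BloomFilter::suggest_size_in_bytes(expected_elements, desired_false_positive_rate);
        assert!(suggested.is_power_of_two());
    }
}
```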
BloomFilter::suggest_size_in_bytes(expected_elements, desired_false_positive_rate);
assert_eq!(suggested_size, 4_194_304);
assert_eq!(suggested_size, theoretical_optimum.next_power_of_two())
tested_size, simplified_size,
If `simplified_suggest_size` is an improved form of `suggest_size_in_bytes`, then it is better not to have it in the unit tests like this. Instead, the implementation of `suggest_size_in_bytes` should be replaced with the implementation of `simplified_suggest_size` (probably in a separate change).
I generally agree.
@@ -0,0 +1,53 @@
use super::BloomFilter;

impl BloomFilter {
How does moving the maths functions here help with sharing them with other possible implementations in the future? The only way this maths can be reused is if the other bloom filter implementations keep a copy of `BloomFilter` internally. I don't think it's a good idea to do that.
Maybe you can put the maths logic in a new struct. Then it would make sense for multiple filter implementations to consume the maths logic.
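For illustration, a rough sketch of what such a separate maths type could look like, using the standard Bloom filter formulas; the name `BloomMath` and its methods are hypothetical, not part of this PR:

```rust
/// Hypothetical home for the sizing maths, independent of any particular
/// filter implementation, so several implementations can consume it.
pub struct BloomMath;

impl BloomMath {
    /// Optimal number of bits for `n` expected elements at false-positive
    /// rate `p`: m = -n * ln(p) / (ln 2)^2.
    pub fn optimal_bits(n: usize, p: f64) -> usize {
        let ln2 = std::f64::consts::LN_2;
        (-(n as f64) * p.ln() / (ln2 * ln2)).ceil() as usize
    }

    /// Optimal number of hash functions: k = (m / n) * ln 2.
    pub fn optimal_hashers(bits: usize, n: usize) -> usize {
        ((bits as f64 / n as f64) * std::f64::consts::LN_2).round().max(1.0) as usize
    }
}
```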
That makes sense as well.
As for a potential other implementation of Bloom, it would be the more conventional lock-based Bloom filter. It would be mostly the same except for using a Mutex to wrap the Bloom filter.
I am not sure yet whether bff and dolma will be parallel implementations, stay independent, or whether one will replace the other. Can anybody shed some light here?
@dirkgr Do you have any context on this, or more broadly any idea about the future of the filter in the dolma and bff repos?
BFF and Dolma will be separate for the time being, mainly because we don't have the time to focus on unifying them. In my ideal scenario BFF would be a crate that can run standalone, but also be consumed by Dolma as a library. But unless you, @chris-ha458, want to step up to make that happen, I think we have to keep them separate. I asked @2015aroras to back-port some of your fixes to BFF because they seem quite important, especially the one about the stable hash functions. I am not super excited about switching to mutex-based locking unless you can show numbers that prove it is conclusively faster, even with >256 cores running in parallel. I know the atomic int stuff looks like a lot of code, but it should compile down to a single instruction, no?
A mutex-based approach would definitely not be faster. Depending on lock contention (which would be expected to be high), using it with multiple threads could even be slower than using a single thread. Hopefully such an implementation could be (easily) done when BFF becomes a standalone crate that is consumed by dolma and tested either as part of dolma or BFF integration tests.
In line with this comment, I think that making BFF a standalone crate that is consumed by dolma should be prioritized over making the implementation thread-safe (and maybe even over using a stable hashing implementation).
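To make the atomic-versus-mutex point above concrete, a minimal sketch of the lock-free pattern being discussed: setting a bit with a single atomic read-modify-write instead of taking a lock. The `Vec<AtomicU32>` layout here is an assumption for illustration, not necessarily what bff does internally:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Toy lock-free bit set that can be shared across threads without a Mutex.
struct AtomicBits {
    words: Vec<AtomicU32>,
}

impl AtomicBits {
    fn new(num_bits: usize) -> Self {
        let words: Vec<AtomicU32> =
            (0..(num_bits + 31) / 32).map(|_| AtomicU32::new(0)).collect();
        Self { words }
    }

    /// `fetch_or` typically compiles down to a single atomic instruction.
    fn set(&self, bit: usize) {
        self.words[bit / 32].fetch_or(1u32 << (bit % 32), Ordering::Relaxed);
    }

    fn get(&self, bit: usize) -> bool {
        (self.words[bit / 32].load(Ordering::Relaxed) & (1u32 << (bit % 32))) != 0
    }
}
```

A mutex-based variant would serialize every insert, which is why high contention is expected to hurt throughput as the thread count grows.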
I have further refactored some Bloom code and tests.
This would help with reimplementing the current Bloom or adding alternatives to it.
It isolates functions and math related to Bloom that could be shared.
I recognize that this is a significant refactor that might not align with the goals of the developers or the project, but I do feel that it improves the readability, maintainability, and extensibility of the code.
Although I have some more changes in mind, I bring this forward now so we can discuss how (or whether) to proceed further.
One thing I did notice during the refactor was a peculiarity of `pub fn suggest_size_in_bytes`.
For clarity, I extracted the value being compared out as
`let max_size: usize = usize::MAX / 2; // ~9.2e18 bytes (8 EiB)`
As per my comments there, that seems like an unrealistic upper bound.
I would think something closer to 1 TiB would be more appropriate, or for it to be configurable.
But I do not know the specific setup that allenai is using internally, so instead of trying to change it myself, I bring it to your attention.
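As a sketch of the configurable-cap idea, assuming nothing about the project's actual API; the 1 TiB default mirrors the suggestion above and the names are made up:

```rust
/// Hypothetical default cap on the suggested filter size: 1 TiB.
const DEFAULT_MAX_SIZE_IN_BYTES: usize = 1 << 40;

/// Clamp a suggested size to a caller-supplied cap, falling back to the default.
fn clamp_suggested_size(suggested: usize, max_size: Option<usize>) -> usize {
    suggested.min(max_size.unwrap_or(DEFAULT_MAX_SIZE_IN_BYTES))
}
```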
Also, I am now feeling that we could just replace `pub fn suggest_size_in_bytes` with a function that I originally built for testing, `pub fn simplified_suggest_size`.
That function looks simpler (subjectively) and does not rely on loops, mutable variables, or outside functions (since we can derive it from first principles of Bloom filter maths).
We could still keep the original function (maybe swapping their places) for testing and comparison purposes as well.
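For reference, a sketch of what a closed-form suggestion could look like, assuming the intent is the textbook optimum of m = -n * ln(p) / (ln 2)^2 bits, converted to bytes and rounded up to a power of two as the existing test implies; this is a reconstruction, not the actual `simplified_suggest_size` in the PR:

```rust
/// Closed-form size suggestion derived from standard Bloom filter maths.
/// A sketch for discussion, not the repository's implementation.
pub fn simplified_suggest_size(expected_elements: usize, false_positive_rate: f64) -> usize {
    let ln2 = std::f64::consts::LN_2;
    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2.
    let bits = -(expected_elements as f64) * false_positive_rate.ln() / (ln2 * ln2);
    // Convert to bytes and round up to the next power of two.
    ((bits / 8.0).ceil() as usize).next_power_of_two()
}
```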
These are the kind of discoveries and (hopefully) changes that I hoped to make through this refactor.