In the current version of Sparse Joins and lookups, we are using Algebird's Bloom Filters Aggregators.
This works well and is very easy and elegant to express in Scala, but its not performant. It takes a long time and CPU to build this filter (~3 hrs wall time for 3M entries). For smaller datasets (~5M entries) I found a simple HashSet filter to work better, though that starts to go near the limits of SideInput size. For entries around 50M, it starts taking longer than a shuffle join.
The Bloom Filters from Breeze as well as ClearSpring Stream lib however gets constructed within 3 to 4 min wall time for 3M entries.
Suggestion 1:
I feel using Stream Lib from AddThis / ClearSpring instead of Algebird BloomFilter would allow us to optimise a lot more jobs. I am not sure why, but I was having a very high false positives with Breeze BFs (may be I was making a mistake, as it seems weird)
Breaking API changes(?)
Algebird provides us with a Hash128[K], however the BF in StreamLib takes a byte array / String, which means we would need to add an implicit ByteArrayEncoder for primitive types which are present in algebird's Hash128 to make sure we are not breaking any API.
Can we go ahead with adding this additional implicit, and move away from Algebird's BF for sparse functions?
Also because this adds an additional dependency, should this be a part of scio-extra which would again means removing those methods from scio-core thus breaking again? I am a bit confused about adding an extra dependency.
I have some code in an internal repository using StreamLib and here are some benchmarks. Based on suggestions on the above, I can create a PR fixing this.
Here is a Dataflow join of ~4M with two datasets of 1B each.
Using Algebird Bloom Filters

Using ClearSpring / AddThis BloomFilters

Suggestion 2:
Optimise Algebird Bloom Filters, and make it run faster. The Bloom Filter in Algebird is running very slow, but it can surely be fixed to make this work faster.
In the current version of Sparse Joins and lookups, we are using Algebird's Bloom Filters Aggregators.
This works well and is very easy and elegant to express in Scala, but its not performant. It takes a long time and CPU to build this filter (~3 hrs wall time for 3M entries). For smaller datasets (~5M entries) I found a simple HashSet filter to work better, though that starts to go near the limits of SideInput size. For entries around 50M, it starts taking longer than a shuffle join.
The Bloom Filters from Breeze as well as ClearSpring Stream lib however gets constructed within 3 to 4 min wall time for 3M entries.
Suggestion 1:
I feel using Stream Lib from AddThis / ClearSpring instead of Algebird BloomFilter would allow us to optimise a lot more jobs. I am not sure why, but I was having a very high false positives with Breeze BFs (may be I was making a mistake, as it seems weird)
Breaking API changes(?)
Algebird provides us with a Hash128[K], however the BF in StreamLib takes a byte array / String, which means we would need to add an implicit ByteArrayEncoder for primitive types which are present in algebird's Hash128 to make sure we are not breaking any API.
Can we go ahead with adding this additional implicit, and move away from Algebird's BF for sparse functions?
Also because this adds an additional dependency, should this be a part of
scio-extrawhich would again means removing those methods fromscio-corethus breaking again? I am a bit confused about adding an extra dependency.I have some code in an internal repository using StreamLib and here are some benchmarks. Based on suggestions on the above, I can create a PR fixing this.
Here is a Dataflow join of ~4M with two datasets of 1B each.

Using Algebird Bloom Filters
Using ClearSpring / AddThis BloomFilters

Suggestion 2:
Optimise Algebird Bloom Filters, and make it run faster. The Bloom Filter in Algebird is running very slow, but it can surely be fixed to make this work faster.