Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Out-of-bounds when a blank HLL is unioned with a smaller blank HLL #11

Open
Akninirith opened this issue Aug 11, 2014 · 5 comments
Open

Comments

@Akninirith
Copy link

A union operation done on two HLL objects immediately after they are initialized, with a smaller log2m in the parameter HLL than the one doing the union, throws an ArrayIndexOutOfBoundsException. The below code should be capable of reproducing this error.

HLL shortHLL = new HLL(14, 6, -1, false, HLLType.FULL);
HLL longHLL = new HLL(18, 6, -1, false, HLLType.FULL);
longHLL.union(shortHLL);

@ghost
Copy link

ghost commented Aug 11, 2014

@Akninirith Thanks for pointing this out! This should throw an IllegalArgumentException for now, since I haven't reasoned through all the rules for union-ing mixed parameter HLLs.

If you have an immediate use case involving the union of different log2m HLLs, I'd be happy to commit a "folding" static method that would produce a copy of the provided HLL at smaller log2m.

@Akninirith
Copy link
Author

There isn't a use case for this, at the moment at least, but if that changes this thread will be notified ASAP. Regardless, thank you for the timely response!

@yerenkow
Copy link
Contributor

Is this discussed to be case unioning any kind of HLL with EMPTY/EXPLICIT ? Am I correct that for different log2m & regwidth there will be different subsets of bits and different values for counters (both counter # and value to be set), so, basically this would be no-go to correctly merge one counters to another? Or there are cases when this possible?

@rcrezende
Copy link

rcrezende commented Feb 21, 2018

@ghost one scenario is hundreds of thousands of sparse HLL_s (or even millions) with a small regwidth_s is stored in disk. Later, merging happens into a single full HLL_m that can represent bigger cardinality values (regwidth_m>=regwidth_s) just for the purpose of computing the cardinality of the union. For this scenario, HLL_m would not be persisted. That might save disk storage and even reduce read cost.

@DanySK
Copy link

DanySK commented Mar 16, 2020

This may fail suprisingly, though.
I ran into this problem while implementing the algorithms proposed in this paper.

Even though it's a known issue that won't get solved as it is deemed unimportant / too cornery, I think it should at least get accurately documented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants