Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question related to union performance #326

Closed
Cheappie opened this issue Jul 23, 2020 · 2 comments
Closed

Question related to union performance #326

Cheappie opened this issue Jul 23, 2020 · 2 comments

Comments

@Cheappie
Copy link

Hi, I just wanted to ask whether CpcSketch is fastest Sketch for performing union ? And whether is there other way to speed up unions than decreasing logK param ?

`

static class UnionPerfTest {
    public static void main(String[] args) {

        int logK = 11;
        Random r = new Random();

        CpcSketch fllcpc = new CpcSketch(logK);
        IntStream.generate(() -> r.nextInt(10_000_000)).distinct().limit(100_000).forEach(fllcpc::update);

        CpcSketch sllcpc = new CpcSketch(logK);
        IntStream.generate(() -> r.nextInt(10_000_000)).distinct().limit(1_000_000).forEach(sllcpc::update);

        long blackHole = 0;
        long start = System.currentTimeMillis();
        for (int i = 0; i < 1_000_000; i++) {
            CpcUnion cpcUnion = new CpcUnion(logK);

            cpcUnion.update(sllcpc);
            cpcUnion.update(fllcpc);

            blackHole += cpcUnion.getResult().getEstimate();
        }

        System.out.println("Millis: " + (System.currentTimeMillis() - start));
        System.out.println("bc: " + blackHole);
    }
}

`

@leerho
Copy link
Contributor

leerho commented Jul 23, 2020

@Cheappie
We did a characterization study on this question which resulted in the last plot on the CpcPerformance page on our website. And this shows that our current implementation of CPC is faster than HLL, which is faster than Theta for both Java and C++. In addition C++ is faster than Java, as you would expect.

However, this particular characterization test held the number of sketches to be merged as a constant (32) and varied the total number of uniques equally distributed across the 32 sketches. There are numerous other ways to construct a merge speed test. I would suggest constructing a test that most closely resembles how merges are actually performed in your application.

That being said, your particular test includes the time to create the union in addition to the merges and executes only two merges. To reduce the influence of the union construction I would recommend doing a lot more than just 2 update operations.

In our back-end systems, a single union might be merging thousands to millions of sketches. Also, it is not typical for all sketches to have the same cardinality, which we assumed in our test. Your test used two different cardinalities. In the environments we encounter the sketch cardinalities tend to vary with a power-law distribution where only a few sketches are very large and then millions of sketches have very small cardinalities. Unfortunately, constructing meaningful tests for this kind of distribution is challenging because there are so many variables.

Nonetheless, the reason this is interesting is the Theta union operation has an "early-stop" feature that should speed up the unioning process considerably as the number of sketches in the merge loop gets large. This might change the ordering of the performance ranking. Constructing such a test is something we have been wanting to do but haven't yet gotten around to. :)

@Cheappie
Copy link
Author

@leerho
Thank you for such detailed answer. After playing with sketches, benchmarking a bit, and seeing charts I decided to go with CPC. The reason why I do 2 update op's per union is that I want to exclude dissimilar characteristics before performing search for exact counts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants