[SPARK-2871] [PySpark] add countApproxDistinct() API #2142


Closed
wants to merge 9 commits

Conversation

davies
Contributor

@davies davies commented Aug 26, 2014

RDD.countApproxDistinct(relativeSD=0.05):

    :: Experimental ::
    Return approximate number of distinct elements in the RDD.

    The algorithm used is based on streamlib's implementation of
    "HyperLogLog in Practice: Algorithmic Engineering of a State
    of The Art Cardinality Estimation Algorithm", available
    <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.

    This supports all objects that can be serialized by Pyrolite,
    which covers nearly all builtin types.

    @param relativeSD Relative accuracy. Smaller values create
                       counters that require more space.
                       It must be greater than 0.000017.

    >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
    >>> 950 < n < 1050
    True
    >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
    >>> 18 < n < 22
    True

@davies
Contributor Author

davies commented Aug 27, 2014

@mateiz @JoshRosen this one was separated from #1791, please take a look at it. (@mateiz had already reviewed this part in that PR; sorry for the duplicate review.)

counters that require more space.
It must be greater than 0.000017.

>>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
Contributor

Can you add a test to make sure that if you have 1000 non-distinct elements (i.e. the same element appearing 1000 times), this doesn't return ~1000?

Asking because I'm not sure how PySpark interacts with Java: if the data crosses over as byte arrays, the hash code could be wrong, since Java arrays hash by identity rather than by content.

Contributor Author

Good point. It helped me find an invalid test case involving tuples, which are unpickled as Object[] in Java; an array's hashCode is not determined by its contents, so I changed the test to use set(), which should behave similarly in Python and Java.

Contributor Author

I changed it to do the hash calculation in Python, so it can support all hashable types in Python.
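A hedged sketch of that idea (simplified stand-ins, not PySpark's actual helpers; the real `portable_hash` lives in pyspark/rdd.py, and Spark feeds the hashes to HyperLogLog rather than a set):

```python
# Hash on the Python side, so the JVM only ever sees plain integers.

def portable_hash(x):
    """Content-based hash; tuples are combined element by element."""
    if isinstance(x, tuple):
        h = 0x345678
        for i in x:
            h ^= portable_hash(i)
            h *= 1000003
            h &= 0xFFFFFFFFFFFFFFFF   # stay within 64 bits
        return h
    return hash(x) & 0xFFFFFFFFFFFFFFFF

def count_approx_distinct(items):
    """Stand-in pipeline: hash in Python, then count distinct hashes.

    Spark would feed these hashes to HyperLogLog instead of a set.
    """
    return len({portable_hash(x) for x in items})
```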

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2142 at commit dfd2a2a.

  • This patch merges cleanly.

@davies davies force-pushed the countApproxDistinct branch from dfd2a2a to 4cba98f Compare August 28, 2014 03:26
@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2142 at commit 4cba98f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2142 at commit 4cba98f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2142 at commit dfd2a2a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies davies force-pushed the countApproxDistinct branch from 2541785 to ded624f Compare August 28, 2014 19:19
@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2142 at commit ded624f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2142 at commit d306492.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2142 at commit 9d2565f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2142 at commit ded624f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2142 at commit d306492.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2142 at commit 9d2565f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2142 at commit 2ab157c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2142 at commit 2ab157c.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have started for PR 2142 at commit c38c4e4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2142 at commit c38c4e4.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -72,7 +72,7 @@ def portable_hash(x):
         for i in x:
             h ^= portable_hash(i)
             h *= 1000003
-            h &= 0xffffffff
+            h &= sys.maxint
Contributor

Why this change?

Contributor Author

On 64-bit machines, the hash of a tuple should be 64 bits wide (sys.maxint is 64 bits).
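A quick illustration of the width difference (note sys.maxint is Python 2; Python 3 spells it sys.maxsize, used here):

```python
import sys

# Masking with 0xffffffff keeps only 32 bits; masking with sys.maxint
# (sys.maxsize in Python 3, 2**63 - 1 on a 64-bit build) keeps the full
# non-negative range of a 64-bit tuple hash.
h = hash((1, 2, 3, "a"))     # may be negative and up to 64 bits wide
h32 = h & 0xFFFFFFFF         # truncated to 32 bits
h63 = h & sys.maxsize        # non-negative, up to 63 bits

assert 0 <= h32 < 2 ** 32
assert 0 <= h63 <= sys.maxsize
```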

@JoshRosen
Contributor

To check my understanding of the error-correction code:

  • Due to hash collisions, we may underestimate the true number of distinct items.
  • Given a set of k random 32-bit hashes, the exact probability of at least one collision (from your link) is 1 - e^(-k(k-1)/(2·2^32)). We can approximate 1 - e^(-X) by X for small X. Therefore, the approximate probability of some hash collision is k(k-1)/(2·2^32), which is roughly k^2/2^33.

I'm a bit confused about how the current correction term works:

c = - sys.maxint * log(1 - float(c) / sys.maxint)

It looks like this is correcting for overestimates of the number of distinct elements by subtracting a term based on the collision probability. In general, won't collisions cause us to underestimate rather than overestimate?

Maybe we should approach this by treating the true number of distinct items (k) as a random variable and figuring out the maximum likelihood estimator of k given an observation c of the result of countApproxDistinct.

Before we consider that, though, I wonder whether we even need a correction term. Doesn't the Java implementation of countApproxDistinct already correct for the hashing errors it introduces? I don't think the two levels of hashing will introduce more error, since I think the hashCode of a Java Integer is just its value.

Note: I'm not a statistician; please correct me if I've gotten anything wrong.
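The arithmetic above can be checked numerically (hypothetical helpers, just to compare the exact birthday bound against the k^2/2^33 approximation):

```python
import math

def collision_prob(k, bits=32):
    """Exact birthday-bound probability of >= 1 collision among k hashes."""
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * 2 ** bits))

def approx_collision_prob(k, bits=32):
    """Small-k approximation: k^2 / 2^(bits + 1)."""
    return k * k / 2.0 ** (bits + 1)

# For k = 1000 both come out around 1.2e-4 and agree to well under 1%.
```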

@davies
Contributor Author

davies commented Sep 2, 2014

After discussing this with @JoshRosen offline, we realized that no correction is needed in Python as long as Python and Java use the same hash space, so I changed the hash space (mapping into 2^32) and removed the correction.
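A sketch of what "the same hash space" means here (a hypothetical helper, not PySpark's code): fold the Python hash into the signed 32-bit range, so it coincides with Java's Integer.hashCode, which is the integer's own value.

```python
def to_java_int_hash(x):
    """Map a Python hash into Java's signed 32-bit int range."""
    h = hash(x) & 0xFFFFFFFF        # into the 2^32 hash space
    if h >= 0x80000000:             # wrap around to signed 32-bit
        h -= 0x100000000
    return h

# For small non-negative ints this is the identity, matching
# Integer.hashCode on the JVM side, so no further correction is needed.
```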

Jenkins, test this please.

@SparkQA

SparkQA commented Sep 2, 2014

QA tests have started for PR 2142 at commit e20da47.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 2, 2014

QA tests have finished for PR 2142 at commit e20da47.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

Yeah, I don't think we need any special correction, because Java will use the hash codes chosen in Python (since Integer.hashCode is just the integer's value).

@JoshRosen
Contributor

LGTM, so I've merged this into master. Thanks!

@asfgit asfgit closed this in e2c901b Sep 2, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
Author: Davies Liu <davies.liu@gmail.com>

Closes apache#2142 from davies/countApproxDistinct and squashes the following commits:

e20da47 [Davies Liu] remove the correction in Python
c38c4e4 [Davies Liu] fix doc tests
2ab157c [Davies Liu] fix doc tests
9d2565f [Davies Liu] add commments and link for hash collision correction
d306492 [Davies Liu] change range of hash of tuple to [0, maxint]
ded624f [Davies Liu] calculate hash in Python
4cba98f [Davies Liu] add more tests
a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct
e97e342 [Davies Liu] add countApproxDistinct()
@davies davies deleted the countApproxDistinct branch September 15, 2014 22:16