[SPARK-1170] Add histogram method to Python's RDD API #1783

nrchandan · 2014-08-05T08:20:12Z

Tested and ready to merge.

…Buckets).

AmplabJenkins · 2014-08-05T08:22:45Z

Can one of the admins verify this patch?

ScrapCodes · 2014-08-05T08:23:40Z

Jenkins, test this please.

ScrapCodes · 2014-08-05T09:05:50Z

python/pyspark/rdd.py

+            max = mm_stats.max()
+            increment = (max - min) * 1.0 / bucketCount
+            if increment != 0:
+                buckets = [round(min+x*increment, 2) for x in range(bucketCount+1)]


So this generates ranges in two digits of precision in case of floating point. I feel we should make it at least three.

And please add a test case for this too.
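To illustrate the precision concern, here is a minimal standalone sketch (not the PR's code). `even_bucket_edges` and its `digits` parameter are hypothetical names introduced only for this example; the arithmetic mirrors the `round(min+x*increment, 2)` expression in the diff above.

```python
def even_bucket_edges(lo, hi, bucket_count, digits=None):
    """Return bucket_count + 1 evenly spaced edges from lo to hi.

    If digits is given, each edge is rounded to that many decimal places,
    mirroring the round(..., 2) call in the quoted diff.
    """
    increment = (hi - lo) * 1.0 / bucket_count
    edges = [lo + i * increment for i in range(bucket_count + 1)]
    if digits is not None:
        edges = [round(e, digits) for e in edges]
    return edges


# A narrow range: the true edges are 0.0025 apart, finer than two decimals.
rounded = even_bucket_edges(0.0, 0.01, 4, digits=2)
exact = even_bucket_edges(0.0, 0.01, 4)

print("rounded edges:", rounded)
print("exact edges:  ", exact)
print("distinct rounded edges:", len(set(rounded)))
print("distinct exact edges:  ", len(set(exact)))
```

When the data range is small relative to the rounding step, several rounded edges coincide and the corresponding buckets become empty intervals. Any fixed number of digits has this problem for some range, which is one argument for not rounding at all.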

JoshRosen · 2014-08-05T16:26:41Z

I'll try to review this later today or tomorrow.

@davies, you might want to take a look at this, too?

davies · 2014-08-05T16:35:41Z

python/pyspark/rdd.py

@@ -901,6 +902,97 @@ def sampleVariance(self):
        1.0
        """
        return self.stats().sampleVariance()
+
+    def histogram(self, buckets=None, evenBuckets=False, bucketCount=None):


we can define this API as

def histogram(self, buckets, even=False):

buckets can be list or int.

That makes sense. Why didn't I come up with it :)
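A rough sketch of how a single `buckets` argument could accept either a list or an int, as suggested above. `normalize_buckets` is a hypothetical helper written for this illustration and is not taken from either pull request; `lo` and `hi` stand in for the minimum and maximum of the data.

```python
def normalize_buckets(buckets, lo, hi):
    """Turn the `buckets` argument into an explicit list of bucket edges.

    Hypothetical helper: an int means "this many evenly spaced buckets
    between lo and hi"; a list or tuple is taken as the edges themselves.
    """
    if isinstance(buckets, bool):
        # bool is a subclass of int, so reject it explicitly.
        raise TypeError("buckets should be a list, tuple, or int")
    if isinstance(buckets, int):
        if buckets < 1:
            raise ValueError("number of buckets must be >= 1")
        inc = (hi - lo) * 1.0 / buckets
        edges = [lo + i * inc for i in range(buckets)]
        edges.append(hi)  # use hi itself to avoid accumulated float error
        return edges
    if isinstance(buckets, (list, tuple)):
        if len(buckets) < 2:
            raise ValueError("buckets should have at least two edges")
        if list(buckets) != sorted(buckets):
            raise ValueError("buckets should be sorted")
        return list(buckets)
    raise TypeError("buckets should be a list, tuple, or int")


print(normalize_buckets(2, 0, 10))
print(normalize_buckets([1, 5, 9], 0, 0))
```

Dispatching on the argument type keeps the signature to two parameters, at the cost of a runtime type check that the three-parameter version does not need.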

JoshRosen · 2014-08-05T16:57:32Z

@freeman-lab @sryza @MLnick might also be interested in this, since we also wrote a histogram function at the PySpark + scikit-learn hackathon: https://github.com/ogrisel/spylearn/blob/master/spylearn/histogram.py

davies · 2014-08-05T17:44:45Z

python/pyspark/rdd.py

+            bucketNumber = (e - minimum) * 1.0 / inc # avoid Python integer div
+            if (bucketNumber > count or bucketNumber < 0):
+                return None
+            return min(int(bucketNumber), count -1)


The right edge of each bucket is open, so if bucketNumber is an integer, the return value should be int(bucketNumber) - 1.

@davies This part of the code is taken straight from the Scala version of the histogram API. I will investigate and get back to you.
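To make the boundary case under discussion concrete, here is a standalone sketch of the quoted logic. The function name and the `maximum` parameter are assumptions added so the snippet runs on its own; it only shows what the quoted expression returns and does not settle which behavior is intended.

```python
def even_bucket_index(e, minimum, maximum, count):
    """Return the bucket index for e, or None if e is out of range.

    Follows the quoted diff: the index is the integer part of the
    fractional bucket number, clamped so that e == maximum falls in
    the last bucket. Assumes maximum > minimum and count >= 1.
    """
    inc = (maximum - minimum) * 1.0 / count  # float division
    bucket_number = (e - minimum) / inc
    if bucket_number > count or bucket_number < 0:
        return None
    return min(int(bucket_number), count - 1)


# Five buckets over [0, 10], so the edges are 0, 2, 4, 6, 8, 10.
for value in (3.9, 4, 10, -1, 11):
    print(value, "->", even_bucket_index(value, 0, 10, 5))
```

With these inputs, a value exactly on the interior edge 4 gets index 2, i.e. the bucket that starts at 4, while the maximum 10 is clamped into the last bucket (index 4). Subtracting 1 whenever `bucketNumber` is an integer would instead place 4 in the bucket that ends at 4.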

davies · 2014-08-05T17:59:29Z

Thank you for working on this.

Last night I also started working on this API; my version is implemented as a wrapper that calls the Java API. [https://github.com//pull/1791/files]

What do you think of these two approaches?

freeman-lab · 2014-08-05T19:19:44Z

Cool work guys, this would be great to have. Both approaches are likely more efficient than what we did at the hackathon, which was a simpler implementation. I'd be curious about the performance difference between the two. The Java API version might be faster, and also easier to maintain. I could try to do some large-scale testing if that would be useful.

davies · 2014-08-05T20:14:02Z

The Java API can only work better for floats; this Python version also works for ints, and even for strings and complex numbers.

Should we merge this one first? In the future, maybe we could try to use the Java API only when the RDD is a JavaRDDDouble.
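As a sketch of why a pure-Python version is not tied to floats: bucketing against explicit edges needs only ordering comparisons, so it works for any values that support `<`, such as ints or strings. `count_into_buckets` below is a hypothetical local function, not the code from either pull request, and it runs on a plain Python iterable rather than an RDD.

```python
from bisect import bisect_right


def count_into_buckets(values, edges):
    """Count values into buckets defined by a sorted list of edges.

    Each bucket is [edges[i], edges[i+1]), except the last one, which
    also includes its right edge. Out-of-range values are skipped.
    Only ordering comparisons are used, so any mutually comparable
    values work.
    """
    counts = [0] * (len(edges) - 1)
    lo, hi = edges[0], edges[-1]
    for v in values:
        if v < lo or v > hi:
            continue
        i = bisect_right(edges, v) - 1
        counts[min(i, len(counts) - 1)] += 1  # clamp v == hi into last bucket
    return counts


print(count_into_buckets([1, 5, 10, 11], [0, 5, 10]))
print(count_into_buckets(["b", "m", "zz"], ["a", "m", "z"]))
```

Using binary search over the edges also handles unevenly spaced buckets, where the divide-by-increment shortcut does not apply.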

davies · 2014-08-06T05:56:22Z

I have made a version similar to this in #1791; please take a look at it.

nrchandan · 2014-08-06T06:00:31Z

@davies I'll go through your code today. I think you meant #1791

nrchandan · 2014-08-06T06:37:58Z

@davies #1791 looks good. Feel free to close this one as duplicate.

dwmclary and others added 4 commits August 5, 2014 12:52

added histogram method, added max and min to statscounter

aecb5bc

SPARK-1170 Added histogram(buckets) to pyspark and not histogram(noOf…

0c2bbdd

…Buckets).

SPARK-1170. Merged commits and fixed bugs in both the original commits

8427db6

SPARK-1170. Merged commits and fixed bugs in both the original commits

bdd3d7a

SPARK-1170. Fix a test case.

7fe070a

ScrapCodes reviewed Aug 5, 2014

Chandan Kumar added 2 commits August 5, 2014 15:34

[SPARK-1170] Remove unnecessary rounding

7b522d5

SPARK-1170. Fix a typo in doc comment.

c8dd625

davies reviewed Aug 5, 2014

davies mentioned this pull request Aug 6, 2014

[SPARK-2871] [PySpark] Add missing API #1791

Closed

nrchandan closed this Aug 6, 2014
