-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-1170] Add histogram method to Python's RDD API #1783
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Can one of the admins verify this patch? |
Jenkins, test this please. |
max = mm_stats.max() | ||
increment = (max - min) * 1.0 / bucketCount | ||
if increment != 0: | ||
buckets = [round(min+x*increment, 2) for x in range(bucketCount+1)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this generates ranges in two digits of precision in case of floating point. I feel we should make it at least three.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And please add a test case for this too.
I'll try to review this later today or tomorrow. @davies, you might want to take a look at this, too? |
@@ -901,6 +902,97 @@ def sampleVariance(self): | |||
1.0 | |||
""" | |||
return self.stats().sampleVariance() | |||
|
|||
def histogram(self, buckets=None, evenBuckets=False, bucketCount=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can define this API as
def histogram(self, buckets, even=False):
buckets can be list or int.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense. Why didn't I come up with it :)
@freeman-lab @sryza @MLnick might also be interested in this, since we also wrote a histogram function at the PySpark + scikit-learn hackathon: https://github.com/ogrisel/spylearn/blob/master/spylearn/histogram.py |
bucketNumber = (e - minimum) * 1.0 / inc # avoid Python integer div | ||
if (bucketNumber > count or bucketNumber < 0): | ||
return None | ||
return min(int(bucketNumber), count -1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The right of bucket is open, so if bucketNumber is integer, the return value should be int(bucketNumer) - 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davies This part of the code is taken straight from the Scala version of the histogram API. I will investigate and get back to you.
Thank you for working on this. Last night, I just start to working on this API, it's implemented as wrapper to call Java API. [https://github.com//pull/1791/files] How do you think of this two approaches? |
Cool work guys, this would be great to have. Both approaches are likely more efficient that what we did at the hackathon, which was a simpler implementation. I'd be curious about performance between the two. The Java API version might be faster, and also easier to maintain. I could try to do some large-scale testing if useful. |
The Java API can only works better for float, this Python version can work for int, even string and complex. Should we merge this one first? In the future, may be could try to use Java API only when the rdd is JavaRDDDouble. |
I had make a version similar to this in #1791 , plz take a look at it. |
Tested and ready to merge.