Skip to content

[SPARK-1170] Add histogram method to Python's RDD API #1783

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from

Conversation

nrchandan
Copy link

Tested and ready to merge.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@ScrapCodes
Copy link
Member

Jenkins, test this please.

max = mm_stats.max()
increment = (max - min) * 1.0 / bucketCount
if increment != 0:
buckets = [round(min+x*increment, 2) for x in range(bucketCount+1)]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this generates ranges in two digits of precision in case of floating point. I feel we should make it at least three.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And please add a test case for this too.

@JoshRosen
Copy link
Contributor

I'll try to review this later today or tomorrow.

@davies, you might want to take a look at this, too?

@@ -901,6 +902,97 @@ def sampleVariance(self):
1.0
"""
return self.stats().sampleVariance()

def histogram(self, buckets=None, evenBuckets=False, bucketCount=None):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can define this API as

def histogram(self, buckets, even=False):

buckets can be list or int.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense. Why didn't I come up with it :)

@JoshRosen
Copy link
Contributor

@freeman-lab @sryza @MLnick might also be interested in this, since we also wrote a histogram function at the PySpark + scikit-learn hackathon: https://github.com/ogrisel/spylearn/blob/master/spylearn/histogram.py

bucketNumber = (e - minimum) * 1.0 / inc # avoid Python integer div
if (bucketNumber > count or bucketNumber < 0):
return None
return min(int(bucketNumber), count -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The right of bucket is open, so if bucketNumber is integer, the return value should be int(bucketNumer) - 1

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davies This part of the code is taken straight from the Scala version of the histogram API. I will investigate and get back to you.

@davies
Copy link
Contributor

davies commented Aug 5, 2014

Thank you for working on this.

Last night, I just start to working on this API, it's implemented as wrapper to call Java API. [https://github.com//pull/1791/files]

How do you think of this two approaches?

@freeman-lab
Copy link
Contributor

Cool work guys, this would be great to have. Both approaches are likely more efficient that what we did at the hackathon, which was a simpler implementation. I'd be curious about performance between the two. The Java API version might be faster, and also easier to maintain. I could try to do some large-scale testing if useful.

@davies
Copy link
Contributor

davies commented Aug 5, 2014

The Java API can only works better for float, this Python version can work for int, even string and complex.

Should we merge this one first? In the future, may be could try to use Java API only when the rdd is JavaRDDDouble.

@davies
Copy link
Contributor

davies commented Aug 6, 2014

I had make a version similar to this in #1791 , plz take a look at it.

@nrchandan
Copy link
Author

@davies I'll go through your code today. I think you meant #1791

@nrchandan
Copy link
Author

@davies #1791 looks good. Feel free to close this one as duplicate.

@nrchandan nrchandan closed this Aug 6, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants