Skip to content

[SPARK-2724] Python version of RandomRDDGenerators #1628

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

dorx
Copy link
Contributor

@dorx dorx commented Jul 29, 2014

RandomRDDGenerators but without support for randomRDD and randomVectorRDD, which take in arbitrary DistributionGenerator.

randomRDD.py is named to avoid collision with the built-in Python random package.

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17329/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17329/consoleFull

/**
* Java stub for Python mllib RandomRDDGenerators.poissonVectorRDD()
*/
def poissonVectorRDD(jsc: JavaSparkContext,mean: Double,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move mean: Double to next line

@mengxr
Copy link
Contributor

mengxr commented Jul 29, 2014

LGTM except minor inline comments. For the file name, it should be possible to have a package named random, for example, numpy.random: http://docs.scipy.org/doc/numpy/reference/routines.random.html

@dorx
Copy link
Contributor Author

dorx commented Jul 29, 2014

In NumPy's source, they had a directory named random: https://github.com/numpy/numpy/tree/master/numpy/random
It seems like having directory hierarchy is the only way to organize packages:
https://docs.python.org/2/tutorial/modules.html#packages
In the flat structure that we have right now, naming a file random.py would override the python random package. We could either break the flat structure we currently have, rename the package to something else in Scala (although I don't know what's a good alternative), or let it be named something other than random. I'm okay with any of these options.

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17370/consoleFull

@mengxr
Copy link
Contributor

mengxr commented Jul 29, 2014

Yes, having directories is the way to organize packages in python. We can make a folder for random and include the python files in mllib/pom.xml. Otherwise, user needs from pyspark.mllib.randomRDD import RandomRDDGenerators, which is a little strange.

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17370/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17377/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA results for PR 1628:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17377/consoleFull

return long(getrandbits(63))


def _test():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For these tests to run automatically, you also need to add this file into the python/run-tests script. Otherwise it won't automatically discover it, e.g. in Jenkins.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep caught that while looking inside run-tests. Thanks for the reminder.

from pyspark.rdd import RDD
from pyspark.mllib._common import _deserialize_double, _deserialize_double_vector
from pyspark.serializers import NoOpSerializer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a doc comment to this package similar to the one on RandomRDDGenerators.scala

@SparkQA
Copy link

SparkQA commented Jul 30, 2014

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17407/consoleFull

@dorx
Copy link
Contributor Author

dorx commented Jul 30, 2014

Btw from pyspark.mllib import random now works with the latest commit in the pyspark shell.

@SparkQA
Copy link

SparkQA commented Jul 30, 2014

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17407/consoleFull

@mengxr
Copy link
Contributor

mengxr commented Jul 30, 2014

@dorx I tried import pyspark.mllib.random and it failed. It has to be from pyspark.mllib import random. And to use RandomRDDGenerators, I need to call random.RandomRDDGenerators. Ideally, it should be from pyspark.mllib.random import RandomRDDGenerators. If we know how to handle the name random now, maybe we can create random.py under mllib and define class RandomRDDGenerators there. If it is not easy to do that because of python's own random package, it should be fine to rename the package name to rand in both Python and Scala.

@mateiz
Copy link
Contributor

mateiz commented Jul 30, 2014

If you can't figure out whether this is possible, consider pinging Josh or Davies too. I'd be surprised if there's no way around this because there are a lot of top-level packages in Python. There's gotta be a way to import our own vs importing theirs.

@mengxr
Copy link
Contributor

mengxr commented Jul 30, 2014

@JoshRosen If we don't support 2.5, could we use from __future__ import absolute_import?

@JoshRosen
Copy link
Contributor

@mengxr Yeah, that would be okay to use but it turns out that it doesn't solve the lin_alg.py problem.

from __future__ import absolute_import enables absolute imports, but only in the file that contains it. In the failing linalg.py example, the problematic import random occurred in third-party code. If they added from __future__ import absolute_import, things would work fine.

@SparkQA
Copy link

SparkQA commented Jul 30, 2014

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17498/consoleFull

@@ -49,6 +49,12 @@
Main entry point for accessing data stored in Apache Hive..
"""


import sys
s = sys.path.pop(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We definitely need some comments here to explain what is going on.

and added docs for hacks that allow us to keep the module name
mllib.random.
@SparkQA
Copy link

SparkQA commented Jul 31, 2014

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17606/consoleFull


>>> x = RandomRDDGenerators.normalRDD(sc, 1000, seed=1L).collect()
>>> from pyspark.statcounter import StatCounter
>>> stats = StatCounter(x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stats = x.stats()

@mengxr
Copy link
Contributor

mengxr commented Jul 31, 2014

LGTM except minor inline comments.

@SparkQA
Copy link

SparkQA commented Jul 31, 2014

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17610/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 31, 2014

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17606/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 31, 2014

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17610/consoleFull

@mengxr
Copy link
Contributor

mengxr commented Aug 1, 2014

Merged into master. Thanks!!

@asfgit asfgit closed this in d843014 Aug 1, 2014
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
RandomRDDGenerators but without support for randomRDD and randomVectorRDD, which take in arbitrary DistributionGenerator.

`randomRDD.py` is named to avoid collision with the built-in Python `random` package.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes apache#1628 from dorx/pythonRDD and squashes the following commits:

55c6de8 [Doris Xin] review comments. all python units passed.
f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
2d73917 [Doris Xin] fix for linalg.py
8663e6a [Doris Xin] reverting back to a single python file for random
f47c481 [Doris Xin] docs update
687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
4338f40 [Doris Xin] renamed randomRDD to rand and import as random
29d205e [Doris Xin] created mllib.random package
bd2df13 [Doris Xin] typos
07ddff2 [Doris Xin] units passed.
23b2ecd [Doris Xin] WIP
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants