[SPARK-2724] Python version of RandomRDDGenerators #1628
QA tests have started for PR 1628. This patch merges cleanly. |
QA results for PR 1628: |
/**
 * Java stub for Python mllib RandomRDDGenerators.poissonVectorRDD()
 */
def poissonVectorRDD(jsc: JavaSparkContext,mean: Double,
Move `mean: Double` to the next line.
LGTM except minor inline comments. For the file name, it should be possible to have a package named |
In NumPy's source, they had a directory named random: https://github.com/numpy/numpy/tree/master/numpy/random |
QA tests have started for PR 1628. This patch merges cleanly. |
Yes, having directories is the way to organize packages in python. We can make a folder for |
QA results for PR 1628: |
QA tests have started for PR 1628. This patch merges cleanly. |
QA results for PR 1628: |
return long(getrandbits(63))


def _test():
For these tests to run automatically, you also need to add this file to the python/run-tests script. Otherwise it won't be discovered automatically, e.g. in Jenkins.
Yep caught that while looking inside run-tests. Thanks for the reminder.
from pyspark.rdd import RDD
from pyspark.mllib._common import _deserialize_double, _deserialize_double_vector
from pyspark.serializers import NoOpSerializer
Add a doc comment to this package similar to the one on RandomRDDGenerators.scala
QA tests have started for PR 1628. This patch merges cleanly. |
Btw |
QA results for PR 1628: |
@dorx I tried |
If you can't figure out whether this is possible, consider pinging Josh or Davies too. I'd be surprised if there's no way around this because there are a lot of top-level packages in Python. There's gotta be a way to import our own vs importing theirs. |
@JoshRosen If we don't support 2.5, could we use |
@mengxr Yeah, that would be okay to use but it turns out that it doesn't solve the
|
QA results for PR 1628: |
@@ -49,6 +49,12 @@
Main entry point for accessing data stored in Apache Hive..
"""


import sys
s = sys.path.pop(0)
We definitely need some comments here to explain what is going on.
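To illustrate what such comments might explain: the `sys.path.pop(0)` line looks like the start of a workaround that temporarily drops the first search-path entry so that an import of `random` resolves to the standard library rather than a sibling module of the same name. The sketch below assumes that intent. The helper name is hypothetical and this is not the code from the PR.

```python
import importlib
import sys


def import_skipping_first_path_entry(name):
    """Import `name` with the first sys.path entry temporarily removed.

    The first entry is usually the directory of the running script (or '').
    Removing it keeps a local file such as random.py from shadowing the
    standard-library module of the same name during this one import.
    """
    first = sys.path.pop(0)
    try:
        # Note: if `name` is already cached in sys.modules, the cached
        # module is returned and sys.path is not consulted at all.
        return importlib.import_module(name)
    finally:
        # Always restore the search path, even if the import fails.
        sys.path.insert(0, first)


stdlib_random = import_skipping_first_path_entry("random")
```

The `try`/`finally` keeps the path manipulation contained, so later imports in the same process see the original `sys.path`.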
and added docs for hacks that allow us to keep the module name mllib.random.
QA tests have started for PR 1628. This patch merges cleanly. |
>>> x = RandomRDDGenerators.normalRDD(sc, 1000, seed=1L).collect()
>>> from pyspark.statcounter import StatCounter
>>> stats = StatCounter(x)
stats = x.stats()
LGTM except minor inline comments. |
QA tests have started for PR 1628. This patch merges cleanly. |
QA results for PR 1628: |
QA results for PR 1628: |
Merged into master. Thanks!! |
RandomRDDGenerators but without support for randomRDD and randomVectorRDD, which take in arbitrary DistributionGenerator.

`randomRDD.py` is named to avoid collision with the built-in Python `random` package.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes apache#1628 from dorx/pythonRDD and squashes the following commits:

55c6de8 [Doris Xin] review comments. all python units passed.
f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
2d73917 [Doris Xin] fix for linalg.py
8663e6a [Doris Xin] reverting back to a single python file for random
f47c481 [Doris Xin] docs update
687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
4338f40 [Doris Xin] renamed randomRDD to rand and import as random
29d205e [Doris Xin] created mllib.random package
bd2df13 [Doris Xin] typos
07ddff2 [Doris Xin] units passed.
23b2ecd [Doris Xin] WIP