[SPARK-2724] Python version of RandomRDDGenerators #1628

dorx · 2014-07-29T02:09:55Z

Python version of RandomRDDGenerators, but without support for randomRDD and randomVectorRDD, which take an arbitrary DistributionGenerator.

randomRDD.py is named to avoid collision with the built-in Python random package.

SparkQA · 2014-07-29T02:13:51Z

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17329/consoleFull

SparkQA · 2014-07-29T03:01:32Z

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17329/consoleFull

mengxr · 2014-07-29T04:54:49Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

+  /**
+   * Java stub for Python mllib RandomRDDGenerators.poissonVectorRDD()
+   */
+  def poissonVectorRDD(jsc: JavaSparkContext,mean: Double,


Move mean: Double to the next line.

mengxr · 2014-07-29T05:08:24Z

LGTM except minor inline comments. For the file name, it should be possible to have a package named random, for example, numpy.random: http://docs.scipy.org/doc/numpy/reference/routines.random.html

dorx · 2014-07-29T19:26:08Z

In NumPy's source, they had a directory named random: https://github.com/numpy/numpy/tree/master/numpy/random
It seems like having directory hierarchy is the only way to organize packages:
https://docs.python.org/2/tutorial/modules.html#packages
In the flat structure that we have right now, naming a file random.py would shadow Python's built-in random module. We could break the flat structure we currently have, rename the package to something else in Scala (although I don't know of a good alternative), or name the file something other than random. I'm okay with any of these options.
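To make the collision concrete, here is a minimal sketch in modern Python 3 (not the Python 2 of this PR) showing that a random.py found earlier on the module search path is picked over the standard library's. The temporary directory stands in for a flat package directory, and `resolve` is a hypothetical helper introduced only for illustration. In Python 2 the same shadowing also happened through implicit relative imports inside a package.

```python
import importlib.machinery
import os
import sys
import tempfile


def resolve(module_name, search_path):
    """Return the file that `import module_name` would load from search_path."""
    spec = importlib.machinery.PathFinder.find_spec(module_name, search_path)
    return spec.origin if spec is not None else None


with tempfile.TemporaryDirectory() as flat_dir:
    # Hypothetical stand-in for a flat package directory that contains random.py.
    local_random = os.path.join(flat_dir, "random.py")
    with open(local_random, "w") as f:
        f.write("# not the standard library module\n")

    # With flat_dir first on the search path, the local file wins.
    shadowed = resolve("random", [flat_dir] + sys.path)
    shadowed_is_local = os.path.samefile(shadowed, local_random)

    # Without it, the lookup falls through to the standard library.
    stdlib = resolve("random", [p for p in sys.path if p != flat_dir])

print("local file shadows stdlib:", shadowed_is_local)
print("stdlib random found at:", stdlib)
```

`PathFinder.find_spec` is used instead of a plain import because it searches the given path directly, so the result does not depend on whether random was already imported.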

SparkQA · 2014-07-29T19:28:57Z

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17370/consoleFull

mengxr · 2014-07-29T19:57:09Z

Yes, having directories is the way to organize packages in Python. We can make a folder for random and include the Python files in mllib/pom.xml. Otherwise, users need to write from pyspark.mllib.randomRDD import RandomRDDGenerators, which is a little strange.

SparkQA · 2014-07-29T20:19:10Z

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17370/consoleFull

SparkQA · 2014-07-29T20:49:03Z

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17377/consoleFull

SparkQA · 2014-07-29T21:39:19Z

QA results for PR 1628:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17377/consoleFull

mateiz · 2014-07-30T02:05:48Z

python/pyspark/mllib/random/RandomRDDGenerators.py

+    return long(getrandbits(63))
+
+
+def _test():


For these tests to run automatically, you also need to add this file into the python/run-tests script. Otherwise it won't automatically discover it, e.g. in Jenkins.

Yep caught that while looking inside run-tests. Thanks for the reminder.

mateiz · 2014-07-30T02:07:03Z

python/pyspark/mllib/random/RandomRDDGenerators.py

+from pyspark.rdd import RDD
+from pyspark.mllib._common import _deserialize_double, _deserialize_double_vector
+from pyspark.serializers import NoOpSerializer
+


Add a doc comment to this package similar to the one on RandomRDDGenerators.scala

SparkQA · 2014-07-30T02:23:58Z

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17407/consoleFull

dorx · 2014-07-30T02:43:40Z

Btw from pyspark.mllib import random now works with the latest commit in the pyspark shell.

SparkQA · 2014-07-30T03:10:09Z

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17407/consoleFull

mengxr · 2014-07-30T05:11:49Z

@dorx I tried import pyspark.mllib.random and it failed. It has to be from pyspark.mllib import random. And to use RandomRDDGenerators, I need to call random.RandomRDDGenerators. Ideally, it should be from pyspark.mllib.random import RandomRDDGenerators. If we know how to handle the name random now, maybe we can create random.py under mllib and define class RandomRDDGenerators there. If it is not easy to do that because of python's own random package, it should be fine to rename the package name to rand in both Python and Scala.
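One possible explanation for the difference, sketched under an assumption: if the package's `__init__.py` at that commit bound the name random to a module stored under another file name (commit 4338f40 is titled "renamed randomRDD to rand and import as random"), then random would be an attribute of the package rather than a real submodule. The package name `alias_demo` and its contents below are hypothetical, and the sketch uses Python 3 syntax.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    pkg_dir = os.path.join(root, "alias_demo")
    os.mkdir(pkg_dir)
    # The implementation lives in rand.py; __init__.py only aliases it as `random`.
    with open(os.path.join(pkg_dir, "rand.py"), "w") as f:
        f.write("class RandomRDDGenerators(object):\n    pass\n")
    with open(os.path.join(pkg_dir, "__init__.py"), "w") as f:
        f.write("from . import rand as random\n")

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        # Works: `random` is looked up as an attribute of the package.
        from alias_demo import random as aliased
        attribute_import_ok = aliased.__name__ == "alias_demo.rand"

        # Fails: there is no alias_demo/random.py for the import system to find.
        try:
            importlib.import_module("alias_demo.random")
            dotted_import_ok = True
        except ImportError:
            dotted_import_ok = False
    finally:
        sys.path.remove(root)

print("from alias_demo import random works:", attribute_import_ok)
print("import alias_demo.random works:", dotted_import_ok)
```

Under that assumption, a real random.py (or a random/ directory) inside the package is what would make the dotted form importable.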

mateiz · 2014-07-30T05:36:23Z

If you can't figure out whether this is possible, consider pinging Josh or Davies too. I'd be surprised if there's no way around this because there are a lot of top-level packages in Python. There's gotta be a way to import our own vs importing theirs.

mengxr · 2014-07-30T22:51:03Z

@JoshRosen If we don't support 2.5, could we use from __future__ import absolute_import?

JoshRosen · 2014-07-30T22:59:07Z

@mengxr Yeah, that would be okay to use but it turns out that it doesn't solve the linalg.py problem.

from __future__ import absolute_import enables absolute imports, but only in the file that contains it. In the failing linalg.py example, the problematic import random occurred in third-party code. If they added from __future__ import absolute_import, things would work fine.
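A small sketch of the per-file scope described above, using a hypothetical throwaway package `absimport_demo` that has its own random.py. Under Python 3, absolute imports are already the default, so the `__future__` line is a no-op there; under Python 2 it is what makes `import random` skip the sibling module, and only in the file that declares it.

```python
import importlib
import os
import sys
import tempfile

FILES = {
    "__init__.py": "",
    # A sibling module that shares its name with the standard library's random.
    "random.py": "WHO = 'package-local random'\n",
    # The future import must appear in every file that wants absolute imports.
    "user.py": (
        "from __future__ import absolute_import\n"
        "import random as top_level\n"
        "from . import random as sibling\n"
    ),
}

with tempfile.TemporaryDirectory() as root:
    pkg_dir = os.path.join(root, "absimport_demo")
    os.mkdir(pkg_dir)
    for name, body in FILES.items():
        with open(os.path.join(pkg_dir, name), "w") as f:
            f.write(body)

    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        user = importlib.import_module("absimport_demo.user")
    finally:
        sys.path.remove(root)

# The plain import resolves to the standard library, the explicit
# relative import resolves to the sibling file.
print(user.top_level.__name__)
print(user.sibling.__name__)
```

Third-party code that lacks the declaration is unaffected by adding it elsewhere, which is why it would not help with an import that happens in someone else's file.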

SparkQA · 2014-07-30T23:33:45Z

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17498/consoleFull

mengxr · 2014-07-31T17:53:16Z

python/pyspark/__init__.py

@@ -49,6 +49,12 @@
      Main entry point for accessing data stored in Apache Hive..
 """

+
+import sys
+s = sys.path.pop(0)


We definitely need some comments here to explain what is going on.
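For readers wondering what such a block might do, here is a hedged sketch of the general trick, not the patch's actual code beyond the two lines quoted above: temporarily drop `sys.path[0]` (the script's directory when a file is run directly) so that an import resolves against the remaining entries, then restore it. The helper name `import_skipping_first_path_entry` is hypothetical.

```python
import importlib
import sys


def import_skipping_first_path_entry(module_name):
    """Import module_name while sys.path[0] is temporarily removed.

    When a file is run as a script, sys.path[0] is that script's directory,
    so a sibling random.py there would otherwise be found first.
    """
    first = sys.path.pop(0)
    try:
        return importlib.import_module(module_name)
    finally:
        # Always restore the entry so later imports behave as before.
        sys.path.insert(0, first)


before = list(sys.path)
stdlib_random = import_skipping_first_path_entry("random")
after = list(sys.path)

print("sys.path restored:", before == after)
print("imported module:", stdlib_random.__name__)
```

Note that a module already present in `sys.modules` is returned from that cache regardless of `sys.path`, so a trick like this only matters the first time the name is imported in a process.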


SparkQA · 2014-07-31T20:59:12Z

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17606/consoleFull

mengxr · 2014-07-31T21:16:21Z

python/pyspark/mllib/random.py

+
+        >>> x = RandomRDDGenerators.normalRDD(sc, 1000, seed=1L).collect()
+        >>> from pyspark.statcounter import StatCounter
+        >>> stats = StatCounter(x)


stats = x.stats()

mengxr · 2014-07-31T21:22:52Z

LGTM except minor inline comments.

SparkQA · 2014-07-31T21:39:02Z

QA tests have started for PR 1628. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17610/consoleFull

SparkQA · 2014-07-31T21:51:24Z

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17606/consoleFull

SparkQA · 2014-07-31T22:33:12Z

QA results for PR 1628:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17610/consoleFull

mengxr · 2014-08-01T03:34:44Z

Merged into master. Thanks!!

RandomRDDGenerators but without support for randomRDD and randomVectorRDD, which take in arbitrary DistributionGenerator.

`randomRDD.py` is named to avoid collision with the built-in Python `random` package.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes apache#1628 from dorx/pythonRDD and squashes the following commits:

55c6de8 [Doris Xin] review comments. all python units passed.
f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
2d73917 [Doris Xin] fix for linalg.py
8663e6a [Doris Xin] reverting back to a single python file for random
f47c481 [Doris Xin] docs update
687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
4338f40 [Doris Xin] renamed randomRDD to rand and import as random
29d205e [Doris Xin] created mllib.random package
bd2df13 [Doris Xin] typos
07ddff2 [Doris Xin] units passed.
23b2ecd [Doris Xin] WIP

dorx added 2 commits July 28, 2014 15:32

WIP

23b2ecd

units passed.

07ddff2

mengxr reviewed Jul 29, 2014

typos

bd2df13

created mllib.random package

29d205e

renamed randomRDD to rand and import as random

4338f40

mateiz reviewed Jul 30, 2014

add RandomRDDGenerators.py to run-tests

687aac0

mateiz reviewed Jul 30, 2014

docs update

f47c481

mengxr reviewed Jul 31, 2014

moved default args logic into PythonMLLibAPI

f831d9b

and added docs for hacks that allow us to keep the module name mllib.random.

mengxr reviewed Jul 31, 2014

review comments. all python units passed.

55c6de8

asfgit closed this in d843014 Aug 1, 2014

JoshRosen mentioned this pull request Aug 25, 2014

[PySpark][Streaming][SPARK-2377] Python API for Spark Streaming tdas/spark#11

Closed

JoshRosen mentioned this pull request Sep 26, 2014

[SPARK-2377] Python API for Streaming #2538

Closed

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023

rdar://102226950 (Release ADT 1.1.13) (apache#1628)

2b43413
