SPARK-1438 RDD.sample() make seed param optional #477

arun-rama · 2014-04-22T05:28:41Z

Copying from previous pull request #462.

It's probably better to let the underlying language implementation take care of the default. This was easier to do in Python, since the default value for seed in both random and numpy random is None.
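As a rough illustration of the Python side of this idea, here is a minimal sketch of a seed parameter that defaults to None and is passed straight through to the standard library. The helper name `sample_indices` is hypothetical and is not part of PySpark.

```python
import random


def sample_indices(n, fraction, seed=None):
    """Hypothetical helper: keep each index in range(n) with probability `fraction`."""
    # random.Random(None) seeds itself from OS entropy (or the clock as a
    # fallback), so callers who omit `seed` simply get Python's own default.
    rng = random.Random(seed)
    return [i for i in range(n) if rng.random() < fraction]


# An explicit seed is reproducible; omitting it defers to the language default.
assert sample_indices(100, 0.3, seed=7) == sample_indices(100, 0.3, seed=7)
unseeded = sample_indices(100, 0.3)
```

The point of the design is that the API layer never has to invent a default seed of its own.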

On the Scala/Java side, it might mean propagating an Option or null (oh no!) down the chain to where the Random is constructed. But it looks like the convention in some other methods was to use System.nanoTime, so I followed that convention.

This conflicts with the overloaded method in sql.SchemaRDD.sample, which also defines default params:
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded alternative of a method to define default params. I believe the author intended to override the RDD.sample method, not overload it, so I changed it.

If backward compatibility is important, three new methods can be introduced (without default params), like this:
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)

Added some tests for the Scala RDD takeSample method.


AmplabJenkins · 2014-04-22T05:32:57Z

Can one of the admins verify this patch?

arun-rama · 2014-04-22T05:43:11Z

@advancedxy If consistency is important, then I could set a default of long(time.time() * 1e9) in the RDDSampler (Python API) constructor, as you suggested.

mateiz · 2014-04-22T06:11:19Z

Hey, FYI, it's not a good idea to use System.nanoTime as the seed because multiple RDDs created at the same time (which can easily happen due to lazy evaluation) would have the exact same seed. Use math.random() instead, or the equivalent in PySpark. Math.random is synchronized as far as I know, which is bad for high-performance random number generation but good for getting distinct numbers here.

mateiz · 2014-04-22T06:11:36Z

Same thing applies in Python, don't use the current time, call their random function.
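A small Python sketch of the concern raised here, under the assumption that two samplers are constructed close enough together to read the same clock value. The constant `frozen_clock` is a made-up stand-in for that shared timestamp, and drawing seeds with `getrandbits` is just one possible way to "call their random function".

```python
import random

# Hypothetical scenario: two samplers are built lazily at the "same" time,
# so a clock-derived default hands both of them the same seed.
frozen_clock = 1398147079000000000  # stand-in for one shared timestamp reading

a = random.Random(frozen_clock)
b = random.Random(frozen_clock)
# Identical seeds mean identical streams, so both samplers pick the same rows.
same_stream = [a.random() for _ in range(5)] == [b.random() for _ in range(5)]

# Drawing each default seed from one shared generator avoids the collision:
# successive calls advance the generator and yield different 64-bit values.
seed_source = random.Random()
seed_a = seed_source.getrandbits(64)
seed_b = seed_source.getrandbits(64)
```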

mateiz · 2014-04-22T06:12:07Z

core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala

+      assert(sample.size === num)        // Got exactly num elements
+      assert(sample.toSet.size === num)  // Elements are distinct
+      assert(sample.forall(x => 1 <= x && x <= 100), "elements not in [1, 100]")
+    	}


Indenting seems off here, there seem to be some tabs.

will take care of the indent

advancedxy · 2014-04-22T06:15:09Z

seed: Int = (math.random * 1000).toInt)
Hi @mateiz, should we use Long instead of Int to avoid collisions?

mateiz · 2014-04-22T06:39:33Z

(math.random * 1000).toInt can only produce 1000 values, which is very few. You should use a much bigger number than 1000, e.g. 1e12, and then do toLong. Or you can create a static Random object somewhere and call nextLong on it.
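To put a number on "very few", the following sketch computes the birthday-problem chance that at least two default seeds coincide. The function name and the choice of 40 samplers are illustrative assumptions, not figures from this thread.

```python
def collision_probability(num_seeds, seed_space):
    """Chance that at least two of `num_seeds` uniformly drawn seeds coincide."""
    p_all_distinct = 1.0
    for i in range(num_seeds):
        # The (i+1)-th seed must avoid the i values already taken.
        p_all_distinct *= 1.0 - i / float(seed_space)
    return 1.0 - p_all_distinct


small = collision_probability(40, 1000)      # seed space of (math.random * 1000).toInt
large = collision_probability(40, 10 ** 12)  # seed space of roughly 1e12
```

With only 1000 possible seeds, 40 samplers are already more likely than not to share one, while a space of about 1e12 makes a collision negligible at that scale.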

mateiz · 2014-04-22T18:01:36Z

python/pyspark/rdd.py

@@ -381,13 +382,11 @@ def takeSample(self, withReplacement, num, seed):
        # If the first sample didn't turn out large enough, keep trying to take samples;
        # this shouldn't happen often because we use a big multiplier for their initial size.
        # See: scala/spark/RDD.scala
+        random.seed(seed)


Is this the global random object? Library code should not be setting the seed and then calling randint. Is there no equivalent of java.util.Random that you can create and use here?
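In Python, the closest analogue to creating a java.util.Random is probably an instance of `random.Random`. The sketch below uses a hypothetical helper, `take_random_ints`, to show that a private instance leaves the module-level generator untouched. It is not the code that was merged.

```python
import random


def take_random_ints(count, upper, seed=None):
    """Hypothetical helper: draw `count` ints in [0, upper] from a private generator."""
    rng = random.Random(seed)  # private instance; the global generator is untouched
    return [rng.randint(0, upper) for _ in range(count)]


# The caller's global stream is the same whether or not the helper runs.
random.seed(123)
expected = random.random()

random.seed(123)
take_random_ints(10, 1000, seed=5)
assert random.random() == expected
```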


arun-rama · 2014-04-23T06:12:48Z

Scala/Java: replaced System.nanoTime with a shared instance of Random in the Utils object.
Python: made use of an independent instance of Random instead of seeding the language API's global Random instance.

arun-rama · 2014-04-24T05:14:30Z

@mateiz @advancedxy The new commit covers all suggestions so far. Any thoughts?

advancedxy · 2014-04-24T05:39:09Z

Hi @smartnut007, I don't think I can make the call, so I didn't reply to the question of which one to use.
You should ask @mateiz!

---update----
Sorry, the above comment is about this question: "Can you guys let me know which one?"

advancedxy · 2014-04-24T05:53:48Z

python/pyspark/rddsampler.py

-            for _ in range(0, split):
-                # discard the next few values in the sequence to have a
-                # different seed for the different splits
-                self._random.randint(sys.maxint)
        else:
            import random


Since we have imported random at the beginning, this line is unnecessary.

arun-rama · 2014-04-24T06:32:30Z

OK, removed the redundant 'import random'.

mateiz · 2014-04-24T07:36:13Z

core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala

@@ -465,7 +465,13 @@ class RDDSuite extends FunSuite with SharedSparkContext {

  test("takeSample") {
    val data = sc.parallelize(1 to 100, 2)
-
+
+    for (num <- List(5,20,100)) {


Put spaces after the commas here

mateiz · 2014-04-24T07:44:50Z

Thanks for the changes, this looks pretty good now. Made a few small comments on it.

arun-rama · 2014-04-24T10:11:23Z

Great. Fixed the space formatting as suggested.

mateiz · 2014-04-24T22:17:27Z

Jenkins, test this please

AmplabJenkins · 2014-04-24T22:17:58Z

Build triggered.

AmplabJenkins · 2014-04-24T22:18:06Z

Build started.

AmplabJenkins · 2014-04-24T22:57:15Z

Build finished. All automated tests passed.

AmplabJenkins · 2014-04-24T22:57:15Z

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14452/

mateiz · 2014-04-25T00:29:43Z

Thanks Arun! I've merged this in now.

copying form previous pull request #462

Its probably better to let the underlying language implementation take care of the default . This was easier to do with python as the default value for seed in random and numpy random is None.

In Scala/Java side it might mean propagating an Option or null(oh no!) down the chain until where the Random is constructed. But, looks like the convention in some other methods was to use System.nanoTime. So, followed that convention.

Conflict with overloaded method in sql.SchemaRDD.sample which also defines default params.
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded to have default params. I believe the author intended to override the RDD.sample method and not overload it. So, changed it.

If backward compatible is important, 3 new method can be introduced (without default params) like this
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)

Added some tests for the scala RDD takeSample method.

Author: Arun Ramakrishnan <smartnut007@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #477 from smartnut007/master and squashes the following commits:

07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues
b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler
8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance.
69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue
0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample

(cherry picked from commit 35e3d19)
Signed-off-by: Matei Zaharia <matei@databricks.com>


arun-rama added 2 commits April 21, 2014 10:36

SPARK-1438 RDD language apis to support optional seed in RDD methods …

0c247db

…sample/takeSample

SPARK-1438 fix spacing issue

69619c6

arun-rama mentioned this pull request Apr 22, 2014

SPARK-1438 RDD make seed optional in RDD methods sam... #462

Closed

mateiz reviewed Apr 22, 2014

SPARK-1438 RDD . Replace System.nanoTime with a Random generated numb…

8d05b1a

…er. python: use a separate instance of Random instead of seeding language api global Random instance.

advancedxy reviewed Apr 24, 2014

SPARK-1438 removing redundant import of random in python rddsampler

b9ebfe2

mateiz reviewed Apr 24, 2014

SPARK-1438 fixing more space formatting issues

07bb06e

asfgit closed this in 35e3d19 Apr 25, 2014

markhamstra pushed a commit to markhamstra/spark that referenced this pull request Nov 7, 2017

Use paths to read small local files instead of URIs (apache#477)

bc845c3

mccheah pushed a commit to mccheah/spark that referenced this pull request Feb 14, 2019

Fix semantic merge conflicts (apache#477)

22bb5cc

Same change has been made to build due to deadlock in maven shadow plugin

bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019

Merge pull request apache#477 from mrhillsman/updatedefaultgoversion

c8d71d2

Update default go to latest - 1.12.1

arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020

MapR [SPARK-510] nonmapr "admin" users not able to view other user lo…

9a89ac8

…gs in SHS (apache#477)

RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Aug 15, 2022

KE-37052 translate boolean column to V2Predicate (apache#477)

7796f19
