Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SPARK-1438 RDD.sample() make seed param optional #477

Closed
wants to merge 5 commits into from

Conversation

arun-rama
Copy link
Contributor

copying form previous pull request #462

Its probably better to let the underlying language implementation take care of the default . This was easier to do with python as the default value for seed in random and numpy random is None.

In Scala/Java side it might mean propagating an Option or null(oh no!) down the chain until where the Random is constructed. But, looks like the convention in some other methods was to use System.nanoTime. So, followed that convention.

Conflict with overloaded method in sql.SchemaRDD.sample which also defines default params.
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded to have default params. I believe the author intended to override the RDD.sample method and not overload it. So, changed it.

If backward compatible is important, 3 new method can be introduced (without default params) like this
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)

Added some tests for the scala RDD takeSample method.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@arun-rama
Copy link
Contributor Author

@advancedxy If consistency is important then I could set default of long(time.time() * 1e9) in RDDSampler (python api) constructor like you suggested.

@mateiz
Copy link
Contributor

mateiz commented Apr 22, 2014

Hey, FYI, it's not a good idea to use System.nanoTime as the seed because multiple RDDs created at the same time (which can easily happen due to lazy evaluation) would have the exact same seed. Use math.random() instead, or the equivalent in PySpark. Math.random is synchronized as far as I know, which is bad for high-performance random number generation but good for getting distinct numbers here.

@mateiz
Copy link
Contributor

mateiz commented Apr 22, 2014

Same thing applies in Python, don't use the current time, call their random function.

assert(sample.size === num) // Got exactly num elements
assert(sample.toSet.size === num) // Elements are distinct
assert(sample.forall(x => 1 <= x && x <= 100), "elements not in [1, 100]")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indenting seems off here, there seem to be some tabs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will take care of the indent

@advancedxy
Copy link
Contributor

seed: Int = (math.random * 1000).toInt)
hi, @mateiz should we use Long instead of Int to avoid collision.

@mateiz
Copy link
Contributor

mateiz commented Apr 22, 2014

(math.random * 1000).toInt can only produce 1000 values, which is very few. You should use a much bigger number than 1000, e.g. 1e12, and then do toLong. Or you can create a static Random object somewhere and call nextLong on it.

@@ -381,13 +382,11 @@ def takeSample(self, withReplacement, num, seed):
# If the first sample didn't turn out large enough, keep trying to take samples;
# this shouldn't happen often because we use a big multiplier for their initial size.
# See: scala/spark/RDD.scala
random.seed(seed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the global random object? Library code should not be setting the seed and then calling randint. Is there no equivalent of java.util.Random that you can create and use here?

…er. python: use a separate instance of Random instead of seeding language api global Random instance.
@arun-rama
Copy link
Contributor Author

scala/java: Replaced System.nanoTime shared instance of Random in Utils object.
python: Made use of an independent instance of Random instead of seeding language api global Random instance.

@arun-rama
Copy link
Contributor Author

@mateiz @advancedxy new commit covers all suggestions so far. Any thoughts ?

@advancedxy
Copy link
Contributor

hi, @smartnut007, I don't think I could make the call. So, I didn't reply the question which one to use.
should ask @mateiz !

---update----
sorry, the above comment is about this question "Can you guys let me know which one ?"

for _ in range(0, split):
# discard the next few values in the sequence to have a
# different seed for the different splits
self._random.randint(sys.maxint)
else:
import random
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since, we have imported random at the beginning. This line is unnecessary.

@arun-rama
Copy link
Contributor Author

ok. removed redundant 'import random'

@@ -465,7 +465,13 @@ class RDDSuite extends FunSuite with SharedSparkContext {

test("takeSample") {
val data = sc.parallelize(1 to 100, 2)


for (num <- List(5,20,100)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put spaces after the commas here

@mateiz
Copy link
Contributor

mateiz commented Apr 24, 2014

Thanks for the changes, this looks pretty good now. Made a few small comments on it.

@arun-rama
Copy link
Contributor Author

Great. Fixed the space formatting as suggested.

@mateiz
Copy link
Contributor

mateiz commented Apr 24, 2014

Jenkins, test this please

@AmplabJenkins
Copy link

Build triggered.

@AmplabJenkins
Copy link

Build started.

@AmplabJenkins
Copy link

Build finished. All automated tests passed.

@AmplabJenkins
Copy link

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14452/

@mateiz
Copy link
Contributor

mateiz commented Apr 25, 2014

Thanks Arun! I've merged this in now.

asfgit pushed a commit that referenced this pull request Apr 25, 2014
copying form previous pull request #462

Its probably better to let the underlying language implementation take care of the default . This was easier to do with python as the default value for seed in random and numpy random is None.

In Scala/Java side it might mean propagating an Option or null(oh no!) down the chain until where the Random is constructed. But, looks like the convention in some other methods was to use System.nanoTime. So, followed that convention.

Conflict with overloaded method in sql.SchemaRDD.sample which also defines default params.
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded to have default params. I believe the author intended to override the RDD.sample method and not overload it. So, changed it.

If backward compatible is important, 3 new method can be introduced (without default params) like this
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)

Added some tests for the scala RDD takeSample method.

Author: Arun Ramakrishnan <smartnut007@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #477 from smartnut007/master and squashes the following commits:

07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues
b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler
8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance.
69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue
0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample

(cherry picked from commit 35e3d19)
Signed-off-by: Matei Zaharia <matei@databricks.com>
@asfgit asfgit closed this in 35e3d19 Apr 25, 2014
pwendell added a commit to pwendell/spark that referenced this pull request May 12, 2014
Handful of 0.9 fixes

This patch addresses a few fixes for Spark 0.9.0 based on the last release candidate.

@mridulm gets credit for reporting most of the issues here. Many of the fixes here are based on his work in apache#477 and follow up discussion with him.
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
copying form previous pull request apache#462

Its probably better to let the underlying language implementation take care of the default . This was easier to do with python as the default value for seed in random and numpy random is None.

In Scala/Java side it might mean propagating an Option or null(oh no!) down the chain until where the Random is constructed. But, looks like the convention in some other methods was to use System.nanoTime. So, followed that convention.

Conflict with overloaded method in sql.SchemaRDD.sample which also defines default params.
sample(fraction, withReplacement=false, seed=math.random)
Scala does not allow more than one overloaded to have default params. I believe the author intended to override the RDD.sample method and not overload it. So, changed it.

If backward compatible is important, 3 new method can be introduced (without default params) like this
sample(fraction)
sample(fraction, withReplacement)
sample(fraction, withReplacement, seed)

Added some tests for the scala RDD takeSample method.

Author: Arun Ramakrishnan <smartnut007@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes apache#477 from smartnut007/master and squashes the following commits:

07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues
b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler
8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance.
69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue
0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Jan 8, 2015
Handful of 0.9 fixes

This patch addresses a few fixes for Spark 0.9.0 based on the last release candidate.

@mridulm gets credit for reporting most of the issues here. Many of the fixes here are based on his work in apache#477 and follow up discussion with him.
(cherry picked from commit 77b986f)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
markhamstra pushed a commit to markhamstra/spark that referenced this pull request Nov 7, 2017
mccheah pushed a commit to mccheah/spark that referenced this pull request Feb 14, 2019
Same change has been made to build due to deadlock in maven shadow plugin
bzhaoopenstack pushed a commit to bzhaoopenstack/spark that referenced this pull request Sep 11, 2019
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020
RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Aug 15, 2022
RolatZhang pushed a commit to RolatZhang/spark that referenced this pull request Aug 15, 2022
…#479)

* Revert "KE-37052 translate boolean column to V2Predicate (apache#477)"

This reverts commit 7796f19.

* KE-37052 translate boolean column to V2Predicate (apache#476)

* KE-37052 translate boolean column to V2Predicate

* update spark version
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants