SPARK-1438 RDD.sample() make seed param optional #477
Conversation
Can one of the admins verify this patch?
@advancedxy If consistency is important, then I could set a default of long(time.time() * 1e9) in the RDDSampler (Python API) constructor like you suggested.
Hey, FYI, it's not a good idea to use System.nanoTime as the seed, because multiple RDDs created at the same time (which can easily happen due to lazy evaluation) would have the exact same seed. Use Math.random() instead, or the equivalent in PySpark. Math.random is synchronized as far as I know, which is bad for high-performance random number generation but good for getting distinct numbers here.
The same thing applies in Python: don't use the current time, call its random function.
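The concern above can be sketched in a few lines (a hypothetical illustration, not the patch itself): two generators seeded from the same clock reading produce identical streams, while seeding each one from an independent random draw keeps them distinct.

```python
import random
import time

# Two "RDDs" created in the same instant get the same clock-derived seed,
# so their sample streams are identical -- the collision Matei describes.
seed = int(time.time() * 1e9)
a = random.Random(seed)
b = random.Random(seed)  # lazy evaluation can easily reuse the same clock value
assert [a.random() for _ in range(3)] == [b.random() for _ in range(3)]

# Seeding each generator from an independent random draw avoids the collision.
c = random.Random(random.random())
d = random.Random(random.random())
assert [c.random() for _ in range(3)] != [d.random() for _ in range(3)]
```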
assert(sample.size === num) // Got exactly num elements
assert(sample.toSet.size === num) // Elements are distinct
assert(sample.forall(x => 1 <= x && x <= 100), "elements not in [1, 100]")
}
Indenting seems off here, there seem to be some tabs.
Will take care of the indent.
@@ -381,13 +382,11 @@ def takeSample(self, withReplacement, num, seed):
# If the first sample didn't turn out large enough, keep trying to take samples;
# this shouldn't happen often because we use a big multiplier for their initial size.
# See: scala/spark/RDD.scala
random.seed(seed) |
Is this the global random object? Library code should not be setting the seed and then calling randint. Is there no equivalent of java.util.Random that you can create and use here?
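Python does have an equivalent of java.util.Random: the random.Random class. A minimal sketch of the pattern the reviewer is asking for (the Sampler class name here is hypothetical, for illustration only):

```python
import random

class Sampler:  # hypothetical name, illustrating the reviewer's suggestion
    def __init__(self, seed=None):
        # Each sampler owns a private generator, much like java.util.Random.
        # random.Random(None) seeds itself, so the seed stays optional, and
        # the module-level global generator is never touched.
        self._random = random.Random(seed)

    def draw(self, n):
        return [self._random.random() for _ in range(n)]

s1, s2 = Sampler(seed=7), Sampler(seed=7)
assert s1.draw(3) == s2.draw(3)  # same seed, reproducible stream
```

Seeding the module-level generator with random.seed(seed) would instead mutate global state shared by every caller of the random module, which is why it is discouraged in library code.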
…er. python: use a separate instance of Random instead of seeding the language API's global Random instance.
scala/java: replaced System.nanoTime with a shared instance of Random in the Utils object.
@mateiz @advancedxy the new commit covers all suggestions so far. Any thoughts?
Hi @smartnut007, I don't think I could make that call, so I didn't reply to the question of which one to use. ---update----
for _ in range(0, split):
    # discard the next few values in the sequence to have a
    # different seed for the different splits
    self._random.randint(0, sys.maxint)
else:
    import random
Since we have imported random at the beginning, this line is unnecessary.
OK, removed the redundant 'import random'.
@@ -465,7 +465,13 @@ class RDDSuite extends FunSuite with SharedSparkContext {
test("takeSample") {
val data = sc.parallelize(1 to 100, 2)
for (num <- List(5,20,100)) { |
Put spaces after the commas here.
Thanks for the changes, this looks pretty good now. Made a few small comments on it.
Great. Fixed the space formatting as suggested.
Jenkins, test this please |
Build triggered. |
Build started. |
Build finished. All automated tests passed. |
Thanks Arun! I've merged this in now. |
copying form previous pull request #462 Its probably better to let the underlying language implementation take care of the default . This was easier to do with python as the default value for seed in random and numpy random is None. In Scala/Java side it might mean propagating an Option or null(oh no!) down the chain until where the Random is constructed. But, looks like the convention in some other methods was to use System.nanoTime. So, followed that convention. Conflict with overloaded method in sql.SchemaRDD.sample which also defines default params. sample(fraction, withReplacement=false, seed=math.random) Scala does not allow more than one overloaded to have default params. I believe the author intended to override the RDD.sample method and not overload it. So, changed it. If backward compatible is important, 3 new method can be introduced (without default params) like this sample(fraction) sample(fraction, withReplacement) sample(fraction, withReplacement, seed) Added some tests for the scala RDD takeSample method. Author: Arun Ramakrishnan <smartnut007@gmail.com> This patch had conflicts when merged, resolved by Committer: Matei Zaharia <matei@databricks.com> Closes #477 from smartnut007/master and squashes the following commits: 07bb06e [Arun Ramakrishnan] SPARK-1438 fixing more space formatting issues b9ebfe2 [Arun Ramakrishnan] SPARK-1438 removing redundant import of random in python rddsampler 8d05b1a [Arun Ramakrishnan] SPARK-1438 RDD . Replace System.nanoTime with a Random generated number. python: use a separate instance of Random instead of seeding language api global Random instance. 69619c6 [Arun Ramakrishnan] SPARK-1438 fix spacing issue 0c247db [Arun Ramakrishnan] SPARK-1438 RDD language apis to support optional seed in RDD methods sample/takeSample (cherry picked from commit 35e3d19) Signed-off-by: Matei Zaharia <matei@databricks.com>