-
Notifications
You must be signed in to change notification settings - Fork 28.5k
[SPARK-4477] [PySpark] remove numpy from RDDSampler #3351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…into numpy Conflicts: python/pyspark/rddsampler.py
Test build #23574 has started for PR 3351 at commit
|
Test build #23574 has finished for PR 3351 at commit
|
Test PASSed. |
This issue was discussed in #2313 and #3193 . I support this change because it simplifies the implementation and eliminates the concerns in #2313 and the bug fixed in #2889 . Though I haven't tested python's random, I'm not worried about its quality. Both python and numpy implement MT19937. The only downside I can see is the performance regression during sampling, but it is usually not the bottleneck of a job. |
for _ in range(0, count): | ||
yield key, val | ||
else: | ||
for key, val in iterator: | ||
if self.getUniformSample(split) <= self._fractions[key]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using equal for float does not make sense.
Test build #23616 has started for PR 3351 at commit
|
Test build #23616 has finished for PR 3351 at commit
|
Test FAILed. |
test this please |
Test build #23680 has started for PR 3351 at commit
|
Test build #23680 has finished for PR 3351 at commit
|
Test PASSed. |
make poisson sampling slightly faster
Test build #23685 has started for PR 3351 at commit
|
micro benchmark for possion (with fraction=0.1): old one:
new one:
So it's about two times faster. |
Test build #23686 has started for PR 3351 at commit
|
@mengxr I had updated the test result, now it's as fast as before (small difference). |
LGTM. Waiting for Jenkins ... |
Test build #23685 has finished for PR 3351 at commit
|
Test PASSed. |
Test build #23686 has finished for PR 3351 at commit
|
Test PASSed. |
Merged into master and branch-1.2. Thanks! |
In RDDSampler, it try use numpy to gain better performance for possion(), but the number of call of random() is only (1+faction) * N in the pure python implementation of possion(), so there is no much performance gain from numpy. numpy is not a dependent of pyspark, so it maybe introduce some problem, such as there is no numpy installed in slaves, but only installed master, as reported in SPARK-927. It also complicate the code a lot, so we may should remove numpy from RDDSampler. I also did some benchmark to verify that: ``` >>> from pyspark.mllib.random import RandomRDDs >>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache() >>> rdd.count() # cache it >>> rdd.sample(True, 0.9).count() # measure this line ``` the results: |withReplacement | random | numpy.random | ------- | ------------ | ------- |True | 1.5 s| 1.4 s| |False| 0.6 s | 0.8 s| closes apache#2313 Note: this patch including some commits that not mirrored to github, it will be OK after it catches up. Author: Davies Liu <davies@databricks.com> Author: Xiangrui Meng <meng@databricks.com> Closes apache#3351 from davies/numpy and squashes the following commits: 5c438d7 [Davies Liu] fix comment c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477 98eb31b [Xiangrui Meng] make poisson sampling slightly faster ee17d78 [Davies Liu] remove = for float 13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy f583023 [Davies Liu] fix tests 51649f5 [Davies Liu] remove numpy in RDDSampler 78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain f5fdf63 [Davies Liu] fix bug with int in weights 4dfa2cd [Davies Liu] refactor f866bcf [Davies Liu] remove unneeded change c7a2007 [Davies Liu] switch to python implementation 95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit 0d9b256 [Davies Liu] refactor 1715ee3 [Davies Liu] address comments 41fce54 [Davies Liu] randomSplit() (cherry picked from commit d39f2e9) Signed-off-by: Xiangrui Meng <meng@databricks.com>
In RDDSampler, it try use numpy to gain better performance for possion(), but the number of call of random() is only (1+faction) * N in the pure python implementation of possion(), so there is no much performance gain from numpy.
numpy is not a dependent of pyspark, so it maybe introduce some problem, such as there is no numpy installed in slaves, but only installed master, as reported in SPARK-927.
It also complicate the code a lot, so we may should remove numpy from RDDSampler.
I also did some benchmark to verify that:
the results:
closes #2313
Note: this patch including some commits that not mirrored to github, it will be OK after it catches up.