-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-3519] add distinct(n) to PySpark #2383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
QA tests have started for PR 2383 at commit
|
QA tests have finished for PR 2383 at commit
|
@@ -586,6 +586,17 @@ def test_repartitionAndSortWithinPartitions(self): | |||
self.assertEquals(partitions[0], [(0, 5), (0, 8), (2, 6)]) | |||
self.assertEquals(partitions[1], [(1, 3), (3, 8), (3, 8)]) | |||
|
|||
def test_distinct(self): | |||
rdd = self.sc.parallelize((1,2,3)*10).distinct() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Jenkins failed because the Python style checks didn't pass; the problem is that PEP8 requires whitespace after commas and space around operators like *
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, i forgot to run lint. i've updated the patch...
$ ./dev/lint-python
PEP 8 checks passed.
Added missing rdd.distinct(numPartitions) and associated tests
QA tests have started for PR 2383 at commit
|
QA tests have finished for PR 2383 at commit
|
rdd = self.sc.parallelize((1, 2, 3)*10).distinct() | ||
self.assertEquals(rdd.count(), 3) | ||
|
||
def test_distinct_numPartitions(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to put them into single test case, because each test cases will create a new jvm, which has some overhead. We should keep the number of test cases not increase too much.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can i have a pass? it looks like the python tests could use some attention during the test speed increase effort, but it'd rather wait for a big speedup recommendation before altering these cases.
though, if this is important to you, i'll do it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, this two cases are about distinct(), just for different cases, it's also better to put them together. I will really appreciate if you could do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'll do that now.
QA tests have started for PR 2383 at commit
|
QA tests have finished for PR 2383 at commit
|
QA tests have started for PR 2383 at commit
|
QA tests have finished for PR 2383 at commit
|
This looks good to me, so I'm going to merge it. Thanks! |
Added missing rdd.distinct(numPartitions) and associated tests