Skip to content

[SPARK-3519] add distinct(n) to PySpark #2383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

Conversation

mattf
Copy link

@mattf mattf commented Sep 13, 2014

Added missing rdd.distinct(numPartitions) and associated tests

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have started for PR 2383 at commit fcfc05e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2383 at commit fcfc05e.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaSparkContext(val sc: SparkContext)
    • throw new IllegalStateException("The main method in the given main class must be static")
    • class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception
    • class Dummy(object):
    • class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

@@ -586,6 +586,17 @@ def test_repartitionAndSortWithinPartitions(self):
self.assertEquals(partitions[0], [(0, 5), (0, 8), (2, 6)])
self.assertEquals(partitions[1], [(1, 3), (3, 8), (3, 8)])

def test_distinct(self):
rdd = self.sc.parallelize((1,2,3)*10).distinct()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Jenkins failed because the Python style checks didn't pass; the problem is that PEP8 requires whitespace after commas and space around operators like *.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, i forgot to run lint. i've updated the patch...

$ ./dev/lint-python
PEP 8 checks passed.

Added missing rdd.distinct(numPartitions) and associated tests
@SparkQA
Copy link

SparkQA commented Sep 14, 2014

QA tests have started for PR 2383 at commit 7a17f2b.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 14, 2014

QA tests have finished for PR 2383 at commit 7a17f2b.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RatingDeserializer(FramedSerializer):
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder[T <: NativeType](columnType: NativeColumnType[T]) extends compression.Encoder[T]
    • class Encoder extends compression.Encoder[IntegerType.type]
    • class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[IntegerType.type])
    • class Encoder extends compression.Encoder[LongType.type]
    • class Decoder(buffer: ByteBuffer, columnType: NativeColumnType[LongType.type])

rdd = self.sc.parallelize((1, 2, 3)*10).distinct()
self.assertEquals(rdd.count(), 3)

def test_distinct_numPartitions(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to put them into single test case, because each test cases will create a new jvm, which has some overhead. We should keep the number of test cases not increase too much.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can i have a pass? it looks like the python tests could use some attention during the test speed increase effort, but it'd rather wait for a big speedup recommendation before altering these cases.

though, if this is important to you, i'll do it

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW, this two cases are about distinct(), just for different cases, it's also better to put them together. I will really appreciate if you could do it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll do that now.

@SparkQA
Copy link

SparkQA commented Sep 15, 2014

QA tests have started for PR 2383 at commit 6bc4a2c.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 15, 2014

QA tests have finished for PR 2383 at commit 6bc4a2c.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 15, 2014

QA tests have started for PR 2383 at commit 30b837a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 15, 2014

QA tests have finished for PR 2383 at commit 30b837a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

This looks good to me, so I'm going to merge it. Thanks!

@asfgit asfgit closed this in 9d5fa76 Sep 16, 2014
@mattf mattf deleted the SPARK-3519 branch September 16, 2014 18:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants