Skip to content

[SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrame #12114

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

davies
Copy link
Contributor

@davies davies commented Apr 1, 2016

What changes were proposed in this pull request?

RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer).

This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, df.rdd.toIterator took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds.

The JDBC server has been updated to use DataFrame.toIterator.

How was this patch tested?

Existing tests.

@davies
Copy link
Contributor Author

davies commented Apr 1, 2016

cc @marmbrus @rxin

@rxin
Copy link
Contributor

rxin commented Apr 1, 2016

hm we need to think about the api -- this is a scala iterator right? it's going to be problematic for java users.

maybe we can have toLocalScalaIterator and toLocalJavaIterator, or just have toLocalIterator returning a Java iterator, and scala users can easily do the implicit conversion or using asScala anyway.

@davies
Copy link
Contributor Author

davies commented Apr 1, 2016

@rxin Good point, will use the Java iterator as public API.

@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #54715 has finished for PR 12114 at commit b12097e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #54714 has finished for PR 12114 at commit 52d7520.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 1, 2016

Test build #54716 has finished for PR 12114 at commit e62d35a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Apr 2, 2016

cc @sameeragarwal for review

@@ -71,6 +71,7 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
assert(ds.first() == item)
assert(ds.take(1).head == item)
assert(ds.takeAsList(1).get(0) == item)
assert(ds.toLocalIterator().next === item)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: next()

@sameeragarwal
Copy link
Member

LGTM

@SparkQA
Copy link

SparkQA commented Apr 4, 2016

Test build #54869 has finished for PR 12114 at commit 0f75cec.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@asfgit asfgit closed this in cc70f17 Apr 4, 2016
@SparkQA
Copy link

SparkQA commented Apr 4, 2016

Test build #54872 has finished for PR 12114 at commit 34b3f1c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants