[SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrame #12114

davies · 2016-04-01T20:40:04Z

What changes were proposed in this pull request?

RDD.toLocalIterator() can be used to fetch one partition at a time to reduce memory usage. Right now, for Dataset/DataFrame we have to use df.rdd.toLocalIterator, which is very slow and also requires a lot of memory (because of the Java serializer or even the Kryo serializer).

This PR introduces an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 million rows, df.rdd.toLocalIterator took about 100 seconds, but df.toLocalIterator took less than 7 seconds. For 10 million rows, df.rdd.toLocalIterator crashes (not enough memory) with a 4G heap, but df.toLocalIterator finished in 12 seconds.

The JDBC server has been updated to use DataFrame.toLocalIterator.
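
To make the memory argument concrete, here is a minimal pure-Python model of the difference between collecting everything and iterating one partition at a time. It is only a sketch, not Spark code: `make_partitions`, `collect_all`, and `to_local_iterator` are hypothetical names invented for this illustration, and the model ignores the serialization cost that this PR actually optimizes.

```python
from typing import Callable, Dict, Iterator, List

# Hypothetical stand-in for a partitioned dataset: each partition is a
# zero-argument function that materializes that partition's rows on demand.
Partition = Callable[[], List[int]]


def make_partitions(num_rows: int, num_partitions: int) -> List[Partition]:
    """Split range(num_rows) into at most num_partitions lazy, contiguous chunks."""
    size = max(1, -(-num_rows // num_partitions))  # ceiling division, at least 1

    def loader(start: int) -> Partition:
        return lambda: list(range(start, min(start + size, num_rows)))

    return [loader(start) for start in range(0, num_rows, size)]


def collect_all(partitions: List[Partition]) -> List[int]:
    """Model of collect(): every row of every partition is held at once."""
    rows: List[int] = []
    for load in partitions:
        rows.extend(load())
    return rows


def to_local_iterator(
    partitions: List[Partition], stats: Dict[str, int]
) -> Iterator[int]:
    """Model of toLocalIterator(): only one partition is materialized at a time.

    stats["peak_rows"] records the largest number of rows held at once.
    """
    for load in partitions:
        rows = load()  # fetch the next partition only when it is needed
        stats["peak_rows"] = max(stats["peak_rows"], len(rows))
        yield from rows


if __name__ == "__main__":
    parts = make_partitions(num_rows=10, num_partitions=4)
    stats = {"peak_rows": 0}
    total = sum(to_local_iterator(parts, stats))
    print("sum of rows:", total)
    print("peak rows held by the iterator:", stats["peak_rows"])
    print("rows held by collect_all:", len(collect_all(parts)))
```

In this model the iterator never holds more than one partition's rows, while `collect_all` holds all of them; the trade-off is that partitions are fetched sequentially rather than in one parallel job.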

How was this patch tested?

Existing tests.

davies · 2016-04-01T20:40:24Z

cc @marmbrus @rxin

rxin · 2016-04-01T20:43:50Z

hm we need to think about the api -- this is a scala iterator right? it's going to be problematic for java users.

maybe we can have toLocalScalaIterator and toLocalJavaIterator, or just have toLocalIterator returning a Java iterator, and scala users can easily do the implicit conversion or use asScala anyway.

davies · 2016-04-01T20:49:09Z

@rxin Good point, will use the Java iterator as public API.

SparkQA · 2016-04-01T21:02:37Z

Test build #54715 has finished for PR 12114 at commit b12097e.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-01T22:12:26Z

Test build #54714 has finished for PR 12114 at commit 52d7520.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-01T22:35:05Z

Test build #54716 has finished for PR 12114 at commit e62d35a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2016-04-02T05:49:01Z

cc @sameeragarwal for review

rxin · 2016-04-02T07:20:35Z

sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala

@@ -71,6 +71,7 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
    assert(ds.first() == item)
    assert(ds.take(1).head == item)
    assert(ds.takeAsList(1).get(0) == item)
+    assert(ds.toLocalIterator().next === item)


nit: next()

sameeragarwal · 2016-04-04T18:37:57Z

LGTM

SparkQA · 2016-04-04T20:21:11Z

Test build #54869 has finished for PR 12114 at commit 0f75cec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-04T20:42:52Z

Test build #54872 has finished for PR 12114 at commit 34b3f1c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

add toLocalIterator for Dataset

52d7520

davies force-pushed the local_iterator branch from b12097e to 75d2295 on April 1, 2016 20:58

use Java iterator

e62d35a

davies force-pushed the local_iterator branch from 75d2295 to e62d35a on April 1, 2016 21:00

rxin reviewed Apr 2, 2016

Davies Liu added 3 commits April 4, 2016 10:47

add Python API

a647abb

Merge branch 'master' of github.com:apache/spark into local_iterator

0f75cec

Update rdd.py

34b3f1c

asfgit closed this in cc70f17 Apr 4, 2016
