[SPARK-27778][PYTHON] Fix toPandas conversion of empty DataFrame with Arrow enabled #24650

dvogelbacher · 2019-05-20T14:12:27Z

What changes were proposed in this pull request?

#22275 introduced a performance improvement where partitions are sent to Python out of order and, as a last step, the partition order is sent as well.
However, if there are no partitions, the partition order is never sent and the Python side fails with an "EOFError".
This PR fixes this by also sending the partition order when there are no partitions.
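
The failure mode can be illustrated with a small, Spark-free sketch. Everything here is an assumption made for illustration: the framing (length-prefixed payloads, an `END_OF_BATCHES` marker, then a list of partition indices) and the names `send_partitions`, `receive_partitions`, and `round_trip` are hypothetical and are not the actual Spark/PySpark wire format or API. The sketch only shows why a reader that always expects a trailing partition order hits `EOFError` when the writer skips that trailer for zero partitions.

```python
import io
import struct

END_OF_BATCHES = -1  # hypothetical marker, not Spark's real framing


def write_int(value, stream):
    stream.write(struct.pack("!i", value))


def read_int(stream):
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError("stream ended before an int could be read")
    return struct.unpack("!i", data)[0]


def send_partitions(arrivals, stream, always_send_order):
    """Write (partition_index, payload) pairs in arrival order, then the order.

    With always_send_order=False the trailer is written only if at least one
    partition arrived, which models the behavior described as the bug.
    """
    for _, payload in arrivals:
        write_int(len(payload), stream)
        stream.write(payload)
    if arrivals or always_send_order:
        write_int(END_OF_BATCHES, stream)
        write_int(len(arrivals), stream)
        for index, _ in arrivals:
            write_int(index, stream)


def receive_partitions(stream):
    """Read payloads until the marker, then reorder them by partition index."""
    payloads = []
    while True:
        length = read_int(stream)
        if length == END_OF_BATCHES:
            break
        payloads.append(stream.read(length))
    count = read_int(stream)
    order = [read_int(stream) for _ in range(count)]
    # order[i] is the partition index of the i-th payload that arrived.
    return [payload for _, payload in sorted(zip(order, payloads))]


def round_trip(arrivals, always_send_order):
    buffer = io.BytesIO()
    send_partitions(arrivals, buffer, always_send_order)
    buffer.seek(0)
    return receive_partitions(buffer)
```

In this model, `round_trip([], always_send_order=False)` raises `EOFError` because the reader finds an empty stream, while `round_trip([], always_send_order=True)` returns an empty list.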

How was this patch tested?

New unit test added.

dvogelbacher · 2019-05-20T16:28:30Z

@BryanCutler can you take a look at this one

BryanCutler

Thanks for catching this @dvogelbacher !

python/pyspark/sql/tests/test_arrow.py

BryanCutler · 2019-05-20T22:36:50Z

ok to test

SparkQA · 2019-05-21T01:43:20Z

Test build #105580 has finished for PR 24650 at commit d635a74.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-05-21T03:32:55Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

        }

        sparkSession.sparkContext.runJob(
          arrowBatchRdd,
          (ctx: TaskContext, it: Iterator[Array[Byte]]) => it.toArray,
          0 until numPartitions,
          handlePartitionBatches)
+
+        if (numPartitions == 0) {


This method is well commented. Can you add another comment saying that we should end the stream when there are no partitions?

Also, I would do:

        val partitions = 0 until numPartitions
        sparkSession.sparkContext.runJob(
          arrowBatchRdd,
          (ctx: TaskContext, it: Iterator[Array[Byte]]) => it.toArray,
          partitions,
          handlePartitionBatches)

        if (partitions.isEmpty) {
          // Currently result handler is not called when given partitions are empty.
          // Therefore, we should end stream here.
          doAfterLastPartition()
        }

HyukjinKwon · 2019-05-21T03:42:28Z

Looks fine from skimming the code.

SparkQA · 2019-05-21T04:42:49Z

Test build #105588 has finished for PR 24650 at commit 52a51bf.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler · 2019-05-21T05:29:23Z

I had another thought about this: the code in doAfterLastPartition could be removed from handlePartitionBatches and called after runJob, regardless of whether the partitions are empty or not. This really wouldn't make any difference performance-wise, because it just moves the call outside the callback function and Python is waiting on it anyway.

It also has the benefit that the number of partitions would not have to be kept track of, so the variables partitionCount and numPartitions could be removed. It would be a lot clearer then too.

What do you think @dvogelbacher and @HyukjinKwon ?
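
A rough sketch of the restructuring suggested above, written in plain Python rather than the actual Scala. The names `run_job`, `collect_trailer_in_callback`, and `collect_trailer_after_job` are hypothetical stand-ins, and `run_job` only mimics a job runner that invokes a result handler once per partition; none of this is Spark's real API. It is meant to show that a trailer written inside the callback is never emitted for zero partitions, while a trailer written after the job is always emitted and needs no partition counting.

```python
def run_job(partitions, result_handler):
    """Stand-in job runner: calls the handler once per partition result."""
    # Deliver results in reverse to mimic out-of-order completion.
    for index in reversed(range(len(partitions))):
        result_handler(index, partitions[index])


def collect_trailer_in_callback(partitions):
    """Original shape: the handler detects the last partition itself."""
    messages, arrival_order = [], []

    def handle_partition(index, data):
        messages.append(("batch", data))
        arrival_order.append(index)
        # Requires tracking how many partitions have arrived so far.
        if len(arrival_order) == len(partitions):
            messages.append(("order", list(arrival_order)))

    run_job(partitions, handle_partition)
    return messages


def collect_trailer_after_job(partitions):
    """Suggested shape: write the order unconditionally once the job returns."""
    messages, arrival_order = [], []

    def handle_partition(index, data):
        messages.append(("batch", data))
        arrival_order.append(index)

    run_job(partitions, handle_partition)
    messages.append(("order", list(arrival_order)))
    return messages
```

For a non-empty input both functions produce the same messages. For an empty input the callback version produces nothing at all, whereas the after-job version still emits an empty order message that a reader can consume.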

HyukjinKwon · 2019-05-21T05:40:05Z

Yea, SGTM.

dvogelbacher · 2019-05-21T13:33:38Z

yes, that's a good idea @BryanCutler, it is much clearer. I've made the change.

SparkQA · 2019-05-21T17:08:04Z

Test build #105619 has finished for PR 24650 at commit 9f4bc3e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

Looks pretty good now, just a couple more minor things that could be done to clean it up a bit more if you wouldn't mind.

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

dvogelbacher · 2019-05-22T00:38:43Z

of course, I addressed the comments @BryanCutler

HyukjinKwon

Yea, this way looks better. Looks good to me too

SparkQA · 2019-05-22T03:36:24Z

Test build #105647 has finished for PR 24650 at commit db6f4b1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-05-22T04:21:00Z

Merged to master.

dvogelbacher added 2 commits May 20, 2019 08:56

do after all partitions code when 0 partitions

ec8d280

pytest

d635a74

BryanCutler reviewed May 20, 2019

python/pyspark/sql/tests/test_arrow.py

assert column name

52a51bf

HyukjinKwon reviewed May 21, 2019

HyukjinKwon changed the title ~~[SPARK-27778][PySpark] Fix toPandas conversion using arrow for DFs with no partitions~~ [SPARK-27778][PYTHON] Fix toPandas conversion of empty DataFrame with Arrow enabled May 21, 2019

always write batch order after runJob

9f4bc3e

BryanCutler reviewed May 22, 2019

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

cr

db6f4b1

HyukjinKwon approved these changes May 22, 2019

HyukjinKwon closed this in 034cb13 May 22, 2019

This was referenced Nov 20, 2019

Cherry pick "Propagate SparkExceptions during toPandas with arrow enabled" palantir/spark#623

Closed

[SPARK-27778][PYTHON] Fix toPandas conversion of empty DataFrame with Arrow enabled palantir/spark#625

Merged
