Skip to content

[SPARK-27778][PYTHON] Fix toPandas conversion of empty DataFrame with Arrow enabled #24650

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from
Closed

[SPARK-27778][PYTHON] Fix toPandas conversion of empty DataFrame with Arrow enabled #24650

wants to merge 5 commits into from

Conversation

dvogelbacher
Copy link
Contributor

What changes were proposed in this pull request?

#22275 introduced a performance improvement where we send partitions out of order to python and then, as a last step, send the partition order as well.
However, if there are no partitions we will never send the partition order and we will get an "EofError" on the python side.
This PR fixes this by also sending the partition order if there are no partitions present.

How was this patch tested?

New unit test added.

@dvogelbacher
Copy link
Contributor Author

@BryanCutler can you take a look at this one

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching this @dvogelbacher !

@BryanCutler
Copy link
Member

ok to test

@SparkQA
Copy link

SparkQA commented May 21, 2019

Test build #105580 has finished for PR 24650 at commit d635a74.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

sparkSession.sparkContext.runJob(
arrowBatchRdd,
(ctx: TaskContext, it: Iterator[Array[Byte]]) => it.toArray,
0 until numPartitions,
handlePartitionBatches)

if (numPartitions == 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is well-commented. Can you add another comment that we should end stream when partitions are empty?

Also, I would do:

partitions = 0 until numPartitions
sparkSession.sparkContext.runJob(
  arrowBatchRdd,
  (ctx: TaskContext, it: Iterator[Array[Byte]]) => it.toArray,
  partitions,
  handlePartitionBatches)

if (partitions.isEmpty) {
  // Currently result handler is not called when given partitions are empty.
  // Therefore, we should end stream here.
  doAfterLastPartition()
}

@HyukjinKwon
Copy link
Member

Looks fine given skimming the codes.

@HyukjinKwon HyukjinKwon changed the title [SPARK-27778][PySpark] Fix toPandas conversion using arrow for DFs with no partitions [SPARK-27778][PYTHON] Fix toPandas conversion of empty DataFrame with Arrow enabled May 21, 2019
@SparkQA
Copy link

SparkQA commented May 21, 2019

Test build #105588 has finished for PR 24650 at commit 52a51bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Copy link
Member

I had another thought about this, the stuff in doAfterLastPartition could be removed from handlePartitionBatches and called after runJob regardless if partitions are empty or not. This really wouldn't make any difference performance-wise because it's just moving it outside the callback function and Python is waiting on it anyway.

It also has the benefit that the number of partitions would not have to be kept track of, so the variables partitionCount and numPartitions could be removed. It would be a lot clearer then too.

What do you think @dvogelbacher and @HyukjinKwon ?

@HyukjinKwon
Copy link
Member

Yea, SGTM.

@dvogelbacher
Copy link
Contributor Author

yes, that's a good idea @BryanCutler, it is much clearer. I've made the change.

@SparkQA
Copy link

SparkQA commented May 21, 2019

Test build #105619 has finished for PR 24650 at commit 9f4bc3e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@BryanCutler BryanCutler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good now, just a couple more minor things that could be done to clean it up a bit more if you wouldn't mind.

@dvogelbacher
Copy link
Contributor Author

of course, I addressed the comments @BryanCutler

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this way looks better. Looks good to me too

@SparkQA
Copy link

SparkQA commented May 22, 2019

Test build #105647 has finished for PR 24650 at commit db6f4b1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants