-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-27778][PYTHON] Fix toPandas conversion of empty DataFrame with Arrow enabled #24650
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-27778][PYTHON] Fix toPandas conversion of empty DataFrame with Arrow enabled #24650
Conversation
@BryanCutler can you take a look at this one |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for catching this @dvogelbacher !
ok to test |
Test build #105580 has finished for PR 24650 at commit
|
} | ||
|
||
sparkSession.sparkContext.runJob( | ||
arrowBatchRdd, | ||
(ctx: TaskContext, it: Iterator[Array[Byte]]) => it.toArray, | ||
0 until numPartitions, | ||
handlePartitionBatches) | ||
|
||
if (numPartitions == 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method is well-commented. Can you add another comment that we should end stream when partitions are empty?
Also, I would do:
partitions = 0 until numPartitions
sparkSession.sparkContext.runJob(
arrowBatchRdd,
(ctx: TaskContext, it: Iterator[Array[Byte]]) => it.toArray,
partitions,
handlePartitionBatches)
if (partitions.isEmpty) {
// Currently result handler is not called when given partitions are empty.
// Therefore, we should end stream here.
doAfterLastPartition()
}
Looks fine given skimming the codes. |
Test build #105588 has finished for PR 24650 at commit
|
I had another thought about this, the stuff in It also has the benefit that the number of partitions would not have to be kept track of, so the variables What do you think @dvogelbacher and @HyukjinKwon ? |
Yea, SGTM. |
yes, that's a good idea @BryanCutler, it is much clearer. I've made the change. |
Test build #105619 has finished for PR 24650 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good now, just a couple more minor things that could be done to clean it up a bit more if you wouldn't mind.
of course, I addressed the comments @BryanCutler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, this way looks better. Looks good to me too
Test build #105647 has finished for PR 24650 at commit
|
Merged to master. |
What changes were proposed in this pull request?
#22275 introduced a performance improvement where we send partitions out of order to python and then, as a last step, send the partition order as well.
However, if there are no partitions we will never send the partition order and we will get an "EofError" on the python side.
This PR fixes this by also sending the partition order if there are no partitions present.
How was this patch tested?
New unit test added.