@fangchenli

What changes were proposed in this pull request?

Only convert Arrow columns that are actually used by the scalar pandas UDF(s).

Why are the changes needed?

When executing a scalar pandas UDF, PySpark currently converts all Arrow columns to pandas Series, even if the UDF uses only a subset of them. This is wasteful for wide DataFrames where the UDF needs only a few columns.
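
To illustrate the idea, here is a minimal pure-Python sketch, not PySpark's actual worker code. All names (`convert_column`, `run_udf_all_columns`, `run_udf_selected_columns`, `arg_offsets`) are hypothetical, plain lists stand in for Arrow columns, and a simple counter stands in for the cost of the Arrow-to-pandas conversion.

```python
from typing import Callable, Dict, List

# Assumption: a "batch" is a list of columns, each column a list of ints.
# This only stands in for an Arrow record batch; it is not the real type.
Batch = List[List[int]]


def convert_column(raw: List[int], counter: Dict[str, int]) -> List[float]:
    """Stand-in for the comparatively expensive per-column conversion."""
    counter["conversions"] += 1
    return [float(v) for v in raw]


def run_udf_all_columns(
    batch: Batch, arg_offsets: List[int], udf: Callable, counter: Dict[str, int]
) -> List[float]:
    # Behaviour described above: convert every column, then pick the arguments.
    converted = [convert_column(col, counter) for col in batch]
    return udf(*[converted[i] for i in arg_offsets])


def run_udf_selected_columns(
    batch: Batch, arg_offsets: List[int], udf: Callable, counter: Dict[str, int]
) -> List[float]:
    # Proposed behaviour: convert only the columns the UDF references,
    # converting each referenced column at most once.
    converted = {i: convert_column(batch[i], counter) for i in set(arg_offsets)}
    return udf(*[converted[i] for i in arg_offsets])


def add(a: List[float], b: List[float]) -> List[float]:
    """Hypothetical scalar UDF that reads two columns."""
    return [x + y for x, y in zip(a, b)]


# A wide batch: 100 columns, 3 rows each; column i holds [i, i + 1, i + 2].
wide_batch: Batch = [[i, i + 1, i + 2] for i in range(100)]

all_counter = {"conversions": 0}
selected_counter = {"conversions": 0}
result_all = run_udf_all_columns(wide_batch, [0, 5], add, all_counter)
result_selected = run_udf_selected_columns(wide_batch, [0, 5], add, selected_counter)
```

In this sketch both paths return the same result, but the first performs one conversion per column in the batch while the second performs one per column the UDF references.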

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests included.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Opus 4.5

@github-actions

github-actions bot commented Jan 5, 2026

JIRA Issue Information

=== Improvement SPARK-54901 ===
Summary: Selective column conversion for scalar Pandas UDFs
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

@fangchenli fangchenli marked this pull request as ready for review January 5, 2026 05:28
@fangchenli fangchenli changed the title from "[SPARK-54901] Selective column conversion for scalar pandas UDFs" to "[SPARK-54901][Python] Selective column conversion for scalar pandas UDFs" Jan 5, 2026
@fangchenli fangchenli marked this pull request as draft January 5, 2026 22:32
@fangchenli
Author

Closing since this optimization is not necessary.

@fangchenli fangchenli closed this Jan 5, 2026

2 participants