-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-54598][PYTHON] Extract logic to read UDFs #53330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-54598][PYTHON] Extract logic to read UDFs #53330
Conversation
|
Do you need another function for it? It's basically just two lines: num_udfs = read_int(infile)
udfs = [
read_single_udf(
pickleSer, infile, eval_type, runner_conf, udf_index=i, profiler=profiler
)
for i in range(num_udfs)
]My real concern is - what's the difference between |
read_all_udfsread_all_udfs
read_all_udfs
python/pyspark/worker.py
Outdated
|
|
||
| # Read all UDFs | ||
| num_udfs = read_int(infile) | ||
| udfs = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason you don't like the list comprehension here? It's very commonly used, significantly faster (doesn't matter here because the function itself is super slow) and use less code. Do you have concerns for readability?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, we could change to list comprehension. this was the old logic, I just moved it here.
Thanks. Either way is fine. I have changed to inline the function. |
|
@zhengruifeng could you please also review this? thanks |
|
merged to master |
What changes were proposed in this pull request?
This PR refactors the UDF reading logic in
read_udfs()to eliminate code duplication. Currently, the logic for reading UDFs (functions and their argument offsets) is duplicated across multipleeval_typebranches, with different patterns for single UDF vs. multiple UDFs cases.Why are the changes needed?
This duplication makes the code harder to maintain and increases the risk of inconsistencies. By centralizing the UDF reading logic at the beginning of
read_udfs(), we can:Does this PR introduce any user-facing change?
No, this is an internal refactoring that maintains backward compatibility. The API behavior remains the same from the user's perspective.
How was this patch tested?
Existing Tests
Was this patch authored or co-authored using generative AI tooling?
No