[SPARK-19161][PYTHON][SQL] Improving UDF Docstrings #16534
Conversation
Test build #71160 has finished for PR 16534 at commit
Note: We'll need a bit more work if we want to provide correct argument list for legacy Python.
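The legacy-Python caveat comes from how `functools.wraps` works: it copies `__doc__` and `__name__`, but the argument list is only recovered because Python 3's `inspect.signature` follows the `__wrapped__` attribute back to the original function, a mechanism Python 2 lacked. A minimal sketch of that behavior (function names here are illustrative):

```python
import functools
import inspect

def original(a, b, key=None):
    """Original docstring."""
    return a

@functools.wraps(original)
def wrapper(*args, **kwargs):
    return original(*args, **kwargs)

# functools.wraps sets __wrapped__, so inspect.signature (Python 3)
# follows it back to the original and reports (a, b, key=None);
# Python 2 had no such mechanism, hence the extra work for legacy Python.
sig = inspect.signature(wrapper)
```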
There is an alternative implementation where we could dynamically subclass
Improving UDF Docstrings for Python seems like a good idea, but given the cost of breaking the public API in a point release, I think it might make sense for us to do the more-work approach unless there is a really strong argument for why this part of the API isn't really public. But those are just my thoughts; maybe @davies has a different opinion?
@holdenk I don't think it should go to the point release at all (same as #16533 which, depending on the resolution, may introduce new functionality or breaking API changes). #16538 went to 2.2 so I think it is a reasonable target for all subtasks in SPARK-19159. That being said public vs. private is a bit fuzzy here.
and doesn't document return type otherwise. This is obviously not true. It is also worth noting that we can use a function wrapper without any changes to the API. It is not the most common practice but we can add required attributes to the function to keep full backward compatibility for the time being. One way or another it would be nice to make it consistent with SPARK-18777 though. If we go with a function wrapper here, it would make sense to use one there as well.
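The attribute-based shim described here can be sketched as follows. This is a simplified, hypothetical version: the `UserDefinedFunction` class below is a stand-in for the PySpark one, and `udf` is reduced to the wrapping logic only.

```python
import functools

class UserDefinedFunction:
    """Stand-in for pyspark.sql.functions.UserDefinedFunction."""
    def __init__(self, func, returnType):
        self.func = func
        self.returnType = returnType

    def __call__(self, *args):
        return self.func(*args)

def udf(f, returnType="string"):
    """Simplified sketch of the function-wrapper approach."""
    udf_obj = UserDefinedFunction(f, returnType)

    @functools.wraps(f)  # copies __doc__ and __name__ onto the wrapper
    def wrapper(*args):
        return udf_obj(*args)

    # Attach the UserDefinedFunction attributes to the plain function
    # so existing code reading .func / .returnType keeps working:
    wrapper.func = udf_obj.func
    wrapper.returnType = udf_obj.returnType
    return wrapper

def add_one(x):
    """Adds one."""
    return x + 1

wrapped = udf(add_one, "int")
```

The wrapper now carries the original docstring while still exposing the `func` and `returnType` attributes the old object had.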
It's a bit hard to follow up with those during the JIRA maintenance window - I'll follow up after JIRA comes back online :)
@holdenk Indeed. Not the most fortunate moment for making a bunch of connected PRs :)
Test build #71680 has finished for PR 16534 at commit
Test build #71685 has finished for PR 16534 at commit
@holdenk I used function attributes to make sure that the public API, though not the types, is preserved. Please let me know what you think.
Test build #71723 has finished for PR 16534 at commit
So I'm not super comfortable changing the return type (what if user code has
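A likely shape of this concern (per the later release-notes discussion about `isinstance` checks) can be illustrated with a stand-in class, not the real PySpark one:

```python
class UserDefinedFunction:
    """Stand-in for the PySpark class."""
    def __init__(self, func):
        self.func = func

    def __call__(self, x):
        return self.func(x)

# Pre-change: udf() returned an instance of UserDefinedFunction.
old_udf = UserDefinedFunction(lambda x: x + 1)

# Post-change: udf() returns a plain function wrapper instead.
def new_udf(x):
    return x + 1

# Behavior is the same, but any user isinstance check now fails:
```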
Is the goal to change the doc or the repl string? It might be useful to change the repl string but I'm not sure if it is worth changing the doc.
@rxin I am not aware of any straightforward way of separating these two, but I focused on the docstrings anyway. The rationale is simple - I want to be able to:
Test build #72242 has finished for PR 16534 at commit
|
Force-pushed from 3b3a41b to 2a0ac46
Test build #72951 has finished for PR 16534 at commit
Test build #72949 has finished for PR 16534 at commit
This PR adds `udf` decorator syntax as proposed in [SPARK-19160](https://issues.apache.org/jira/browse/SPARK-19160). This allows users to define a UDF using simplified syntax:

```python
from pyspark.sql.decorators import udf

@udf(IntegerType())
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
```

without the need to define a separate function and udf call. Tested with existing unit tests to ensure backward compatibility and additional unit tests covering the new functionality.
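The decorator mechanics can be shown in isolation with a hypothetical minimal version (a plain string stands in for `IntegerType()` since pyspark isn't assumed available here; the real `udf` builds a Spark UDF object rather than just tagging the function):

```python
def udf(returnType):
    """Hypothetical minimal decorator: records the return type on the
    function itself (the real PySpark udf builds a Spark UDF object)."""
    def decorator(f):
        f.returnType = returnType
        return f
    return decorator

@udf("int")  # stands in for IntegerType()
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
```

The decorated object is still an ordinary function, so its docstring and call behavior are untouched.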
Test build #72966 has finished for PR 16534 at commit
Change looks good to me but I didn't look super carefully. @holdenk can you take a look at this? |
Sure, I'll take another closer look. |
So it feels like we are adding an extra layer of indirection unnecessarily - could you use `update_wrapper` from `functools` directly on the udf object?
To a very limited extent. It can bring some useful information in IPython / Jupyter (maybe some other tools as well) but won't work with the built-in `help`. You can compare:

```python
from functools import wraps

def f(x, *args):
    """This is
    some function"""
    return x

class F:
    def __init__(self, f):
        self.f = f

    def __call__(self, x):
        return self.f(x)  # use the stored function, not the global f

g = wraps(f)(F(f))

@wraps(f)
def h(x):
    return F(f)(x)

# In IPython, compare:
#   ?g  vs  help(g)
#   ?h  vs  help(h)
```

As far as I am aware it is either this or dynamic inheritance.
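The dynamic-inheritance alternative mentioned here can be sketched like this (stand-in `UserDefinedFunction` class, hypothetical helper name `_wrap_udf`): reassigning `__class__` to a throwaway per-function subclass puts the docstring on the *type*, which is where `help()`/pydoc look, while `isinstance` checks keep passing.

```python
class UserDefinedFunction:
    def __init__(self, func, returnType):
        self.func = func
        self.returnType = returnType

    def __call__(self, *args):
        return self.func(*args)

def _wrap_udf(udf_obj, f):
    # Build a per-function subclass whose *class* docstring is taken
    # from f, then reassign __class__. help()/pydoc read
    # type(obj).__doc__, so the right docs show up, and isinstance
    # against UserDefinedFunction still holds.
    cls = type(udf_obj)
    new_cls = type(cls.__name__, (cls,), {"__doc__": f.__doc__})
    udf_obj.__class__ = new_cls
    return udf_obj

def add_one(x):
    """Adds one."""
    return x + 1

u = _wrap_udf(UserDefinedFunction(add_one, "int"), add_one)
```

This buys full backward compatibility at the cost of mutating `__class__`, which is the fragility the PR description alludes to.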
I'm not sure about In IPython
results in a help string (from
Which seems like everything the current implementation does without adding the indirection. Is this not the behavior you are seeing? |
Yes pydoc.help does depend on looking at the docstring on the type rather than the object :( Too bad the IPython magic isn't used in pydoc too. Sorry for all the back and forth, I'm just trying to see if we can improve the documentation without slowing down our already not-super-fast Python UDF performance - how would you feel about doing a small perf test with Python UDFs to make sure this doesn't cause a regression? If there is no regression it looks fine, but if there is maybe we should explore the dynamic sub-classing option. |
Don't worry, I get it :) The point is to make the user experience better, not worse, right? In practice:
Rough tests (build 8f33731):
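A rough micro-benchmark along these lines can be sketched with `timeit` (stand-in class, pure-Python call overhead only, not a real Spark job; timings are machine-dependent, so none are asserted here):

```python
import functools
import timeit

def add_one(x):
    return x + 1

class UserDefinedFunction:
    def __init__(self, func):
        self.func = func

    def __call__(self, x):
        return self.func(x)

direct = UserDefinedFunction(add_one)

@functools.wraps(add_one)
def wrapped(x):  # the extra layer of indirection under test
    return direct(x)

t_direct = timeit.timeit(lambda: direct(1), number=100_000)
t_wrapped = timeit.timeit(lambda: wrapped(1), number=100_000)
print(f"direct: {t_direct:.4f}s  wrapped: {t_wrapped:.4f}s")
```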
Great! Thanks for doing this, will merge to master :) |
Merged to master, thanks @zero323 |
Thanks @holdenk. I think it should be mentioned as a change of behavior in the release notes. We don't change the API, and
I agree, just in case someone does have an isinstance check (or similar) we should document the change in the release notes. |
## What changes were proposed in this pull request?

Replaces the `UserDefinedFunction` object returned from `udf` with a function wrapper providing docstring and argument information, as proposed in [SPARK-19161](https://issues.apache.org/jira/browse/SPARK-19161).

### Backward incompatible changes:

- `pyspark.sql.functions.udf` will return a `function` instead of a `UserDefinedFunction`. To ensure a backward compatible public API we use function attributes to mimic the `UserDefinedFunction` API (`func` and `returnType` attributes). This should have a minimal impact on user code.

An alternative implementation could use dynamic sub-classing. This would ensure full backward compatibility but is more fragile in practice.

### Limitations:

Full functionality (retained docstring and argument list) is achieved only on recent Python versions. Legacy Python versions will preserve only docstrings, not the argument list. This should be an acceptable trade-off between the achieved improvements and overall complexity.

### Possible impact on other tickets:

This can affect [SPARK-18777](https://issues.apache.org/jira/browse/SPARK-18777).

## How was this patch tested?

Existing unit tests to ensure backward compatibility, additional tests targeting the proposed changes.

Author: zero323 <zero323@users.noreply.github.com>

Closes apache#16534 from zero323/SPARK-19161.
Is this still a problem? Now |
Recently we hit some problems while extending python udf, to support |