
[SPARK-19161][PYTHON][SQL] Improving UDF Docstrings #16534


Closed
wants to merge 3 commits

Conversation

zero323
Member

@zero323 zero323 commented Jan 10, 2017

What changes were proposed in this pull request?

Replaces the UserDefinedFunction object returned from udf with a function wrapper that provides docstring and argument information, as proposed in SPARK-19161.

Backward incompatible changes:

  • pyspark.sql.functions.udf will return a function instead of a UserDefinedFunction. To keep the public API backward compatible, we use function attributes to mimic the UserDefinedFunction API (func and returnType attributes), as sketched below. This should have a minimal impact on user code.

    An alternative implementation could use dynamic subclassing. This would ensure full backward compatibility but is more fragile in practice.
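
For illustration only, a minimal sketch of the wrapper approach (names and the UserDefinedFunction construction are simplified; this is not the actual patch):

```
from functools import wraps
from pyspark.sql.functions import UserDefinedFunction

def udf(f, returnType):
    # Build the UserDefinedFunction as before, but return a plain function
    # that forwards calls to it, so the docstring (and, on Python 3, the
    # signature) of the wrapped function is preserved.
    udf_obj = UserDefinedFunction(f, returnType)

    @wraps(f)
    def wrapper(*cols):
        return udf_obj(*cols)

    # Mimic the old UserDefinedFunction API via function attributes.
    wrapper.func = udf_obj.func
    wrapper.returnType = udf_obj.returnType
    return wrapper
```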

Limitations:

Full functionality (retained docstring and argument list) is achieved only on recent Python versions. Legacy Python versions preserve only the docstring, not the argument list. This should be an acceptable trade-off between the achieved improvements and the overall complexity.
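
A small standalone illustration of that limitation (plain Python, no Spark involved):

```
import functools
import inspect

def add_one(x):
    """Adds one."""
    return x + 1

@functools.wraps(add_one)
def wrapper(*args):
    return add_one(*args)

print(wrapper.__doc__)             # "Adds one." on both Python 2 and 3
print(inspect.signature(wrapper))  # "(x)" on Python 3, which follows __wrapped__;
                                   # Python 2 has no inspect.signature and only
                                   # sees the generic (*args) argument list
```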

Possible impact on other tickets:

This can affect SPARK-18777.

How was this patch tested?

Existing unit tests to ensure backward compatibility, additional tests targeting proposed changes.

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71160 has finished for PR 16534 at commit 0cedb33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Jan 10, 2017

Note:

We'll need a bit more work if we want to provide a correct argument list for legacy Python.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkSession available as 'spark'.

In [1]: from pyspark.sql.types import IntegerType

In [2]: from pyspark.sql.functions import udf

In [3]: f = udf(lambda x: x, IntegerType())

In [4]: ?f
Signature: f(*args)
Docstring: SQL Type: IntegerType
File:      ~/Workspace/spark/python/pyspark/sql/functions.py
Type:      function

@zero323
Member Author

zero323 commented Jan 11, 2017

There is an alternative implementation where we could dynamically subclass UserDefinedFunction. It has the benefit of keeping the public API intact (udf would remain Callable -> UserDefinedFunction) but is a bit awkward in practice.

@zero323 zero323 changed the title [SPARK-19161][PYTHON][SQL][WIP] Improving UDF Docstrings [SPARK-19161][PYTHON][SQL] Improving UDF Docstrings Jan 12, 2017
@holdenk
Copy link
Contributor

holdenk commented Jan 12, 2017

Improving UDF docstrings for Python seems like a good idea, but at the cost of breaking the public API in a point release. I think it might make sense for us to take the more involved approach unless there is a really strong argument for why this part of the API isn't really public. But that's just my opinion; maybe @davies has a different one?

@zero323
Member Author

zero323 commented Jan 12, 2017

@holdenk I don't think it should go to the point release at all (same as #16533 which, depending on the resolution, may introduce new functionality or breaking API changes). #16538 went to 2.2 so I think it is a reasonable target for all subtasks in SPARK-19159.

That being said, public vs. private is a bit fuzzy here. The udf docstring states that it:

Creates a Column expression representing a user defined function (UDF)

and doesn't document the return type otherwise. This is obviously not true.

It is also worth noting that we can use a function wrapper without any changes to the API. It is not the most common practice but we can add required attributes to the function to keep full backward compatibility for the time being.

One way or another it would be nice to make it consistent with SPARK-18777 though. If we go with a function wrapper here, it would make sense to use one there as well.

@holdenk
Contributor

holdenk commented Jan 12, 2017

It's a bit hard to follow up with these during the JIRA maintenance window - I'll follow up after JIRA comes back online :)

@zero323
Member Author

zero323 commented Jan 12, 2017

@holdenk Indeed. Not the most fortunate moment for making a bunch of connected PRs :)

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71680 has finished for PR 16534 at commit 3bac064.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71685 has finished for PR 16534 at commit 8dd9071.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Jan 20, 2017

@holdenk I used function attributes to make sure that the public API, though not the types, is preserved. Please let me know what you think.

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71723 has finished for PR 16534 at commit 65411a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jan 26, 2017

So I'm not super comfortable changing the return type (what if user code has isinstance checks against UserDefinedFunction?). That being said, if @davies or one of the other committers thinks this is an OK change as is, I'm fine with that.

@zero323
Member Author

zero323 commented Jan 26, 2017

Thanks @holdenk! Let's wait for another opinion (maybe @rxin), and if it is not acceptable I'll just close this and ask for the ticket to be closed. Theoretically we could define a constructor with a dynamic type:

type(name, (UserDefinedFunction, ), {"__doc__":  func.__doc__})

but this is way too hacky.
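
For concreteness, a hypothetical constructor built around that idea (illustrative names, not proposed code):

```
from pyspark.sql.functions import UserDefinedFunction

def udf_with_doc(func, returnType):
    # One-off subclass whose class-level __doc__ comes from the wrapped
    # function, so help()/pydoc pick it up while isinstance checks against
    # UserDefinedFunction keep working.
    cls = type(func.__name__, (UserDefinedFunction,), {"__doc__": func.__doc__})
    return cls(func, returnType)
```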

@rxin
Contributor

rxin commented Jan 26, 2017

Is the goal to change the doc or the repl string? It might be useful to change the repl string but I'm not sure if it is worth changing the doc.

@zero323
Member Author

zero323 commented Jan 27, 2017

@rxin I am not aware of any straightforward way of separating these two, but I focused on the docstrings anyway. The rationale is simple - I want to be able to:

  • Create packages containing UDFs.

  • Get concise syntax with decorators without the need for intermediate functions or nesting.

  • Import UDFs without side effects.

  • Have docstrings and argument annotations which correspond to the function I wrap, not a generic UserDefinedFunction object - this is what I want to achieve here. As illustrated in the JIRA ticket, what we get right now is completely useless:

    In [5]: ?add_one
    Type:        UserDefinedFunction
    String form: <pyspark.sql.functions.UserDefinedFunction object at 0x7f281ed2d198>
    File:        ~/Spark/spark-2.0/python/pyspark/sql/functions.py
    Signature:   add_one(*cols)
    Docstring:
    User defined function in Python
    
    
    .. versionadded:: 1.3
    
     help(add_one)
    
    Help on UserDefinedFunction in module pyspark.sql.functions object:
    
    class UserDefinedFunction(builtins.object)
     |  User defined function in Python
     |  
     |  .. versionadded:: 1.3
     |  
     |  Methods defined here:
     |  
     |  __call__(self, *cols)
     |      Call self as a function.
     |  
     |  __del__(self)
     |  
     |  __init__(self, func, returnType, name=None)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    (END)
    

    The REPL is definitely the main use case. Handling docs with wraps is much trickier, but there are known workarounds.

@SparkQA

SparkQA commented Feb 1, 2017

Test build #72242 has finished for PR 16534 at commit 9168009.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 zero323 force-pushed the SPARK-19161 branch 2 times, most recently from 3b3a41b to 2a0ac46 on February 15, 2017 at 20:43
@SparkQA

SparkQA commented Feb 15, 2017

Test build #72951 has finished for PR 16534 at commit 2a0ac46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 15, 2017

Test build #72949 has finished for PR 16534 at commit 3b3a41b.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

This PR adds `udf` decorator syntax as proposed in [SPARK-19160](https://issues.apache.org/jira/browse/SPARK-19160).

This allows users to define UDF using simplified syntax:

```
from pyspark.sql.types import IntegerType
from pyspark.sql.decorators import udf

@udf(IntegerType())
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
```

without the need to define a separate function and a separate call to udf.
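
A hypothetical usage of the decorated UDF above (the DataFrame and column names are made up for illustration):

```
df = spark.range(3).toDF("x")
df.select(add_one("x").alias("x_plus_one")).show()
```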

Tested with existing unit tests to ensure backward compatibility and additional unit tests covering the new functionality.
@SparkQA

SparkQA commented Feb 16, 2017

Test build #72966 has finished for PR 16534 at commit 64bba41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 16, 2017

Change looks good to me but I didn't look super carefully.

@holdenk can you take a look at this?

@holdenk
Contributor

holdenk commented Feb 16, 2017

Sure, I'll take another closer look.

@holdenk
Contributor

holdenk commented Feb 22, 2017

So it feels like we are adding an extra layer of indirection unnecessarily. Could you use update_wrapper from functools directly on the udf object?

@zero323
Member Author

zero323 commented Feb 23, 2017

To a very limited extent. It can bring some useful information in IPython / Jupyter (maybe some other tools as well) but won't work with built-in help / pydoc.help.

You can compare:

from functools import wraps

def f(x, *args):
    """This is
    some function"""
    return x

# Callable wrapper object, standing in for UserDefinedFunction
class F():
    def __init__(self, f):
        self.f = f
    def __call__(self, x):
        return self.f(x)

# wraps applied to a class instance
g = wraps(f)(F(f))

# wraps applied to a plain wrapper function
@wraps(f)
def h(x):
    return F(f)(x)

?g
help(g)

?h
help(h)

As far as I am aware it is either this or dynamic inheritance.

@holdenk
Contributor

holdenk commented Feb 23, 2017

I'm not sure about wraps, but with update_wrapper I tested it in a Jupyter kernel and it seems to give all of the docstring and signature information without adding another function dispatch inside of PySpark UDFs.

In IPython

from functools import update_wrapper

def foo(x):
    """Identity"""
    return x

class F():
    def __init__(self, f):
        self.f = f
    def __call__(self, x):
        return self.f(x)

a = update_wrapper(F(foo), foo)

results in a help string (from ?a) of:

Call signature: a(x)
Type: instance
Base Class: __main__.F
String form: <__main__.F instance at 0x7febb43d6ef0>
Docstring: Identity

Which seems like everything the current implementation does without adding the indirection. Is this not the behavior you are seeing?

@zero323
Member Author

zero323 commented Feb 23, 2017

update_wrapper works the same way as wraps - it will be useful for IPython, which uses relatively complex inspection rules, but will be useless anywhere one depends on pydoc.help.

@holdenk
Contributor

holdenk commented Feb 24, 2017

Yes, pydoc.help does depend on looking at the docstring on the type rather than on the object :( Too bad the IPython magic isn't used in pydoc too.
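
A minimal standalone illustration of that difference (plain Python, not Spark code):

```
class F(object):
    """Class-level docstring."""

a = F()
a.__doc__ = "Instance-level docstring"  # what update_wrapper effectively does

# IPython's `?a` inspects the object itself and shows "Instance-level docstring",
# while help(a) documents type(a) and therefore shows "Class-level docstring".
help(a)
```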

Sorry for all the back and forth, I'm just trying to see if we can improve the documentation without slowing down our already not-super-fast Python UDF performance - how would you feel about doing a small perf test with Python UDFs to make sure this doesn't cause a regression?

If there is no regression it looks fine, but if there is, maybe we should explore the dynamic subclassing option.

@zero323
Member Author

zero323 commented Feb 24, 2017

Don't worry, I get it :) The point is to make the user experience better, not worse, right? In practice:

  • These changes are pretty far from the data, so the overall impact is negligible and constant.
  • For UDF creation, the overhead is around 8 microseconds (this doesn't include any JVM communication).
  • With a Py4J call (JUDF and Column creation), everything is bound by JVM communication, which has three orders of magnitude higher latency than our Python code.

Rough tests (build 8f33731):

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Python version 3.5.2 (default, Jul  2 2016 17:53:06)
SparkSession available as 'spark'.

In [1]: from pyspark.sql.functions import udf

In [2]: from functools import wraps

In [3]: def wrapped(f):
   ...:     f_ = udf(f)
   ...:     @wraps(f)
   ...:     def wrapped_(*args):
   ...:         return f_(*args)
   ...:     return wrapped_
   ...: 

In [4]: %timeit udf(lambda x: x)
The slowest run took 8.96 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.45 µs per loop

In [5]: %timeit wrapped(lambda x: x)
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 12.3 µs per loop

In [6]: %timeit udf(lambda x: x)("x")
The slowest run took 13.64 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.3 ms per loop

In [7]: %timeit wrapped(lambda x: x)("a")
100 loops, best of 3: 9.9 ms per loop

In [8]: %timeit -n10  spark.range(0, 10000).toDF("id").select(udf(lambda x: x)("id")).rdd.foreach(lambda _: None)
10 loops, best of 3: 227 ms per loop

In [9]: %timeit -n10  spark.range(0, 10000).toDF("id").select(wrapped(lambda x: x)("id")).rdd.foreach(lambda _: None)
10 loops, best of 3: 206 ms per loop

@holdenk
Contributor

holdenk commented Feb 24, 2017

Great! Thanks for doing this, will merge to master :)

@holdenk
Contributor

holdenk commented Feb 24, 2017

Merged to master, thanks @zero323

@asfgit asfgit closed this in 4a5e38f Feb 24, 2017
@zero323
Member Author

zero323 commented Feb 24, 2017

Thanks @holdenk. I think it should be mentioned as a change of behavior in the release notes. We don't change the API, and UserDefinedFunction is hardly public (it is not even included in the docs); nevertheless, it is a change.

@holdenk
Contributor

holdenk commented Feb 24, 2017

I agree, just in case someone does have an isinstance check (or similar) we should document the change in the release notes.

Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
Author: zero323 <zero323@users.noreply.github.com>

Closes apache#16534 from zero323/SPARK-19161.
@zero323 zero323 deleted the SPARK-19161 branch April 6, 2017 10:58
@cloud-fan
Contributor

Is this still a problem? Now UserDefinedFunction defines returnType as a property.

@cloud-fan
Contributor

Recently we hit some problems while extending Python UDFs to support asNondeterministic, asNonNullable, etc. It's really confusing if the return type is just a Python function.
