
[SPARK-19161][PYTHON][SQL] Improving UDF Docstrings #16534


Closed
wants to merge 3 commits

Conversation

zero323
Member

@zero323 zero323 commented Jan 10, 2017

What changes were proposed in this pull request?

Replaces the UserDefinedFunction object returned from udf with a function wrapper that provides docstring and argument information, as proposed in SPARK-19161.

Backward incompatible changes:

  • pyspark.sql.functions.udf will return a function instead of a UserDefinedFunction. To keep the public API backward compatible, we use function attributes to mimic the UserDefinedFunction API (func and returnType attributes), as sketched below. This should have a minimal impact on user code.

    An alternative implementation could use dynamic subclassing. This would ensure full backward compatibility but is more fragile in practice.
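
For illustration only, a minimal sketch of the wrapper approach (names and the UserDefinedFunction construction are simplified; this is not the actual patch):

```
from functools import wraps
from pyspark.sql.functions import UserDefinedFunction

def udf(f, returnType):
    # Build the UserDefinedFunction as before, but return a plain function
    # that forwards calls to it, so the docstring (and, on Python 3, the
    # signature) of the wrapped function is preserved.
    udf_obj = UserDefinedFunction(f, returnType)

    @wraps(f)
    def wrapper(*cols):
        return udf_obj(*cols)

    # Mimic the old UserDefinedFunction API via function attributes.
    wrapper.func = udf_obj.func
    wrapper.returnType = udf_obj.returnType
    return wrapper
```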

Limitations:

Full functionality (retained docstring and argument list) is achieved only on recent Python versions. Legacy Python versions preserve only the docstring, not the argument list. This should be an acceptable trade-off between the achieved improvements and the overall complexity.
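
A small standalone illustration of that limitation (plain Python, no Spark involved):

```
import functools
import inspect

def add_one(x):
    """Adds one."""
    return x + 1

@functools.wraps(add_one)
def wrapper(*args):
    return add_one(*args)

print(wrapper.__doc__)             # "Adds one." on both Python 2 and 3
print(inspect.signature(wrapper))  # "(x)" on Python 3, which follows __wrapped__;
                                   # Python 2 has no inspect.signature and only
                                   # sees the generic (*args) argument list
```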

Possible impact on other tickets:

This can affect SPARK-18777.

How was this patch tested?

Existing unit tests to ensure backward compatibility, additional tests targeting proposed changes.

@SparkQA

SparkQA commented Jan 10, 2017

Test build #71160 has finished for PR 16534 at commit 0cedb33.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Jan 10, 2017

Note:

We'll need a bit more work if we want to provide a correct argument list for legacy Python.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkSession available as 'spark'.

In [1]: from pyspark.sql.types import IntegerType

In [2]: from pyspark.sql.functions import udf

In [3]: f = udf(lambda x: x, IntegerType())

In [4]: ?f
Signature: f(*args)
Docstring: SQL Type: IntegerType
File:      ~/Workspace/spark/python/pyspark/sql/functions.py
Type:      function

@zero323
Member Author

zero323 commented Jan 11, 2017

There is an alternative implementation where we could dynamically subclass UserDefinedFunction. It has the benefit of keeping the public API intact (udf would remain Callable -> UserDefinedFunction) but is a bit awkward in practice.

@zero323 zero323 changed the title [SPARK-19161][PYTHON][SQL][WIP] Improving UDF Docstrings [SPARK-19161][PYTHON][SQL] Improving UDF Docstrings Jan 12, 2017
@holdenk
Copy link
Contributor

holdenk commented Jan 12, 2017

Improving UDF docstrings for Python seems like a good idea, but at the cost of breaking the public API in a point release. I think it might make sense for us to take the more involved approach unless there is a really strong argument for why this part of the API isn't really public. But that's just my opinion; maybe @davies has a different one?

@zero323
Member Author

zero323 commented Jan 12, 2017

@holdenk I don't think it should go to the point release at all (same as #16533 which, depending on the resolution, may introduce new functionality or breaking API changes). #16538 went to 2.2 so I think it is a reasonable target for all subtasks in SPARK-19159.

That being said, public vs. private is a bit fuzzy here. The udf docstring states that it:

Creates a Column expression representing a user defined function (UDF)

and doesn't document the return type otherwise. This is obviously not true.

It is also worth noting that we can use a function wrapper without any changes to the API. It is not the most common practice but we can add required attributes to the function to keep full backward compatibility for the time being.

One way or another it would be nice to make it consistent with SPARK-18777 though. If we go with a function wrapper here, it would make sense to use one there as well.

@holdenk
Contributor

holdenk commented Jan 12, 2017

It's a bit hard to follow up with these during the JIRA maintenance window - I'll follow up after JIRA comes back online :)

@zero323
Member Author

zero323 commented Jan 12, 2017

@holdenk Indeed. Not the most fortunate moment for making a bunch of connected PRs :)

@SparkQA

SparkQA commented Jan 19, 2017

Test build #71680 has finished for PR 16534 at commit 3bac064.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71685 has finished for PR 16534 at commit 8dd9071.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323
Member Author

zero323 commented Jan 20, 2017

@holdenk I used function attributes to make sure that the public API, though not the types, is preserved. Please let me know what you think.

@SparkQA

SparkQA commented Jan 20, 2017

Test build #71723 has finished for PR 16534 at commit 65411a1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jan 26, 2017

So I'm not super comfortable changing the return type (what if user code has isinstance checks against UserDefinedFunction?). That being said, if @davies or one of the other committers thinks this is an OK change as is, I'm fine with that.

@zero323
Member Author

zero323 commented Jan 26, 2017

Thanks @holdenk! Let's wait for another opinion (maybe @rxin), and if it is not acceptable I'll just close this and ask for the ticket to be closed. Theoretically we could define a constructor with a dynamic type:

type(name, (UserDefinedFunction, ), {"__doc__":  func.__doc__})

but this is way too hacky.
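
For concreteness, a hypothetical constructor built around that idea (illustrative names, not proposed code):

```
from pyspark.sql.functions import UserDefinedFunction

def udf_with_doc(func, returnType):
    # One-off subclass whose class-level __doc__ comes from the wrapped
    # function, so help()/pydoc pick it up while isinstance checks against
    # UserDefinedFunction keep working.
    cls = type(func.__name__, (UserDefinedFunction,), {"__doc__": func.__doc__})
    return cls(func, returnType)
```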

@rxin
Contributor

rxin commented Jan 26, 2017

Is the goal to change the doc or the repl string? It might be useful to change the repl string but I'm not sure if it is worth changing the doc.

@zero323
Member Author

zero323 commented Jan 27, 2017

@rxin I am not aware of any straightforward way of separating these two, but I focused on the docstrings anyway. The rationale is simple - I want to be able to:

  • Create packages containing UDFs.

  • Get concise syntax with decorators without the need for intermediate functions or nesting.

  • Import UDFs without side effects.

  • Have docstrings and argument annotations which correspond to the function I wrap, not a generic UserDefinedFunction object - this is what I want to achieve here. As illustrated in the JIRA ticket, what we get right now is completely useless:

    In [5]: ?add_one
    Type:        UserDefinedFunction
    String form: <pyspark.sql.functions.UserDefinedFunction object at 0x7f281ed2d198>
    File:        ~/Spark/spark-2.0/python/pyspark/sql/functions.py
    Signature:   add_one(*cols)
    Docstring:
    User defined function in Python
    
    
    .. versionadded:: 1.3
    
     help(add_one)
    
    Help on UserDefinedFunction in module pyspark.sql.functions object:
    
    class UserDefinedFunction(builtins.object)
     |  User defined function in Python
     |  
     |  .. versionadded:: 1.3
     |  
     |  Methods defined here:
     |  
     |  __call__(self, *cols)
     |      Call self as a function.
     |  
     |  __del__(self)
     |  
     |  __init__(self, func, returnType, name=None)
     |      Initialize self.  See help(type(self)) for accurate signature.
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
    (END)
    

    The REPL is definitely the main use case. Handling docs with wraps is much trickier, but there are known workarounds.

@SparkQA

SparkQA commented Feb 1, 2017

Test build #72242 has finished for PR 16534 at commit 9168009.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zero323 zero323 force-pushed the SPARK-19161 branch 2 times, most recently from 3b3a41b to 2a0ac46 on February 15, 2017 at 20:43
@SparkQA

SparkQA commented Feb 15, 2017

Test build #72951 has finished for PR 16534 at commit 2a0ac46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 15, 2017

Test build #72949 has finished for PR 16534 at commit 3b3a41b.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

This PR adds `udf` decorator syntax as proposed in [SPARK-19160](https://issues.apache.org/jira/browse/SPARK-19160).

This allows users to define UDF using simplified syntax:

```
from pyspark.sql.types import IntegerType
from pyspark.sql.decorators import udf

@udf(IntegerType())
def add_one(x):
    """Adds one"""
    if x is not None:
        return x + 1
```

without the need to define a separate function and a separate call to udf.
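
A hypothetical usage of the decorated UDF above (the DataFrame and column names are made up for illustration):

```
df = spark.range(3).toDF("x")
df.select(add_one("x").alias("x_plus_one")).show()
```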

Tested with existing unit tests to ensure backward compatibility and additional unit tests covering the new functionality.
@SparkQA

SparkQA commented Feb 16, 2017

Test build #72966 has finished for PR 16534 at commit 64bba41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented Feb 16, 2017

Change looks good to me but I didn't look super carefully.

@holdenk can you take a look at this?

@holdenk
Contributor

holdenk commented Feb 16, 2017

Sure, I'll take another closer look.

@holdenk
Contributor

holdenk commented Feb 22, 2017

So it feels like we are adding an extra layer of indirection unnecessarily. Could you use update_wrapper from functools directly on the udf object?

@zero323
Member Author

zero323 commented Feb 23, 2017

To a very limited extent. It can bring some useful information in IPython / Jupyter (maybe some other tools as well) but won't work with built-in help / pydoc.help.

You can compare:

from functools import wraps

def f(x, *args):
    """This is
    some function"""
    return x

# Callable wrapper object, standing in for UserDefinedFunction
class F():
    def __init__(self, f):
        self.f = f
    def __call__(self, x):
        return self.f(x)

# wraps applied to a class instance
g = wraps(f)(F(f))

# wraps applied to a plain wrapper function
@wraps(f)
def h(x):
    return F(f)(x)

?g
help(g)

?h
help(h)

As far as I am aware it is either this or dynamic inheritance.

@holdenk
Contributor

holdenk commented Feb 23, 2017

I'm not sure about wraps, but with update_wrapper I tested it in a Jupyter kernel and it seems to give all of the docstring and signature information without adding another function dispatch inside of PySpark UDFs.

In IPython

from functools import update_wrapper

def foo(x):
    """Identity"""
    return x

class F():
    def __init__(self, f):
        self.f = f
    def __call__(self, x):
        return self.f(x)

a = update_wrapper(F(foo), foo)

results in a help string (from ?a) of:

Call signature: a(x)
Type: instance
Base Class: __main__.F
String form: <__main__.F instance at 0x7febb43d6ef0>
Docstring: Identity

Which seems like everything the current implementation does without adding the indirection. Is this not the behavior you are seeing?

@zero323
Member Author

zero323 commented Feb 23, 2017

update_wrapper works the same way as wraps - it will be useful for IPython, which uses relatively complex inspection rules, but will be useless anywhere one depends on pydoc.help.

@holdenk
Contributor

holdenk commented Feb 24, 2017

Yes, pydoc.help does depend on looking at the docstring on the type rather than on the object :( Too bad the IPython magic isn't used in pydoc too.
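
A minimal standalone illustration of that difference (plain Python, not Spark code):

```
class F(object):
    """Class-level docstring."""

a = F()
a.__doc__ = "Instance-level docstring"  # what update_wrapper effectively does

# IPython's `?a` inspects the object itself and shows "Instance-level docstring",
# while help(a) documents type(a) and therefore shows "Class-level docstring".
help(a)
```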

Sorry for all the back and forth, I'm just trying to see if we can improve the documentation without slowing down our already not-super-fast Python UDF performance - how would you feel about doing a small perf test with Python UDFs to make sure this doesn't cause a regression?

If there is no regression it looks fine, but if there is, maybe we should explore the dynamic subclassing option.

@zero323
Member Author

zero323 commented Feb 24, 2017

Don't worry, I get it :) The point is to make the user experience better, not worse, right? In practice:

  • These changes are pretty far from the data, so the overall impact is negligible and constant.
  • For UDF creation, the overhead is around 8 microseconds (this doesn't include any JVM communication).
  • With a Py4J call (JUDF and Column creation), everything is bound by JVM communication, which has three orders of magnitude higher latency than our Python code.

Rough tests (build 8f33731):

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.2.0-SNAPSHOT
      /_/

Using Python version 3.5.2 (default, Jul  2 2016 17:53:06)
SparkSession available as 'spark'.

In [1]: from pyspark.sql.functions import udf

In [2]: from functools import wraps

In [3]: def wrapped(f):
   ...:     f_ = udf(f)
   ...:     @wraps(f)
   ...:     def wrapped_(*args):
   ...:         return f_(*args)
   ...:     return wrapped_
   ...: 

In [4]: %timeit udf(lambda x: x)
The slowest run took 8.96 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 3.45 µs per loop

In [5]: %timeit wrapped(lambda x: x)
The slowest run took 6.67 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 12.3 µs per loop

In [6]: %timeit udf(lambda x: x)("x")
The slowest run took 13.64 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.3 ms per loop

In [7]: %timeit wrapped(lambda x: x)("a")
100 loops, best of 3: 9.9 ms per loop

In [8]: %timeit -n10  spark.range(0, 10000).toDF("id").select(udf(lambda x: x)("id")).rdd.foreach(lambda _: None)
10 loops, best of 3: 227 ms per loop

In [9]: %timeit -n10  spark.range(0, 10000).toDF("id").select(wrapped(lambda x: x)("id")).rdd.foreach(lambda _: None)
10 loops, best of 3: 206 ms per loop

@holdenk
Contributor

holdenk commented Feb 24, 2017

Great! Thanks for doing this, will merge to master :)

@holdenk
Contributor

holdenk commented Feb 24, 2017

Merged to master, thanks @zero323

@asfgit asfgit closed this in 4a5e38f Feb 24, 2017
@zero323
Member Author

zero323 commented Feb 24, 2017

Thanks @holdenk. I think it should be mentioned as a change of behavior in the release notes. We don't change the API, and UserDefinedFunction is hardly public (it is not even included in the docs); nevertheless, it is a change.

@holdenk
Contributor

holdenk commented Feb 24, 2017

I agree, just in case someone does have an isinstance check (or similar) we should document the change in the release notes.

Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
Author: zero323 <zero323@users.noreply.github.com>

Closes apache#16534 from zero323/SPARK-19161.
@zero323 zero323 deleted the SPARK-19161 branch April 6, 2017 10:58
@cloud-fan
Contributor

Is this still a problem? Now UserDefinedFunction defines returnType as a property.

@cloud-fan
Contributor

Recently we hit some problems while extending Python UDFs to support asNondeterministic, asNonNullable, etc. It's really confusing if the return type is just a Python function.
