
[SPARK-18069][Doc] improve python API docstrings #15053


Closed
wants to merge 1 commit

Conversation

mortada
Contributor

@mortada mortada commented Sep 11, 2016

What changes were proposed in this pull request?

Many of the Python API functions show example usage that is incomplete: the docstring shows output without the input DataFrame ever being defined, which makes the examples confusing to follow. This PR fixes those docstrings.
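For example, a docstring example that only shows output, such as

>>> df.take(2)
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]

is hard to follow without knowing what df is. Defining the input inside the example makes it self-contained (a sketch of the intended shape, not the exact diff):

>>> df = spark.createDataFrame([(2, 'Alice'), (5, 'Bob')], ['age', 'name'])
>>> df.take(2)
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]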

How was this patch tested?

documentation changes only

Before

After

@srowen
Member

srowen commented Sep 12, 2016

Jenkins test this please

@SparkQA

SparkQA commented Sep 12, 2016

Test build #65242 has finished for PR 15053 at commit 52240bc.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Sep 12, 2016

Looks like the expected output doesn't match now. Maybe you all can teach me something: how did this docstring work as a test at all before? Or, if it doesn't run correctly, is the output not checked?

@HyukjinKwon
Member

HyukjinKwon commented Sep 12, 2016

Oh, it does run the doc tests as far as I know: https://docs.python.org/2/library/doctest.html

Maybe I will try running it locally to check for myself.

Here is how it behaves with the code below:

OK

def my_function(a, b):
    """
    >>> my_function(2, 3)
    6
    """
    return a * b

prints

$  python -m doctest -v test.py
Trying:
    my_function(2, 3)
Expecting:
    6
ok
1 items had no tests:
    test
1 items passed all tests:
   1 tests in test.my_function
1 tests in 2 items.
1 passed and 0 failed.
Test passed.

Failure

def my_function(a, b):
    """
    >>> my_function(2)
    6
    """
    return a * b

prints

$  python -m doctest -v test.py
Trying:
    my_function(2)
Expecting:
    6
**********************************************************************
File "test.py", line 3, in test.my_function
Failed example:
    my_function(2)
Exception raised:
    Traceback (most recent call last):
      File ".../python2.7/doctest.py", line 1315, in __run
        compileflags, 1) in test.globs
      File "<doctest test.my_function[0]>", line 1, in <module>
        my_function(2)
    TypeError: my_function() takes exactly 2 arguments (1 given)
1 items had no tests:
    test
**********************************************************************
1 items had failures:
   1 of   1 in test.my_function
1 tests in 2 items.
0 passed and 1 failed.
***Test Failed*** 1 failures.

@HyukjinKwon
Member

BTW - if you meant how they passed before, not how they ran: it seems some DataFrames are created before the tests actually run, see dataframe.py#L1600-L1617.
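For context, the doctest globals are populated in each module's _test() helper; a simplified sketch of that pattern (not the exact Spark code) looks roughly like this:

def _test():
    import doctest
    from pyspark.sql import SparkSession
    import pyspark.sql.dataframe

    # start from the module's namespace, then add shared inputs for the doctests
    globs = pyspark.sql.dataframe.__dict__.copy()
    spark = SparkSession.builder.master("local[4]").appName("sql.dataframe tests").getOrCreate()
    sc = spark.sparkContext
    globs['spark'] = spark
    globs['sc'] = sc
    # shared DataFrames are created here, before any docstring example runs
    globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]).toDF(['age', 'name'])
    (failure_count, test_count) = doctest.testmod(
        pyspark.sql.dataframe, globs=globs,
        optionflags=doctest.ELLIPSIS | doctest.NORMALIZE_WHITESPACE)
    spark.stop()
    if failure_count:
        exit(-1)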

@@ -354,6 +366,7 @@ def limit(self, num):
def take(self, num):
"""Returns the first ``num`` rows as a :class:`list` of :class:`Row`.

>>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
>>> df.take(2)
[Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
Member

@HyukjinKwon HyukjinKwon Sep 13, 2016

The expected output here should be [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)], or the df definition should be corrected to match.

@HyukjinKwon
Member

@srowen If this change looks good, would it be sensible to remove all the predefined DataFrames defined as globals and instead define them within each doctest?

@srowen
Member

srowen commented Sep 15, 2016

@HyukjinKwon I think you'd know better than I. I tend to agree it would make the examples more self-contained and readable. It would lead to some more duplication of code I guess, but if you judge it's worth it I'd follow your lead.

@HyukjinKwon
Member

@srowen - Thanks for your feedback. In my personal opinion, examples in documentation should be self-contained, at least enough to let users copy and paste them to try out, and to understand the input and output easily. I prefer verbosity and consistency.

Since the output is already printed in the examples, I think the input should be written in the example too. If I were new to PySpark, I guess it'd take a while to understand what each function does in http://spark.apache.org/docs/latest/api/python/pyspark.sql.html.

@holdenk
Contributor

holdenk commented Sep 15, 2016

What if, for the shared DataFrames, we put that information in the package-level docstring, so people reading the documentation would see it first, but we wouldn't need to repeat it for all of the tests?

@HyukjinKwon
Member

HyukjinKwon commented Sep 16, 2016

@holdenk Personally, I like the package-level docstring more, if we can write it clearly and well. I guess it would be a minimal change.
(If we are not sure about that, then I think we could do it per docstring as a safe choice.)

@HyukjinKwon
Member

Hi @mortada, do you mind addressing my comment or @holdenk's? If you find any problem with testing, I am willing to take this over, and I will ask committers to credit it to you.

@mortada
Contributor Author

mortada commented Sep 18, 2016

@HyukjinKwon thanks for your help! I'm happy to complete this PR and follow what you suggest for testing.

How would the package level docstring work? The goal (which I think we all agree on) is to be able to let the user easily see how the input is set up for each function in the docstring in a self-contained way.

@HyukjinKwon
Member

HyukjinKwon commented Sep 18, 2016

I haven't checked whether a package-level docstring can define global variables that other docstrings can access. So I would like to defer this to @holdenk (if you are not sure either, we can all look into it more deeply together), although I guess it would be fine to just move the shared DataFrame code up to the top somewhere within a docstring.

I think it'd be better to follow @holdenk's comment first, as I believe it requires the most minimal change, and see whether we all agree with fixing it that way. If not, we could follow my comment as a safe choice.

@HyukjinKwon
Member

HyukjinKwon commented Sep 19, 2016

@mortada, I am sorry for being a bit noisy here, but I just took a look for myself.

I mimicked the PySpark structure and made a draft for myself.

"""
>>> print df
[1, 2, 3]
>>> print df2
[]
"""

import doctest
import sys


def my_function(a, b):
    """
    >>> my_function(df[0], df[1])
    2
    """
    return a * b


if __name__ == "__main__":
    # inject the shared inputs into the doctest globals, the way PySpark's _test() does
    globs = globals()
    globs['df'] = [1, 2, 3]
    globs['df2'] = []
    doctest.testmod(sys.modules[__name__], globs=globs)

If we do this, we would be able to just add

"""
>>> print df
[1, 2, 3]
>>> print df2
[]
"""

at the top as a module docstring and remove all the duplicated DataFrame definitions in each docstring.

I guess this would be a minimal change that also satisfies our goal here of making the docstrings clear.

You could try to run this by yourself, for example, by

 python test.py -v

I haven't run or tested this against PySpark itself, but maybe we can try it this way.

@mortada
Contributor Author

mortada commented Sep 19, 2016

@HyukjinKwon I understand we can have py.test and doctest, but I don't quite see how we could define the input DataFrame globally and at the same time have a clear, self-contained docstring for each function.

@holdenk could you please elaborate on what you mean?

If we want to repeat something like this in every docstring

>>> print(df.collect())

we might as well simply include how to actually create the DataFrame so the user can easily reproduce the example?

It seems to me that the user would often want to see the docstring to understand how a function works, and they may not be looking at some global documentation as a whole. And the fact that many of the input DataFrames are the same is really just a convenience for the doc writer and not a requirement.

For instance, this is the docstring for a numpy function (numpy.argmax), and the example has its input clearly defined:

Examples
--------
>>> a = np.arange(6).reshape(2,3)
>>> a
array([[0, 1, 2],
       [3, 4, 5]])
>>> np.argmax(a)
5
>>> np.argmax(a, axis=0)
array([1, 1, 1])
>>> np.argmax(a, axis=1)
array([2, 2])

IMHO it seems odd to require the user to look at some global doc in order to follow the example usage of a single function.

@HyukjinKwon
Member

HyukjinKwon commented Sep 19, 2016

Oh, I meant: don't touch the globs in _test(), but print the global DataFrames (or rather show() them, to show the contents) so that users can understand the input and output. So my scenario is: users open the doc, see the input DataFrames first, and then check the documentation of the function they want.

I do like the idea of defining the input within each docstring, but maybe we could first try the minimal-change suggestion to reach the same goal.

I guess she meant defining the global DataFrames within the package docstring, which is, for example,

"""
# here
"""

import doctest
import sys
...

However, after investigating, it seems we can't easily do that. So I suggested printing them in the package docstring, such as

"""
>>> df.show()
+-----+
|field|
+-----+
|value|
|value|
+-----+
"""

import doctest
import sys
...

rather than defining them in the package docstring.

@holdenk
Contributor

holdenk commented Sep 19, 2016

I was thinking that the user would probably read the package-level docstring before looking at the individual functions (or would go there looking for the definition of the DataFrame). I'm a little hesitant to add a bunch of copy-and-paste code, so it's possible I'm being overly cautious.

@HyukjinKwon
Member

HyukjinKwon commented Sep 19, 2016

@holdenk I am also cautious, but leaving everything as it is and just adding df.show() to the package docstring, while cleaning up the duplicated DataFrame definitions in each docstring, would be a minimal change and fine for the goal.

@holdenk
Contributor

holdenk commented Sep 19, 2016

Oh yes I agree that's a much smaller change - I was just explaining the motivation behind my initial comment as mortada asked me to elaborate.

@@ -411,7 +415,7 @@ def monotonically_increasing_id():

The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
The current implementation puts the partition ID in the upper 31 bits, and the record number
within each partition in the lower 33 bits. The assumption is that the data frame has
within each partition in the lower 33 bits. The assumption is that the DataFrame has
Member

@HyukjinKwon HyukjinKwon Sep 19, 2016

It'd be nicer to use :class:`DataFrame` instead of the plain string DataFrame if you would like to change this. That renders it as a link to the DataFrame class.

Contributor Author

@HyukjinKwon great idea, will update

@srowen
Member

srowen commented Sep 25, 2016

@mortada what do you think of @HyukjinKwon 's suggestions?

@HyukjinKwon
Member

Hm, @mortada, it seems the DataFrame definition is now added to each docstring. Could you maybe remove the duplicates and just print the contents in the package-level docstring instead?

@mortada
Contributor Author

mortada commented Sep 26, 2016

In other words, currently it is not possible for the user to follow the examples in the docstring below. It's not clear what all these input variables (df, df2, etc.) are, or where you'd even find them:

In [5]: DataFrame.join?
Signature: DataFrame.join(self, other, on=None, how=None)
Docstring:
Joins with another :class:`DataFrame`, using the given join expression.

:param other: Right side of the join
:param on: a string for the join column name, a list of column names,
    a join expression (Column), or a list of Columns.
    If `on` is a string or a list of strings indicating the name of the join column(s),
    the column(s) must exist on both sides, and this performs an equi-join.
:param how: str, default 'inner'.
    One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.

The following performs a full outer join between ``df1`` and ``df2``.

>>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
[Row(name=None, height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]

>>> df.join(df2, 'name', 'outer').select('name', 'height').collect()
[Row(name='Tom', height=80), Row(name='Bob', height=85), Row(name='Alice', height=None)]

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
[Row(name='Alice', age=2), Row(name='Bob', age=5)]

>>> df.join(df2, 'name').select(df.name, df2.height).collect()
[Row(name='Bob', height=85)]

>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
[Row(name='Bob', age=5)]

.. versionadded:: 1.3
File:      ~/code/spark/python/pyspark/sql/dataframe.py
Type:      function

@HyukjinKwon
Member

@mortada Yes, I understand your point. However, I'd like to avoid changing a lot, and I guess @holdenk feels the same way (she is insightful in Python).

I guess it wouldn't take a lot of time to fix this. Let's do it this way first. Then I am happy to help build the documentation and post the images here so that we can decide which way is better.

I can't currently picture how the documentation would look, or how pretty and clean it would be, with the package-level docstring approach.

So it'd be nicer if we had some images or something similar that we can discuss further.

@HyukjinKwon
Member

HyukjinKwon commented Sep 26, 2016

@mortada Otherwise, let me try this locally and post some images here myself, if you can bear with this for a few days :)

@HyukjinKwon
Member

OK, I just manually fixed and built the docs.

OK, I tend to agree that Suggestion A looks better, but I am worried it would slow the PySpark tests down since it now creates each DataFrame for each test. @holdenk and @srowen WDYT?

@HyukjinKwon
Member

IMHO, I think the original author intended to avoid creating each DataFrame in each method docstring. I am uncertain about doing it per method docstring, although I do prefer verbose/complete documentation. If we go for Suggestion A, then I think we should

  • verify whether there is any degradation from this change (e.g. a test-time increase) and then decide whether the change is worth it; we should be able to defend it with a concrete reason or reference, not only here and now but also in the future, or
  • find another way to reach this goal.

@HyukjinKwon
Member

HyukjinKwon commented Sep 26, 2016

Also, I guess this change will set an example and affect all of the Python documentation, and I worry about disagreement from others in the future. So I'd rather establish a clear reason and investigate further.

I think the reference (numpy) is a good one. Could we maybe at least check the elapsed time of the doctests, if feasible?

@mortada
Contributor Author

mortada commented Sep 29, 2016

@HyukjinKwon do you know how I could run the doctests for these files? I found this online: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals which says that I could do

SPARK_TESTING=1 ./bin/pyspark python/pyspark/my_file.py.

but that doesn't actually work

$ SPARK_TESTING=1 ./bin/pyspark python/pyspark/sql/dataframe.py
python: Error while finding spec for 'python/pyspark/sql/dataframe.py' (ImportError: No module named 'python/pyspark/sql/dataframe')

In terms of the extra time to create these small DataFrames, I think we'd be looking at something negligible (a few seconds at most):

In [2]: %timeit df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
100 loops, best of 3: 10.3 ms per loop

i.e. this should be on the order of 1 second even with creating df 100 times.

@HyukjinKwon
Member

@mortada I see. Then,

  • It would barely increase the test time.
  • It will improve the documentation.
  • We have a reference for this (numpy).

Therefore, +1 for defining the input within each method's docstring. Can we remove the DataFrames from globs and then do the same thing across the PySpark documentation? (A sketch of the pattern follows the grep output below.)

I just took a quick look and it seems we have more instances, such as:

$ grep -r "globs\['df" .

./pyspark/sql/column.py:    globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
./pyspark/sql/context.py:    globs['df'] = rdd.toDF()
./pyspark/sql/dataframe.py:    globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')])\
./pyspark/sql/dataframe.py:    globs['df2'] = sc.parallelize([Row(name='Tom', height=80), Row(name='Bob', height=85)]).toDF()
./pyspark/sql/dataframe.py:    globs['df3'] = sc.parallelize([Row(name='Alice', age=2),
./pyspark/sql/dataframe.py:    globs['df4'] = sc.parallelize([Row(name='Alice', age=10, height=80),
./pyspark/sql/functions.py:    globs['df'] = sc.parallelize([Row(name='Alice', age=2), Row(name='Bob', age=5)]).toDF()
./pyspark/sql/group.py:    globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]) \
./pyspark/sql/group.py:    globs['df3'] = sc.parallelize([Row(name='Alice', age=2, height=80),
./pyspark/sql/group.py:    globs['df4'] = sc.parallelize([Row(course="dotNET", year=2012, earnings=10000),
./pyspark/sql/readwriter.py:    globs['df'] = spark.read.parquet('python/test_support/sql/parquet_partitioned')
./pyspark/sql/session.py:    globs['df'] = rdd.toDF()
./pyspark/sql/streaming.py:    globs['df'] = \
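For illustration, this is the shape of the change being discussed (a sketch only, reusing the take() example from this PR's diff, with the output order following the review comment above; the method body is elided):

# in _test(): the shared definition would be dropped, e.g.
# globs['df'] = sc.parallelize([(2, 'Alice'), (5, 'Bob')]).toDF(...)

# in each docstring: the input is defined inline instead
def take(self, num):
    """Returns the first ``num`` rows as a :class:`list` of :class:`Row`.

    >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
    >>> df.take(2)
    [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
    """
    # ... method body unchanged ...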

Also, I think we'd better have a JIRA.

@HyukjinKwon
Member

HyukjinKwon commented Sep 30, 2016

BTW, it'd be even nicer if the images in #15053 (comment) were included in the PR description, since we already have them.

Just copy and paste

**Before**
<img width="500" src="https://cloud.githubusercontent.com/assets/6477701/18833109/11098e5c-842a-11e6-9e48-6b3cae7c10b0.png">

**After**
<img width="500" src="https://cloud.githubusercontent.com/assets/6477701/18833116/1a400f82-842a-11e6-9da4-e4db0197e9e7.png">

@mortada mortada changed the title from "[Doc] improve python API docstrings" to "[SPARK-18069][Doc] improve python API docstrings" on Oct 24, 2016
@mortada
Contributor Author

mortada commented Oct 24, 2016

@HyukjinKwon I've created a JIRA ticket and also went through all the files you mentioned, please take a look

@HyukjinKwon
Member

HyukjinKwon commented Oct 24, 2016

@srowen Do you mind triggering the build via Jenkins if it looks reasonable (or maybe "ok to test")? Let me try to take a closer look by tomorrow, but I would like @holdenk to take a look as well, please.

@SparkQA

SparkQA commented Oct 24, 2016

Test build #3371 has finished for PR 15053 at commit d384a95.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

>>> df.select(df.age.cast("string").alias('ages')).collect()
[Row(ages=u'2'), Row(ages=u'5')]
[Row(ages='2'), Row(ages='5')]
Member

If I remember correctly, this change breaks the tests in Python 2.x. Let's use unicode literals here.

Member

@HyukjinKwon HyukjinKwon Oct 24, 2016

FYI, unicode strings in Python 3 do not have the 'u' prefix, so this is handled via @ignore_unicode_prefix.

>>> df.select(df.age.cast(StringType()).alias('ages')).collect()
[Row(ages=u'2'), Row(ages=u'5')]
[Row(ages='2'), Row(ages='5')]
Member

Here too; let's please change this one back as well.

from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext
import pyspark.sql.context

Member

@HyukjinKwon HyukjinKwon Oct 24, 2016

Let's keep the newline here. The block of imports is usually followed by a blank line before the code (I believe pep8 will complain otherwise).
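A minimal illustration of the point (not the actual context.py code, just a sketch of the layout):

from pyspark.context import SparkContext
from pyspark.sql import Row, SQLContext
import pyspark.sql.context

# a blank line separates the import block from the code that follows
globs = pyspark.sql.context.__dict__.copy()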

@@ -121,6 +121,7 @@ def __init__(self, jsqm):
def active(self):
"""Returns a list of active queries associated with this SQLContext

>>> sdf = spark.readStream.format('text').load('python/test_support/sql/streaming')
Member

I think we might be able to define a global path in _test() and reuse that variable for 'python/test_support/sql/streaming', because I guess that path does not really mean anything to users.

I (personally) prefer to follow the numpy reference here, but if there is no case for that here, then we could define a global variable and reuse it.
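For illustration only, a sketch of the global-path idea, following the _test() pattern sketched earlier (the name sdf_path is hypothetical):

# in streaming.py's _test(), next to the other doctest globals
globs['sdf_path'] = 'python/test_support/sql/streaming'

# the docstring example could then hide the literal path:
# >>> sdf = spark.readStream.format('text').load(sdf_path)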

>>> sdf_schema = StructType([StructField("data", StringType(), False)])
>>> json_sdf = (spark.readStream.format("json")
... .schema(sdf_schema)
... .load(tempfile.mkdtemp()))
Member

@HyukjinKwon HyukjinKwon Oct 24, 2016

Strictly speaking, I believe the removed syntax is the one encouraged for consistency, judging from the rest of the code and documentation. We had better revert to that syntax unless there is a strong, specific reason.

@mortada
Contributor Author

mortada commented Jan 28, 2017

@HyukjinKwon my apologies for not having been able to follow up on this. I still think this doc improvement would be very helpful to pyspark users. Would you like to take over the PR?

Thanks

@HyukjinKwon
Member

Sure, let me try to pick up the commits here and open another PR soon, cc'ing you and everyone here.

@HyukjinKwon
Member

Thank you very much for your update.

@holdenk
Contributor

holdenk commented Jan 30, 2017

Give me a shout once the new PR is ready @HyukjinKwon - in the meantime maybe it would be OK to close this one so it doesn't keep showing up in the dashboard?

@HyukjinKwon
Member

Thank you @holdenk I definitely will. +1 for closing.
