[SPARK-18069][Doc] improve python API docstrings #15053
Conversation
Jenkins test this please
Test build #65242 has finished for PR 15053 at commit
Looks like the expected output doesn't match now. Maybe you all can teach me something -- how did this docstring work as a test at all before? Or, if it doesn't run correctly, is the output not checked?
Oh, it will run doctests as far as I know: https://docs.python.org/2/library/doctest.html. Maybe I will try to run it locally to check it by myself. It seems it runs with the code below.

OK:

    def my_function(a, b):
        """
        >>> my_function(2, 3)
        6
        """
        return a * b

prints

    $ python -m doctest -v test.py
    Trying:
        my_function(2, 3)
    Expecting:
        6
    ok
    1 items had no tests:
        test
    1 items passed all tests:
       1 tests in test.my_function
    1 tests in 2 items.
    1 passed and 0 failed.
    Test passed.

Failure:

    def my_function(a, b):
        """
        >>> my_function(2)
        6
        """
        return a * b

prints

    $ python -m doctest -v test.py
    Trying:
        my_function(2)
    Expecting:
        6
    **********************************************************************
    File "test.py", line 3, in test.my_function
    Failed example:
        my_function(2)
    Exception raised:
        Traceback (most recent call last):
          File ".../python2.7/doctest.py", line 1315, in __run
            compileflags, 1) in test.globs
          File "<doctest test.my_function[0]>", line 1, in <module>
            my_function(2)
        TypeError: my_function() takes exactly 2 arguments (1 given)
    1 items had no tests:
        test
    **********************************************************************
    1 items had failures:
        1 of 1 in test.my_function
    1 tests in 2 items.
    0 passed and 1 failed.
    ***Test Failed*** 1 failures.
BTW - if you meant how they passed before, not how they ran: it seems some dataframes were created before actually running the tests, dataframe.py#L1600-L1617.
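For context, the mechanism is that _test() builds the dataframes once and hands them to doctest as globals. A simplified sketch of that pattern (the exact globals, dataframes, and setup in dataframe.py differ):

    import doctest
    import pyspark.sql.dataframe
    from pyspark.sql import SparkSession

    def _test():
        globs = pyspark.sql.dataframe.__dict__.copy()
        spark = SparkSession.builder.master("local[4]").appName("sql.dataframe tests").getOrCreate()
        # dataframes are created once here and injected as doctest globals, so a
        # docstring can reference `df` without ever defining it in the example
        globs['spark'] = spark
        globs['df'] = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
        failure_count, test_count = doctest.testmod(pyspark.sql.dataframe, globs=globs)
        spark.stop()

This is why the examples pass even though their inputs never appear in the docstrings themselves.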
@@ -354,6 +366,7 @@ def limit(self, num):
     def take(self, num):
         """Returns the first ``num`` rows as a :class:`list` of :class:`Row`.

+        >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
         >>> df.take(2)
         [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
The expected output should be [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)], or correct the df.
@srowen Would it be sensible, if this change looks good, to remove all the predefined dataframes as globals and define them within each doctest instead?
(force-pushed from 52240bc to 1efaaf5)
@HyukjinKwon I think you'd know better than I. I tend to agree it would make the examples more self-contained and readable. It would lead to some more duplication of code I guess, but if you judge it's worth it I'd follow your lead.
@srowen - Thanks for your feedback. In my personal opinion, I believe examples in documentation should be self-contained, at least to allow copying and pasting them to test, and to let users understand the input and output easily. I prefer verbosity and consistency. As the output is already being printed in the examples, I think the input should be written in the example too. If I were a newbie in PySpark, I guess it'd take a while to understand what each function does in http://spark.apache.org/docs/latest/api/python/pyspark.sql.html.
What about putting the comment for the shared DF in the package-level docstring, so people reading the documentation would see it first but we wouldn't need to repeat it for all of the tests?
@holdenk Personally, I like the package-level docstring more if we can write it prettily and well. I guess it will be a minimal change.
@HyukjinKwon thanks for your help! I'm happy to complete this PR and follow what you suggest for testing. How would the package-level docstring work? The goal (which I think we all agree on) is to be able to let the user easily see how the input is set up for each function in the docstring in a self-contained way.
I haven't checked if a package-level docstring can define global variables to be accessed from other docstrings. So, I would like to defer this to @holdenk (if you are not sure too, then we can all look into this deeper together), although I guess it is fine to just move the code for the shared dataframes up to the top somewhere within a docstring. I think it'd be better if we follow @holdenk's comment first, as I believe this requires a minimal change, and let's see if we all agree with fixing it in that way. If not, we could follow my comment as a safe choice.
@mortada, I am sorry that I am being a bit noisy here, but I just took a look for myself. I mirrored the PySpark structure and made a draft:

    """
    >>> print df
    [1, 2, 3]
    >>> print df2
    []
    """
    import doctest
    import sys

    def my_function(a, b):
        """
        >>> my_function(df[0], df[1])
        2
        """
        return a * b

    if __name__ == "__main__":
        globs = globals()
        globs['df'] = [1, 2, 3]
        globs['df2'] = []
        doctest.testmod(sys.modules[__name__], globs=globs)

If we do this, we would be able to just add

    """
    >>> print df
    [1, 2, 3]
    >>> print df2
    []
    """

on the top as a docstring and remove all the duplicated dataframe definitions in each docstring. I guess this will be a minimal change and will also satisfy our demands here by making the docstrings clear. You could try to run this by yourself, for example, by

    python test.py -v

I haven't run and tested this against PySpark. Though, maybe we can try it in this way.
@HyukjinKwon I understand we can have that setup. @holdenk, could you please elaborate on what you mean? If we want to repeat something like this in every docstring, we might as well simply include how to actually create the DataFrame so the user can easily reproduce the example. It seems to me that the user would often want to see the docstring to understand how a function works, and they may not be looking at some global documentation as a whole. And the fact that many of the input DataFrames are the same is really just a convenience for the doc writer, not a requirement. For instance, this is the docstring for a numpy method:

IMHO it seems odd to require the user to look at some global doc in order to follow the example usage for one single function.
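For concreteness, a fully self-contained docstring in that style might look like the sketch below, based on the take example from the diff above (the simplified body and the exact wording are illustrative, not the PR's actual text):

    def take(self, num):
        """Returns the first ``num`` rows as a :class:`list` of :class:`Row`.

        >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
        >>> df.take(2)
        [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
        """
        return self.limit(num).collect()  # simplified body, for illustration only

The reader can copy and paste the example into a PySpark shell (where `spark` is already defined) and reproduce both input and output.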
Oh, I meant just don't touch the existing definitions in _test(). I do like the idea of defining each within each docstring, but maybe we could try the minimal-change suggestion to reach the same goal first. I guess she meant defining the global dataframes within the package docstring, which is, for example:

    """
    # here
    """
    import doctest
    import sys
    ...

However, after investigating, it seems we can't easily do that. So, I suggested printing them in the package docstring, such as

    """
    >>> df.show()
    +-----+
    |field|
    +-----+
    |value|
    |value|
    +-----+
    """
    import doctest
    import sys
    ...

rather than defining them in the package docstring.
I was thinking that the user would probably read the package-level docstring documentation before looking at the individual functions (or if they went looking for the definition of the dataframe). I'm a little hesitant to add a bunch of copy-and-paste code, so it's possible I'm being overly cautious.
@holdenk I am also cautious, but I meant leaving everything as it is and just adding the printed dataframes in the package-level docstring.
Oh yes, I agree that's a much smaller change - I was just explaining the motivation behind my initial comment, as @mortada asked me to elaborate.
@@ -411,7 +415,7 @@ def monotonically_increasing_id():

     The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
     The current implementation puts the partition ID in the upper 31 bits, and the record number
-    within each partition in the lower 33 bits. The assumption is that the data frame has
+    within each partition in the lower 33 bits. The assumption is that the DataFrame has
It'd be nicer if we use :class:`DataFrame` instead of the plain string DataFrame, if you would like to change this. This makes it a link to the DataFrame class.
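For example, the suggested markup would change the sentence to something like (a sketch of the one-line edit):

    within each partition in the lower 33 bits. The assumption is that the :class:`DataFrame` has

Sphinx then renders :class:`DataFrame` as a hyperlink to the class documentation instead of plain text.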
@HyukjinKwon great idea, will update
@mortada what do you think of @HyukjinKwon's suggestions?
(force-pushed from 1efaaf5 to 831442c)
Hm.. @mortada It seems the dataframe definition is added into each docstring. Could you maybe remove the duplicates and then just print the contents in the package-level docstring?
In other words, currently it is not possible for the user to follow the examples in the docstring below. It's not clear what all these input variables are.
@mortada yes, I understand your point. However, I'd like to avoid changing a lot, and I guess @holdenk feels the same way (she is insightful in Python). I guess it'd not take a lot of time to fix this. Let's do it in this way first. Then, I am happy to help build the documentation and leave the images here so that we can decide which way is better. I can't currently imagine how the documentation would look, or how pretty or clean it'd be, with the package-level docstring approach. So, it'd be nicer if we had some images or similar things we can discuss further.
@mortada Otherwise, please let me try this locally and post some images here myself, if you can bear with this for a few days :)
OK, I just manually fixed and built the docs.
OK, I tend to agree. Suggestion A looks better, but I am worried it would slow the PySpark tests down, as it now creates each dataframe for each test. @holdenk and @srowen WDYT?
(force-pushed from 831442c to 5ae5c37)
IMHO, I think the original author intended to avoid creating each dataframe in each method docstring. I am uncertain about doing this in each method docstring, although I do prefer verbose/complete documentation. If we should go for Suggestion A, then I think we should:
Also, I guess this change will serve as an example and will affect all the Python documentation, and I worry about disagreement from others in the future. So, I'd rather set out a clear reason and investigate further. I think the reference (numpy) is a good guide here.
@HyukjinKwon do you know how I could run the doctests for these files? I found this online: https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals, which says that I could do

but that doesn't actually work.

In terms of the extra time to create these small DataFrames, I think we'd probably be looking at something reasonably negligible (a few seconds tops), i.e. this should be on the order of 1 second even with creating
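To put rough numbers on that claim, something like the quick benchmark below could be run; the session setup is a sketch and the timing expectation is an assumption, not a measurement:

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("doctest-timing").getOrCreate()

    start = time.time()
    for _ in range(100):
        # a tiny two-row dataframe, the kind each doctest would now create
        df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
    print("created 100 small dataframes in %.1f s" % (time.time() - start))

    spark.stop()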
@mortada I see. Then,

Therefore, +1 for each docstring within each method. Can we remove the dataframes in

I just took a quick look and it seems we have more instances, such as ..

Also, I think we'd better have a JIRA.
BTW, it'd be even nicer if the images in #15053 (comment) were included in the PR description, as we have the images already. Just copy and paste them.
(force-pushed from 5ae5c37 to 4c57c23)
(force-pushed from 4c57c23 to d384a95)
@HyukjinKwon I've created a JIRA ticket and also went through all the files you mentioned; please take a look.
Test build #3371 has finished for PR 15053 at commit
>>> df.select(df.age.cast("string").alias('ages')).collect() | ||
[Row(ages=u'2'), Row(ages=u'5')] | ||
[Row(ages='2'), Row(ages='5')] |
If I remember correctly, this change breaks the tests in Python 2.x. Let's use unicode strings here.
FYI, the unicode string in Python 3 does not have the prefix 'u'. So, this is being handled via @ignore_unicode_prefix.
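For reference, a decorator along these lines strips the u'' prefixes from the expected outputs when running under Python 3, so one docstring can satisfy both interpreters. This is a minimal sketch of the idea, not necessarily Spark's exact implementation:

    import re
    import sys

    def ignore_unicode_prefix(f):
        """Make doctests with u'...' expected output pass on Python 3."""
        if sys.version >= '3':
            # Python 3 reprs carry no u'' prefix, so rewrite the docstring's
            # expected output by dropping any u/U before a quote
            literal_re = re.compile(r"(\W|^)[uU](['])", re.UNICODE)
            f.__doc__ = literal_re.sub(r'\1\2', f.__doc__)
        return f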
     >>> df.select(df.age.cast(StringType()).alias('ages')).collect()
-    [Row(ages=u'2'), Row(ages=u'5')]
+    [Row(ages='2'), Row(ages='5')]
Here, too. Let's please revert this one as well.
    from pyspark.context import SparkContext
    from pyspark.sql import Row, SQLContext
    import pyspark.sql.context
Let's leave the newline here. Usually the block of imports has a newline before the code (I believe pep8 will complain otherwise).
@@ -121,6 +121,7 @@ def __init__(self, jsqm):
     def active(self):
         """Returns a list of active queries associated with this SQLContext

+        >>> sdf = spark.readStream.format('text').load('python/test_support/sql/streaming')
I think we might be able to define a global path in _test() and reuse the variable for 'python/test_support/sql/streaming', because I guess that path does not really mean anything to users. I (personally) prefer to follow the reference, numpy, here, but if there is no such case there, then we could define a global variable and reuse it.
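A sketch of that suggestion, with the variable name `sdf_path` being hypothetical: expose the path once as a doctest global so the docstrings never spell out the literal path:

    import doctest

    def _test():
        globs = globals().copy()
        # define the test-data path once; docstrings can then write
        # >>> sdf = spark.readStream.format('text').load(sdf_path)
        globs['sdf_path'] = 'python/test_support/sql/streaming'
        doctest.testmod(globs=globs)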
    >>> sdf_schema = StructType([StructField("data", StringType(), False)])
    >>> json_sdf = (spark.readStream.format("json")
    ...     .schema(sdf_schema)
    ...     .load(tempfile.mkdtemp()))
Strictly, I believe the removed syntax is encouraged for consistency, as I gather from the other code and documentation. We had better revert the syntax unless there is a strong specific reason.
@HyukjinKwon my apologies for not having been able to follow up on this. I still think this doc improvement would be very helpful to pyspark users. Would you like to take over the PR? Thanks
Sure, let me try to pick up the commits here and open another PR soon, cc'ing you and everyone here.
Thank you very much for your update.
Give me a shout once the new PR is ready @HyukjinKwon - in the meantime maybe it would be OK to close this one so it doesn't keep showing up in the dashboard?
Thank you @holdenk, I definitely will. +1 for closing.
What changes were proposed in this pull request?
A lot of the Python API functions show example usage that is incomplete: the docstring shows output without having the input DataFrame defined, which can make the example quite confusing to understand. This PR fixes the docstrings.
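As a sketch of the kind of change (using the take example discussed above; the exact docstrings touched by the PR vary):

    # before: the expected output references a `df` that is never defined
    """
    >>> df.take(2)
    [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    """

    # after: the input dataframe is constructed inside the example itself
    """
    >>> df = spark.createDataFrame([('Alice', 2), ('Bob', 5)], ['name', 'age'])
    >>> df.take(2)
    [Row(name=u'Alice', age=2), Row(name=u'Bob', age=5)]
    """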
How was this patch tested?
documentation changes only
Before

After
