
Commit 5cd79c3

nchammas authored and rxin committed
[SPARK-16772] Correct API doc references to PySpark classes + formatting fixes
## What's Been Changed

The PR corrects several broken or missing class references in the Python API docs. It also corrects formatting problems. For example, you can see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.registerFunction) how Sphinx is not picking up the reference to `DataType`. That's because the reference is relative to the current module, whereas `DataType` is in a different module.

You can also see [here](http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.SQLContext.createDataFrame) how the formatting for byte, tinyint, and so on is italic instead of monospace. That's because in ReST single backticks just make things italic, unlike in Markdown.

## Testing

I tested this PR by [building the Python docs](https://github.com/apache/spark/tree/master/docs#generating-the-documentation-html) and reviewing the results locally in my browser. I confirmed that the broken or missing class references were resolved, and that the formatting was corrected.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #14393 from nchammas/python-docstring-fixes.

(cherry picked from commit 274f3b9)
Signed-off-by: Reynold Xin <rxin@databricks.com>
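To illustrate the two kinds of fixes the message describes, here is a minimal docstring sketch; the function and its parameters are invented for illustration and are not part of the patch. It shows a fully qualified `:class:` target, which Sphinx can resolve from any module, and double backticks, which render as monospace in ReST.

```python
def cast_to(col, returnType):
    """Casts ``col`` to the given type. (Hypothetical example, not from the patch.)

    A bare reference like :class:`DataType` only resolves when the target lives in
    the current module; the fully qualified form below resolves from anywhere.

    In ReST, single backticks (`byte`) merely italicize; double backticks
    (``byte``, ``tinyint``) render as monospace, unlike Markdown.

    :param col: the column to cast (hypothetical parameter)
    :param returnType: a :class:`pyspark.sql.types.DataType` object
    """
```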
1 parent fb09a69 commit 5cd79c3

File tree: 8 files changed (+75, -58 lines)

python/pyspark/sql/catalog.py

Lines changed: 1 addition & 1 deletion
@@ -193,7 +193,7 @@ def registerFunction(self, name, f, returnType=StringType()):

  :param name: name of the UDF
  :param f: python function
- :param returnType: a :class:`DataType` object
+ :param returnType: a :class:`pyspark.sql.types.DataType` object

  >>> spark.catalog.registerFunction("stringLengthString", lambda x: len(x))
  >>> spark.sql("SELECT stringLengthString('test')").collect()
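As a usage note (not part of the diff), the `returnType` documented above accepts any `pyspark.sql.types.DataType`, not just the `StringType` default. A minimal sketch, assuming the same `spark` session used in the doctests; the exact `Row` output is indicative:

```python
from pyspark.sql.types import IntegerType

# Register a UDF with an explicit DataType instead of the StringType default.
spark.catalog.registerFunction("stringLengthInt", lambda x: len(x), IntegerType())
spark.sql("SELECT stringLengthInt('test')").collect()  # e.g. [Row(stringLengthInt(test)=4)]
```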

python/pyspark/sql/context.py

Lines changed: 25 additions & 19 deletions
@@ -152,9 +152,9 @@ def udf(self):
  @since(1.4)
  def range(self, start, end=None, step=1, numPartitions=None):
      """
-     Create a :class:`DataFrame` with single LongType column named `id`,
-     containing elements in a range from `start` to `end` (exclusive) with
-     step value `step`.
+     Create a :class:`DataFrame` with single :class:`pyspark.sql.types.LongType` column named
+     ``id``, containing elements in a range from ``start`` to ``end`` (exclusive) with
+     step value ``step``.

      :param start: the start value
      :param end: the end value (exclusive)
@@ -184,7 +184,7 @@ def registerFunction(self, name, f, returnType=StringType()):

  :param name: name of the UDF
  :param f: python function
- :param returnType: a :class:`DataType` object
+ :param returnType: a :class:`pyspark.sql.types.DataType` object

  >>> sqlContext.registerFunction("stringLengthString", lambda x: len(x))
  >>> sqlContext.sql("SELECT stringLengthString('test')").collect()
@@ -209,7 +209,7 @@ def _inferSchema(self, rdd, samplingRatio=None):

  :param rdd: an RDD of Row or tuple
  :param samplingRatio: sampling ratio, or no sampling (default)
- :return: StructType
+ :return: :class:`pyspark.sql.types.StructType`
  """
  return self.sparkSession._inferSchema(rdd, samplingRatio)

@@ -226,28 +226,34 @@ def createDataFrame(self, data, schema=None, samplingRatio=None):
  from ``data``, which should be an RDD of :class:`Row`,
  or :class:`namedtuple`, or :class:`dict`.

- When ``schema`` is :class:`DataType` or datatype string, it must match the real data, or
- exception will be thrown at runtime. If the given schema is not StructType, it will be
- wrapped into a StructType as its only field, and the field name will be "value", each record
- will also be wrapped into a tuple, which can be converted to row later.
+ When ``schema`` is :class:`pyspark.sql.types.DataType` or
+ :class:`pyspark.sql.types.StringType`, it must match the
+ real data, or an exception will be thrown at runtime. If the given schema is not
+ :class:`pyspark.sql.types.StructType`, it will be wrapped into a
+ :class:`pyspark.sql.types.StructType` as its only field, and the field name will be "value",
+ each record will also be wrapped into a tuple, which can be converted to row later.

  If schema inference is needed, ``samplingRatio`` is used to determined the ratio of
  rows used for schema inference. The first row will be used if ``samplingRatio`` is ``None``.

- :param data: an RDD of any kind of SQL data representation(e.g. row, tuple, int, boolean,
-     etc.), or :class:`list`, or :class:`pandas.DataFrame`.
- :param schema: a :class:`DataType` or a datatype string or a list of column names, default
-     is None. The data type string format equals to `DataType.simpleString`, except that
-     top level struct type can omit the `struct<>` and atomic types use `typeName()` as
-     their format, e.g. use `byte` instead of `tinyint` for ByteType. We can also use `int`
-     as a short name for IntegerType.
+ :param data: an RDD of any kind of SQL data representation(e.g. :class:`Row`,
+     :class:`tuple`, ``int``, ``boolean``, etc.), or :class:`list`, or
+     :class:`pandas.DataFrame`.
+ :param schema: a :class:`pyspark.sql.types.DataType` or a
+     :class:`pyspark.sql.types.StringType` or a list of
+     column names, default is None. The data type string format equals to
+     :class:`pyspark.sql.types.DataType.simpleString`, except that top level struct type can
+     omit the ``struct<>`` and atomic types use ``typeName()`` as their format, e.g. use
+     ``byte`` instead of ``tinyint`` for :class:`pyspark.sql.types.ByteType`.
+     We can also use ``int`` as a short name for :class:`pyspark.sql.types.IntegerType`.
  :param samplingRatio: the sample ratio of rows used for inferring
  :return: :class:`DataFrame`

  .. versionchanged:: 2.0
-     The schema parameter can be a DataType or a datatype string after 2.0. If it's not a
-     StructType, it will be wrapped into a StructType and each record will also be wrapped
-     into a tuple.
+     The ``schema`` parameter can be a :class:`pyspark.sql.types.DataType` or a
+     :class:`pyspark.sql.types.StringType` after 2.0.
+     If it's not a :class:`pyspark.sql.types.StructType`, it will be wrapped into a
+     :class:`pyspark.sql.types.StructType` and each record will also be wrapped into a tuple.

  >>> l = [('Alice', 1)]
  >>> sqlContext.createDataFrame(l).collect()
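For context (not part of the patch), a small sketch of the datatype-string schema format described in the docstring above, assuming the same `sqlContext` as the doctests; the `Row` output shown is indicative:

```python
# Top-level struct type with the struct<> wrapper omitted, using typeName()-style names.
sqlContext.createDataFrame([('Alice', 1)], "name: string, age: int").collect()
# [Row(name=u'Alice', age=1)]

# 'byte' is accepted in place of 'tinyint', and 'int' is a short name for IntegerType.
sqlContext.createDataFrame([(1, 2)], "a: byte, b: int").collect()
# [Row(a=1, b=2)]
```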

python/pyspark/sql/dataframe.py

Lines changed: 1 addition & 1 deletion
@@ -196,7 +196,7 @@ def writeStream(self):
  @property
  @since(1.3)
  def schema(self):
-     """Returns the schema of this :class:`DataFrame` as a :class:`types.StructType`.
+     """Returns the schema of this :class:`DataFrame` as a :class:`pyspark.sql.types.StructType`.

      >>> df.schema
      StructType(List(StructField(age,IntegerType,true),StructField(name,StringType,true)))

python/pyspark/sql/functions.py

Lines changed: 13 additions & 8 deletions
@@ -142,7 +142,7 @@ def _():
  _binary_mathfunctions = {
      'atan2': 'Returns the angle theta from the conversion of rectangular coordinates (x, y) to' +
          'polar coordinates (r, theta).',
-     'hypot': 'Computes `sqrt(a^2 + b^2)` without intermediate overflow or underflow.',
+     'hypot': 'Computes ``sqrt(a^2 + b^2)`` without intermediate overflow or underflow.',
      'pow': 'Returns the value of the first argument raised to the power of the second argument.',
  }

@@ -958,7 +958,8 @@ def months_between(date1, date2):
  @since(1.5)
  def to_date(col):
      """
-     Converts the column of StringType or TimestampType into DateType.
+     Converts the column of :class:`pyspark.sql.types.StringType` or
+     :class:`pyspark.sql.types.TimestampType` into :class:`pyspark.sql.types.DateType`.

      >>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
      >>> df.select(to_date(df.t).alias('date')).collect()
@@ -1074,18 +1075,18 @@ def window(timeColumn, windowDuration, slideDuration=None, startTime=None):
  [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in
  the order of months are not supported.

- The time column must be of TimestampType.
+ The time column must be of :class:`pyspark.sql.types.TimestampType`.

  Durations are provided as strings, e.g. '1 second', '1 day 12 hours', '2 minutes'. Valid
  interval strings are 'week', 'day', 'hour', 'minute', 'second', 'millisecond', 'microsecond'.
- If the `slideDuration` is not provided, the windows will be tumbling windows.
+ If the ``slideDuration`` is not provided, the windows will be tumbling windows.

  The startTime is the offset with respect to 1970-01-01 00:00:00 UTC with which to start
  window intervals. For example, in order to have hourly tumbling windows that start 15 minutes
  past the hour, e.g. 12:15-13:15, 13:15-14:15... provide `startTime` as `15 minutes`.

  The output column will be a struct called 'window' by default with the nested columns 'start'
- and 'end', where 'start' and 'end' will be of `TimestampType`.
+ and 'end', where 'start' and 'end' will be of :class:`pyspark.sql.types.TimestampType`.

  >>> df = spark.createDataFrame([("2016-03-11 09:00:07", 1)]).toDF("date", "val")
  >>> w = df.groupBy(window("date", "5 seconds")).agg(sum("val").alias("sum"))
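As a supplementary sketch (not part of the diff) of the `startTime` behaviour described above, assuming the `df` from the doctest: hourly tumbling windows offset 15 minutes past the hour.

```python
from pyspark.sql.functions import window

# No slideDuration, so these are tumbling windows; startTime shifts each window
# boundary 15 minutes past the hour (12:15-13:15, 13:15-14:15, ...).
hourly = df.groupBy(window("date", "1 hour", startTime="15 minutes")).count()
```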
@@ -1367,7 +1368,7 @@ def locate(substr, str, pos=1):
  could not be found in str.

  :param substr: a string
- :param str: a Column of StringType
+ :param str: a Column of :class:`pyspark.sql.types.StringType`
  :param pos: start position (zero based)

  >>> df = spark.createDataFrame([('abcd',)], ['s',])
@@ -1506,8 +1507,9 @@ def bin(col):
  @ignore_unicode_prefix
  @since(1.5)
  def hex(col):
-     """Computes hex value of the given column, which could be StringType,
-     BinaryType, IntegerType or LongType.
+     """Computes hex value of the given column, which could be :class:`pyspark.sql.types.StringType`,
+     :class:`pyspark.sql.types.BinaryType`, :class:`pyspark.sql.types.IntegerType` or
+     :class:`pyspark.sql.types.LongType`.

      >>> spark.createDataFrame([('ABC', 3)], ['a', 'b']).select(hex('a'), hex('b')).collect()
      [Row(hex(a)=u'414243', hex(b)=u'3')]
@@ -1781,6 +1783,9 @@ def udf(f, returnType=StringType()):
  duplicate invocations may be eliminated or the function may even be invoked more times than
  it is present in the query.

+ :param f: python function
+ :param returnType: a :class:`pyspark.sql.types.DataType` object
+
  >>> from pyspark.sql.types import IntegerType
  >>> slen = udf(lambda s: len(s), IntegerType())
  >>> df.select(slen(df.name).alias('slen')).collect()

python/pyspark/sql/readwriter.py

Lines changed: 4 additions & 4 deletions
@@ -96,7 +96,7 @@ def schema(self, schema):
  By specifying the schema here, the underlying data source can skip the schema
  inference step, and thus speed up data loading.

- :param schema: a StructType object
+ :param schema: a :class:`pyspark.sql.types.StructType` object
  """
  if not isinstance(schema, StructType):
      raise TypeError("schema should be StructType")
@@ -125,7 +125,7 @@ def load(self, path=None, format=None, schema=None, **options):

  :param path: optional string or a list of string for file-system backed data sources.
  :param format: optional string for format of the data source. Default to 'parquet'.
- :param schema: optional :class:`StructType` for the input schema.
+ :param schema: optional :class:`pyspark.sql.types.StructType` for the input schema.
  :param options: all other string options

  >>> df = spark.read.load('python/test_support/sql/parquet_partitioned', opt1=True,
@@ -166,7 +166,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,

  :param path: string represents path to the JSON dataset,
      or RDD of Strings storing JSON objects.
- :param schema: an optional :class:`StructType` for the input schema.
+ :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema.
  :param primitivesAsString: infers all primitive values as a string type. If None is set,
      it uses the default value, ``false``.
  :param prefersDecimal: infers all floating-point values as a decimal type. If the values
@@ -294,7 +294,7 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
  ``inferSchema`` option or specify the schema explicitly using ``schema``.

  :param path: string, or list of strings, for input path(s).
- :param schema: an optional :class:`StructType` for the input schema.
+ :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema.
  :param sep: sets the single character as a separator for each field and value.
      If None is set, it uses the default value, ``,``.
  :param encoding: decodes the CSV files by the given encoding type. If None is set,
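For illustration (not part of the diff), passing an explicit `pyspark.sql.types.StructType` to the reader, as the `schema` parameters above describe, so that schema inference is skipped; the path below is hypothetical.

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
# The reader skips schema inference because the schema is supplied explicitly.
df = spark.read.schema(schema).json("/hypothetical/path/people.json")
```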

python/pyspark/sql/session.py

Lines changed: 23 additions & 18 deletions
@@ -47,7 +47,7 @@ def toDF(self, schema=None, sampleRatio=None):

  This is a shorthand for ``spark.createDataFrame(rdd, schema, sampleRatio)``

- :param schema: a StructType or list of names of columns
+ :param schema: a :class:`pyspark.sql.types.StructType` or list of names of columns
  :param samplingRatio: the sample ratio of rows used for inferring
  :return: a DataFrame
@@ -274,9 +274,9 @@ def udf(self):
  @since(2.0)
  def range(self, start, end=None, step=1, numPartitions=None):
      """
-     Create a :class:`DataFrame` with single LongType column named `id`,
-     containing elements in a range from `start` to `end` (exclusive) with
-     step value `step`.
+     Create a :class:`DataFrame` with single :class:`pyspark.sql.types.LongType` column named
+     ``id``, containing elements in a range from ``start`` to ``end`` (exclusive) with
+     step value ``step``.

      :param start: the start value
      :param end: the end value (exclusive)
@@ -307,7 +307,7 @@ def _inferSchemaFromList(self, data):
  Infer schema from list of Row or tuple.

  :param data: list of Row or tuple
- :return: StructType
+ :return: :class:`pyspark.sql.types.StructType`
  """
  if not data:
      raise ValueError("can not infer schema from empty dataset")
@@ -326,7 +326,7 @@ def _inferSchema(self, rdd, samplingRatio=None):

  :param rdd: an RDD of Row or tuple
  :param samplingRatio: sampling ratio, or no sampling (default)
- :return: StructType
+ :return: :class:`pyspark.sql.types.StructType`
  """
  first = rdd.first()
  if not first:
@@ -414,28 +414,33 @@ def createDataFrame(self, data, schema=None, samplingRatio=None):
  from ``data``, which should be an RDD of :class:`Row`,
  or :class:`namedtuple`, or :class:`dict`.

- When ``schema`` is :class:`DataType` or datatype string, it must match the real data, or
- exception will be thrown at runtime. If the given schema is not StructType, it will be
- wrapped into a StructType as its only field, and the field name will be "value", each record
- will also be wrapped into a tuple, which can be converted to row later.
+ When ``schema`` is :class:`pyspark.sql.types.DataType` or
+ :class:`pyspark.sql.types.StringType`, it must match the
+ real data, or an exception will be thrown at runtime. If the given schema is not
+ :class:`pyspark.sql.types.StructType`, it will be wrapped into a
+ :class:`pyspark.sql.types.StructType` as its only field, and the field name will be "value",
+ each record will also be wrapped into a tuple, which can be converted to row later.

  If schema inference is needed, ``samplingRatio`` is used to determined the ratio of
  rows used for schema inference. The first row will be used if ``samplingRatio`` is ``None``.

  :param data: an RDD of any kind of SQL data representation(e.g. row, tuple, int, boolean,
      etc.), or :class:`list`, or :class:`pandas.DataFrame`.
- :param schema: a :class:`DataType` or a datatype string or a list of column names, default
-     is None. The data type string format equals to `DataType.simpleString`, except that
-     top level struct type can omit the `struct<>` and atomic types use `typeName()` as
-     their format, e.g. use `byte` instead of `tinyint` for ByteType. We can also use `int`
-     as a short name for IntegerType.
+ :param schema: a :class:`pyspark.sql.types.DataType` or a
+     :class:`pyspark.sql.types.StringType` or a list of
+     column names, default is ``None``. The data type string format equals to
+     :class:`pyspark.sql.types.DataType.simpleString`, except that top level struct type can
+     omit the ``struct<>`` and atomic types use ``typeName()`` as their format, e.g. use
+     ``byte`` instead of ``tinyint`` for :class:`pyspark.sql.types.ByteType`. We can also use
+     ``int`` as a short name for ``IntegerType``.
  :param samplingRatio: the sample ratio of rows used for inferring
  :return: :class:`DataFrame`

  .. versionchanged:: 2.0
-     The schema parameter can be a DataType or a datatype string after 2.0. If it's not a
-     StructType, it will be wrapped into a StructType and each record will also be wrapped
-     into a tuple.
+     The ``schema`` parameter can be a :class:`pyspark.sql.types.DataType` or a
+     :class:`pyspark.sql.types.StringType` after 2.0. If it's not a
+     :class:`pyspark.sql.types.StructType`, it will be wrapped into a
+     :class:`pyspark.sql.types.StructType` and each record will also be wrapped into a tuple.

  >>> l = [('Alice', 1)]
  >>> spark.createDataFrame(l).collect()
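A brief sketch (not part of the patch) of the wrapping behaviour described above when `schema` is an atomic `DataType` rather than a `StructType`, assuming the same `spark` session as the doctests; the output shown is indicative.

```python
# An atomic type is wrapped into a StructType with a single field named "value",
# and each record is wrapped into a tuple.
spark.createDataFrame([1, 2, 3], "int").collect()
# [Row(value=1), Row(value=2), Row(value=3)]
```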

python/pyspark/sql/streaming.py

Lines changed: 4 additions & 4 deletions
@@ -269,7 +269,7 @@ def schema(self, schema):

  .. note:: Experimental.

- :param schema: a StructType object
+ :param schema: a :class:`pyspark.sql.types.StructType` object

  >>> s = spark.readStream.schema(sdf_schema)
  """
@@ -310,7 +310,7 @@ def load(self, path=None, format=None, schema=None, **options):

  :param path: optional string for file-system backed data sources.
  :param format: optional string for format of the data source. Default to 'parquet'.
- :param schema: optional :class:`StructType` for the input schema.
+ :param schema: optional :class:`pyspark.sql.types.StructType` for the input schema.
  :param options: all other string options

  >>> json_sdf = spark.readStream.format("json")\
@@ -349,7 +349,7 @@ def json(self, path, schema=None, primitivesAsString=None, prefersDecimal=None,

  :param path: string represents path to the JSON dataset,
      or RDD of Strings storing JSON objects.
- :param schema: an optional :class:`StructType` for the input schema.
+ :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema.
  :param primitivesAsString: infers all primitive values as a string type. If None is set,
      it uses the default value, ``false``.
  :param prefersDecimal: infers all floating-point values as a decimal type. If the values
@@ -461,7 +461,7 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
  .. note:: Experimental.

  :param path: string, or list of strings, for input path(s).
- :param schema: an optional :class:`StructType` for the input schema.
+ :param schema: an optional :class:`pyspark.sql.types.StructType` for the input schema.
  :param sep: sets the single character as a separator for each field and value.
      If None is set, it uses the default value, ``,``.
  :param encoding: decodes the CSV files by the given encoding type. If None is set,

python/pyspark/sql/types.py

Lines changed: 4 additions & 3 deletions
@@ -786,9 +786,10 @@ def _parse_struct_fields_string(s):
  def _parse_datatype_string(s):
      """
      Parses the given data type string to a :class:`DataType`. The data type string format equals
-     to `DataType.simpleString`, except that top level struct type can omit the `struct<>` and
-     atomic types use `typeName()` as their format, e.g. use `byte` instead of `tinyint` for
-     ByteType. We can also use `int` as a short name for IntegerType.
+     to :class:`DataType.simpleString`, except that top level struct type can omit
+     the ``struct<>`` and atomic types use ``typeName()`` as their format, e.g. use ``byte`` instead
+     of ``tinyint`` for :class:`ByteType`. We can also use ``int`` as a short name
+     for :class:`IntegerType`.

      >>> _parse_datatype_string("int ")
      IntegerType
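For reference (not part of the diff), a sketch of the shorthand names this docstring mentions, using the module-level parser; the results noted in the comments are an assumption about the printed repr.

```python
from pyspark.sql.types import _parse_datatype_string

_parse_datatype_string("byte")                 # ByteType ("byte" used instead of "tinyint")
_parse_datatype_string("int")                  # IntegerType ("int" accepted as a short name)
_parse_datatype_string("a: byte, b: string")   # top-level struct<> wrapper omitted
```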
