
Commit 0fe65b5

itholic authored and Hyukjin Kwon committed
[SPARK-35395][DOCS] Move ORC data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes to move the ORC data source options from the Python, Scala and Java documentation into a single page.

### Why are the changes needed?

So far, the documentation for ORC data source options has been separated into different pages for each language API. This makes the many options inconvenient to manage, so it is more efficient to document all options on a single page and provide a link to that page from the API docs of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown as below after this change:

- "ORC Files" page
![Screen Shot 2021-05-21 at 2 07 14 PM](https://user-images.githubusercontent.com/44108233/119085078-f4564d00-ba3d-11eb-8990-3ba031d809da.png)
- Python
![Screen Shot 2021-05-21 at 2 06 46 PM](https://user-images.githubusercontent.com/44108233/119085097-00daa580-ba3e-11eb-8017-ac5a95a7c053.png)
- Scala
![Screen Shot 2021-05-21 at 2 06 09 PM](https://user-images.githubusercontent.com/44108233/119085135-164fcf80-ba3e-11eb-9cac-78dded523f38.png)
- Java
![Screen Shot 2021-05-21 at 2 06 30 PM](https://user-images.githubusercontent.com/44108233/119085125-118b1b80-ba3e-11eb-9434-f26612d7da13.png)

### How was this patch tested?

Manually built the docs and confirmed the pages.

Closes #32546 from itholic/SPARK-35395.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent 34284c0 · commit 0fe65b5

File tree

6 files changed: +59 -75 lines


docs/sql-data-sources-orc.md

Lines changed: 26 additions & 0 deletions
```diff
@@ -172,3 +172,29 @@ When reading from Hive metastore ORC tables and inserting to Hive metastore ORC
     <td>2.0.0</td>
   </tr>
 </table>
+
+## Data Source Option
+
+Data source options of ORC can be set via:
+* the `.option`/`.options` methods of
+  * `DataFrameReader`
+  * `DataFrameWriter`
+  * `DataStreamReader`
+  * `DataStreamWriter`
+
+<table class="table">
+  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr>
+    <td><code>mergeSchema</code></td>
+    <td>None</td>
+    <td>sets whether we should merge schemas collected from all ORC part-files. This will override <code>spark.sql.orc.mergeSchema</code>. The default value is specified in <code>spark.sql.orc.mergeSchema</code>.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>compression</code></td>
+    <td>None</td>
+    <td>compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, snappy, zlib, lzo, and zstd). This will override <code>orc.compress</code> and <code>spark.sql.orc.compression.codec</code>. If None is set, it uses the value specified in <code>spark.sql.orc.compression.codec</code>.</td>
+    <td>write</td>
+  </tr>
+</table>
+Other generic options can be found in <a href="https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html"> Generic File Source Options</a>.
```
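The two new table entries map directly onto the `.option` method listed above them. A minimal PySpark sketch of both, with hypothetical input/output paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read: merge schemas collected from all ORC part-files for this read,
# overriding spark.sql.orc.mergeSchema.
df = spark.read.option("mergeSchema", "true").orc("/tmp/orc/input")

# Write: pick a compression codec for this write, overriding orc.compress
# and spark.sql.orc.compression.codec.
df.write.option("compression", "zstd").orc("/tmp/orc/output")
```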

python/pyspark/sql/readwriter.py

Lines changed: 13 additions & 27 deletions
```diff
@@ -793,28 +793,13 @@ def orc(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLookup=N
         Parameters
         ----------
         path : str or list
-        mergeSchema : str or bool, optional
-            sets whether we should merge schemas collected from all
-            ORC part-files. This will override ``spark.sql.orc.mergeSchema``.
-            The default value is specified in ``spark.sql.orc.mergeSchema``.
-        pathGlobFilter : str or bool
-            an optional glob pattern to only include files with paths matching
-            the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-            It does not change the behavior of
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
-        recursiveFileLookup : str or bool
-            recursively scan a directory for files. Using this option
-            disables
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
 
-            modification times occurring before the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-        modifiedBefore : an optional timestamp to only include files with
-            modification times occurring before the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-        modifiedAfter : an optional timestamp to only include files with
-            modification times occurring after the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
+
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_  # noqa
+            in the version you use.
 
         Examples
         --------
@@ -1417,12 +1402,13 @@ def orc(self, path, mode=None, partitionBy=None, compression=None):
             exists.
         partitionBy : str or list, optional
             names of partitioning columns
-        compression : str, optional
-            compression codec to use when saving to file. This can be one of the
-            known case-insensitive shorten names (none, snappy, zlib, lzo, and zstd).
-            This will override ``orc.compress`` and
-            ``spark.sql.orc.compression.codec``. If None is set, it uses the value
-            specified in ``spark.sql.orc.compression.codec``.
+
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_  # noqa
+            in the version you use.
 
         Examples
         --------
```
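Removing the keyword documentation does not remove the keywords themselves: per the signatures in the hunk headers above, `mergeSchema` remains a parameter of `DataFrameReader.orc` and `compression` of `DataFrameWriter.orc`; only their descriptions now live on the linked page. A minimal sketch of both spellings, with hypothetical paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Keyword form, matching the signatures shown in the hunk headers.
df = spark.read.orc("/tmp/orc/input", mergeSchema=True)

# Equivalent generic form via .options, as on the new docs page.
df = spark.read.options(mergeSchema="true").orc("/tmp/orc/input")

# Writer side: compression is still a keyword of DataFrameWriter.orc.
df.write.orc("/tmp/orc/output", mode="overwrite", compression="snappy")
```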

python/pyspark/sql/streaming.py

Lines changed: 6 additions & 14 deletions
```diff
@@ -637,20 +637,12 @@ def orc(self, path, mergeSchema=None, pathGlobFilter=None, recursiveFileLookup=N
 
         .. versionadded:: 2.3.0
 
-        Parameters
-        ----------
-        mergeSchema : str or bool, optional
-            sets whether we should merge schemas collected from all
-            ORC part-files. This will override ``spark.sql.orc.mergeSchema``.
-            The default value is specified in ``spark.sql.orc.mergeSchema``.
-        pathGlobFilter : str or bool, optional
-            an optional glob pattern to only include files with paths matching
-            the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-            It does not change the behavior of `partition discovery`_.
-        recursiveFileLookup : str or bool, optional
-            recursively scan a directory for files. Using this option
-            disables
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_  # noqa
+            in the version you use.
 
         Examples
         --------
```
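The streaming reader accepts the same extra options through `.option`/`.options`; one practical note is that file-based streaming sources need an explicit schema. A minimal PySpark sketch, with a hypothetical schema and directory:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# File-based streaming sources require a user-defined schema.
schema = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
])

stream_df = (spark.readStream
             .schema(schema)
             .option("mergeSchema", "true")  # extra ORC option, per the linked page
             .orc("/tmp/orc/stream-in"))
```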

sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala

Lines changed: 4 additions & 17 deletions
```diff
@@ -874,23 +874,10 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
   /**
    * Loads ORC files and returns the result as a `DataFrame`.
    *
-   * You can set the following ORC-specific option(s) for reading ORC files:
-   * <ul>
-   * <li>`mergeSchema` (default is the value specified in `spark.sql.orc.mergeSchema`): sets whether
-   * we should merge schemas collected from all ORC part-files. This will override
-   * `spark.sql.orc.mergeSchema`.</li>
-   * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
-   * the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
-   * It does not change the behavior of partition discovery.</li>
-   * <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
-   * modification times occurring before the specified Time. The provided timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
-   * <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
-   * modification times occurring after the specified Time. The provided timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
-   * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
-   * disables partition discovery</li>
-   * </ul>
+   * ORC-specific option(s) for reading ORC files can be found in
+   * <a href=
+   *   "https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
    *
    * @param paths input paths
    * @since 2.0.0
```
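The deleted list was the only place the batch-only `modifiedBefore`/`modifiedAfter` filters were spelled out next to this method; they keep working as plain options. A sketch in PySpark for consistency with the earlier examples (path and timestamp are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Batch-only file filter: only read files whose modification time is after
# the given timestamp, in the YYYY-MM-DDTHH:mm:ss form noted above.
df = (spark.read
      .option("modifiedAfter", "2020-06-01T13:00:00")
      .orc("/tmp/orc/input"))
```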

sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala

Lines changed: 4 additions & 8 deletions
```diff
@@ -881,14 +881,10 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
    *   format("orc").save(path)
    * }}}
    *
-   * You can set the following ORC-specific option(s) for writing ORC files:
-   * <ul>
-   * <li>`compression` (default is the value specified in `spark.sql.orc.compression.codec`):
-   * compression codec to use when saving to file. This can be one of the known case-insensitive
-   * shorten names(`none`, `snappy`, `zlib`, `lzo`, and `zstd`). This will override
-   * `orc.compress` and `spark.sql.orc.compression.codec`. If `orc.compress` is given,
-   * it overrides `spark.sql.orc.compression.codec`.</li>
-   * </ul>
+   * ORC-specific option(s) for writing ORC files can be found in
+   * <a href=
+   *   "https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
    *
    * @since 1.5.0
    */
```
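The removed scaladoc also encoded a precedence rule: `compression` beats `orc.compress`, which beats `spark.sql.orc.compression.codec`. A PySpark sketch of that rule, under hypothetical settings and paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10)

# Session default codec: lowest precedence.
spark.conf.set("spark.sql.orc.compression.codec", "snappy")

# orc.compress overrides the session default; compression overrides both,
# so this write produces zstd-compressed files.
(df.write
 .option("orc.compress", "zlib")
 .option("compression", "zstd")
 .orc("/tmp/orc/precedence-demo"))
```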

sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala

Lines changed: 6 additions & 9 deletions
```diff
@@ -453,20 +453,17 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
   /**
    * Loads a ORC file stream, returning the result as a `DataFrame`.
    *
-   * You can set the following ORC-specific option(s) for reading ORC files:
+   * You can set the following option(s):
    * <ul>
    * <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
    * considered in every trigger.</li>
-   * <li>`mergeSchema` (default is the value specified in `spark.sql.orc.mergeSchema`): sets whether
-   * we should merge schemas collected from all ORC part-files. This will override
-   * `spark.sql.orc.mergeSchema`.</li>
-   * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
-   * the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
-   * It does not change the behavior of partition discovery.</li>
-   * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
-   * disables partition discovery</li>
    * </ul>
    *
+   * ORC-specific option(s) for reading ORC file stream can be found in
+   * <a href=
+   *   "https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
+   *
    * @since 2.3.0
    */
   def orc(path: String): DataFrame = {
```
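`maxFilesPerTrigger` is the one option the scaladoc keeps inline, since it belongs to the streaming source rather than the ORC format. A PySpark sketch of what it controls (schema and directory are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
schema = StructType([StructField("value", StringType())])

# Each micro-batch considers at most 5 new files from the directory,
# rather than every file that has appeared since the last trigger.
stream_df = (spark.readStream
             .schema(schema)
             .option("maxFilesPerTrigger", 5)
             .orc("/tmp/orc/stream-in"))
```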

0 commit comments
