[SPARK-35509][DOCS] Move text data source options from Python and Scala into a single page #32660

Closed
wants to merge 4 commits
35 changes: 33 additions & 2 deletions docs/sql-data-sources-text.md
@@ -21,8 +21,6 @@ license: |

Spark SQL provides `spark.read().text("file_name")` to read a file or directory of text files into a Spark DataFrame, and `dataframe.write().text("path")` to write to a text file. When reading a text file, each line becomes a row with a single string column named "value" by default. The line separator can be changed as shown in the example below. The `option()` function can be used to customize read or write behavior, such as the line separator, compression, and so on.

<!--TODO(SPARK-34491): add `option()` document reference-->

<div class="codetabs">

<div data-lang="scala" markdown="1">
@@ -38,3 +36,36 @@ Spark SQL provides `spark.read().text("file_name")` to read a file or directory
</div>

</div>

## Data Source Option

Data source options for text can be set via:
* the `.option`/`.options` methods of
  * `DataFrameReader`
  * `DataFrameWriter`
  * `DataStreamReader`
  * `DataStreamWriter`
* the `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
Member: Sorry I missed it here. This is not in "the .option/.options methods of". Can you make the level down?

Contributor (author): Sure, let me fix it in #32658.

<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
<tr>
<td><code>wholetext</code></td>
<td>false</td>
<td>If true, read each file from input path(s) as a single row.</td>
<td>read</td>
</tr>
<tr>
<td><code>lineSep</code></td>
<td><code>\r</code>, <code>\r\n</code>, <code>\n</code> for reading, <code>\n</code> for writing</td>
<td>Defines the line separator that should be used for reading or writing.</td>
<td>read/write</td>
</tr>
<tr>
<td><code>compression</code></td>
<td>(none)</td>
<td>Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate).</td>
<td>write</td>
</tr>
</table>
Other generic options can be found in <a href="https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html">Generic File Source Options</a>.
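
As a quick illustration of the table above, here is a minimal Scala sketch of passing these options through `option()` and through the SQL `OPTIONS` clause. The paths, table name, and option values are placeholders, and a running `SparkSession` is assumed.

```scala
import org.apache.spark.sql.SparkSession

object TextOptionsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("text-options").getOrCreate()

    // Read semicolon-separated "lines" from a placeholder input path.
    val df = spark.read
      .option("lineSep", ";")
      .text("path/to/input")

    // Write gzip-compressed text to a placeholder output path.
    df.write
      .option("compression", "gzip")
      .text("path/to/output")

    // The same options can also be supplied via the OPTIONS clause of CREATE TABLE USING.
    spark.sql(
      """CREATE TABLE text_tbl (value STRING)
        |USING text
        |OPTIONS (lineSep = ';')
        |LOCATION 'path/to/input'""".stripMargin)

    spark.stop()
  }
}
```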
41 changes: 13 additions & 28 deletions python/pyspark/sql/readwriter.py
@@ -313,28 +313,13 @@ def text(self, paths, wholetext=False, lineSep=None, pathGlobFilter=None,
----------
paths : str or list
string, or list of strings, for input path(s).
wholetext : str or bool, optional
if true, read each file from input path(s) as a single row.
lineSep : str, optional
defines the line separator that should be used for parsing. If None is
set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
pathGlobFilter : str or bool, optional
an optional glob pattern to only include files with paths matching
the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
It does not change the behavior of
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
recursiveFileLookup : str or bool, optional
recursively scan a directory for files. Using this option disables
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa

modifiedBefore (batch only) : an optional timestamp to only include files with
modification times occurring before the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
modifiedAfter (batch only) : an optional timestamp to only include files with
modification times occurring after the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ # noqa
in the version you use.

Examples
--------
@@ -1038,13 +1023,13 @@ def text(self, path, compression=None, lineSep=None):
----------
path : str
the path in any Hadoop supported file system
compression : str, optional
compression codec to use when saving to file. This can be one of the
known case-insensitive shorten names (none, bzip2, gzip, lz4,
snappy and deflate).
lineSep : str, optional
defines the line separator that should be used for writing. If None is
set, it uses the default value, ``\\n``.

Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ # noqa
in the version you use.

The DataFrame must have only one column that is of string type.
Each row becomes a new line in the output file.
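For reference alongside the docstring changes above, a hedged sketch of the same batch read and write options, written here in Scala (whose `DataFrameReader`/`DataFrameWriter` the PySpark methods wrap). The paths and timestamp are placeholders, and an existing `SparkSession` named `spark` is assumed.

```scala
// Assumes an existing SparkSession `spark`; paths and the timestamp are placeholders.
val logs = spark.read
  .option("pathGlobFilter", "*.txt")               // only include files matching the glob
  .option("recursiveFileLookup", "true")           // recurse; disables partition discovery
  .option("modifiedAfter", "2020-06-01T13:00:00")  // batch only
  .text("path/to/input")

logs.write
  .option("compression", "bzip2")
  .option("lineSep", "\n")
  .text("path/to/output")
```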
20 changes: 7 additions & 13 deletions python/pyspark/sql/streaming.py
@@ -593,19 +593,13 @@ def text(self, path, wholetext=False, lineSep=None, pathGlobFilter=None,
----------
paths : str or list
string, or list of strings, for input path(s).
wholetext : str or bool, optional
if true, read each file from input path(s) as a single row.
lineSep : str, optional
defines the line separator that should be used for parsing. If None is
set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
pathGlobFilter : str or bool, optional
an optional glob pattern to only include files with paths matching
the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
It does not change the behavior of `partition discovery`_.
recursiveFileLookup : str or bool, optional
recursively scan a directory for files. Using this option
disables
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa

Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ # noqa
in the version you use.

Notes
-----
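Similarly, a hedged Scala sketch of a streaming read with the options described above; the directory is a placeholder and an existing `SparkSession` named `spark` is assumed.

```scala
// Assumes an existing SparkSession `spark`; the directory is a placeholder.
val lines = spark.readStream
  .option("wholetext", "false")
  .option("pathGlobFilter", "*.log")
  .text("path/to/streaming/dir")

lines.printSchema()  // a single string column named "value"
```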
21 changes: 3 additions & 18 deletions sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -773,24 +773,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* spark.read().text("/path/to/spark/README.md")
* }}}
*
* You can set the following text-specific option(s) for reading text files:
* <ul>
* <li>`wholetext` (default `false`): If true, read a file as a single row and not split by "\n".
* </li>
* <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
* that should be used for parsing.</li>
* <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
* the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
* It does not change the behavior of partition discovery.</li>
* <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
* modification times occurring before the specified Time. The provided timestamp
* must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
* <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
* modification times occurring after the specified Time. The provided timestamp
* must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
* <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
* disables partition discovery</li>
* </ul>
* You can find the text-specific options for reading text files in
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @param paths input paths
* @since 1.6.0
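To make the pointer above concrete, a minimal sketch of the reader call with one of the documented options. The README path comes from the scaladoc example above, and an existing `SparkSession` named `spark` is assumed.

```scala
// Assumes an existing SparkSession `spark`.
val readme = spark.read
  .option("wholetext", "true")  // read the whole file as a single row
  .text("/path/to/spark/README.md")

readme.show(1, truncate = false)
```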
sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -833,13 +833,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* }}}
* The text files will be encoded as UTF-8.
*
* You can set the following option(s) for writing text files:
* <ul>
* <li>`compression` (default `null`): compression codec to use when saving to file. This can be
* one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
* `snappy` and `deflate`). </li>
* <li>`lineSep` (default `\n`): defines the line separator that should be used for writing.</li>
* </ul>
* You can find the text-specific options for writing text files in
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @since 1.6.0
*/
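Likewise for the writer, a minimal sketch under the assumption that `df` is a DataFrame with exactly one string column, as `text()` requires; the path and codec are placeholders.

```scala
// Assumes `df` has exactly one string column; the path is a placeholder.
df.write
  .option("compression", "deflate")  // one of the case-insensitive codec names
  .option("lineSep", "\n")
  .text("path/to/text-out")
```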
sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
@@ -413,21 +413,16 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
* spark.readStream().text("/path/to/directory/")
* }}}
*
* You can set the following text-specific options to deal with text files:
* You can set the following option(s):
* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* <li>`wholetext` (default `false`): If true, read a file as a single row and not split by "\n".
* </li>
* <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
* that should be used for parsing.</li>
* <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
* the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
* It does not change the behavior of partition discovery.</li>
* <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
* disables partition discovery</li>
* </ul>
*
* You can find the text-specific options for reading text files in
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @since 2.0.0
*/
def text(path: String): DataFrame = format("text").load(path)
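And for the streaming reader, a minimal sketch that pairs the generic `maxFilesPerTrigger` option with a text-specific one. The directory is a placeholder, the console sink is used only for illustration, and an existing `SparkSession` named `spark` is assumed.

```scala
// Assumes an existing SparkSession `spark`; the directory is a placeholder.
val stream = spark.readStream
  .option("maxFilesPerTrigger", "1")  // consider at most one new file per trigger
  .option("lineSep", "\n")
  .text("/path/to/directory/")

val query = stream.writeStream
  .format("console")
  .start()

query.awaitTermination()
```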