[SPARK-35395][DOCS] Move ORC data source options from Python and Scala into a single page #32546

Closed
wants to merge 16 commits into from

Contributor

@itholic itholic commented May 14, 2021

What changes were proposed in this pull request?

This PR proposes moving the ORC data source options from the Python, Scala, and Java API documentation into a single page.

Why are the changes needed?

So far, the documentation for the ORC data source options has been split across separate pages, one per language API. This makes the growing number of options hard to maintain, so it is more efficient to manage all options on a single page and link to that page from each language's API documentation.

Does this PR introduce any user-facing change?

Yes, the documentation will appear as shown below after this change:

  • "ORC Files" page
    Screen Shot 2021-05-21 at 2 07 14 PM

  • Python
    Screen Shot 2021-05-21 at 2 06 46 PM

  • Scala
    Screen Shot 2021-05-21 at 2 06 09 PM

  • Java
    Screen Shot 2021-05-21 at 2 06 30 PM

How was this patch tested?

Manually built the docs and confirmed the pages.

SparkQA commented May 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43067/

SparkQA commented May 14, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43067/

SparkQA commented May 14, 2021

Test build #138548 has finished for PR 32546 at commit 0ce01d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 17, 2021

Test build #138632 has finished for PR 32546 at commit 005d87d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43151/

SparkQA commented May 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43151/

SparkQA commented May 17, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43155/

SparkQA commented May 17, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43155/

SparkQA commented May 17, 2021

Test build #138635 has finished for PR 32546 at commit aa54b45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43175/

SparkQA commented May 18, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43175/

@itholic
Contributor Author

itholic commented May 18, 2021

Jenkins, retest this please

SparkQA commented May 18, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43182/

SparkQA commented May 18, 2021

Test build #138654 has finished for PR 32546 at commit 3740c52.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UpdatingSessionsExec(
  • class UpdatingSessionsIterator(

SparkQA commented May 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43261/

SparkQA commented May 20, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43272/

SparkQA commented May 20, 2021

Test build #138738 has finished for PR 32546 at commit 0b0e183.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 20, 2021

Test build #138742 has finished for PR 32546 at commit 6358d59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented May 20, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43272/

SparkQA commented May 20, 2021

Test build #138750 has finished for PR 32546 at commit 3971dd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Looks pretty good otherwise. Don't forget to update the PR description as well. cc @dongjoon-hyun FYI

SparkQA commented May 21, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43300/

SparkQA commented May 21, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43300/

@dongjoon-hyun
Member

Thank you for pinging me, @HyukjinKwon .

<td>write</td>
</tr>
</table>
Other generic options can be found in <a href="https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html"> Generic File Source Options</a>.
Member

@dongjoon-hyun dongjoon-hyun May 21, 2021

Although I know that this is inherited, https://spark.apache.org/docs/latest/ looks fragile to me because it is going to be a broken link when we cut branch-3.2 on July 1st. In branch-3.2, it should point to the 3.2 documentation only. Shall we use a relative link instead of /latest/?

As this PR shows, we don't know what refactoring will happen in the future.
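
To illustrate the concern above, here is a minimal Python sketch contrasting a `/latest/` link with a version-pinned one. The helper `docs_url` and its parameters are hypothetical and not part of Spark; the sketch assumes only that released documentation is published under `https://spark.apache.org/docs/<version>/`.

```python
from typing import Optional

# Base location of the published Spark documentation.
DOCS_ROOT = "https://spark.apache.org/docs"


def docs_url(page: str, anchor: Optional[str] = None, version: str = "latest") -> str:
    """Build a documentation URL for a given release (hypothetical helper)."""
    url = f"{DOCS_ROOT}/{version}/{page}"
    return f"{url}#{anchor}" if anchor else url


# A /latest/ link always follows the newest release, so its target can change.
floating = docs_url("sql-data-sources-orc.html", "data-source-option")

# A version-pinned link keeps pointing at the docs that match the code.
pinned = docs_url("sql-data-sources-orc.html", "data-source-option", version="3.2.0")
```

A relative link, as suggested in the comment, avoids hard-coding either form, but it only works where the linking page and the target are published together.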

Contributor Author

Thanks, @dongjoon-hyun .
I looked into that, but it seems tricky to create a link for each release in Scaladoc.
I created a JIRA to track it separately: SPARK-35481.
I will look into it separately, if that's fine with you too!

----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option>`_ # noqa
Member

Ditto. Can we have a more robust link here?

* </ul>
* ORC-specific option(s) for reading ORC files can be found in
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-orc.html#data-source-option">
Member

Ditto.

@dongjoon-hyun
Member

dongjoon-hyun commented May 21, 2021

Thank you, @itholic and @HyukjinKwon . The refactoring idea looks good to me. I commented only a technical issue about the link usage. I'll leave this to @HyukjinKwon .

Member

@HyukjinKwon HyukjinKwon left a comment

Okay, looks fine given https://github.com/apache/spark/pull/32546/files#r636625094. Please update the PR description.

@itholic
Contributor Author

itholic commented May 21, 2021

Thanks, @HyukjinKwon .
The PR description is updated, and the PR descriptions of #32204 and #32161 are updated as well.

SparkQA commented May 21, 2021

Test build #138776 has finished for PR 32546 at commit 043d308.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
<tr>
<td><code>mergeSchema</code></td>
<td>None</td>
Member

@itholic it has the same issue. The default value isn't None but false.

</tr>
<tr>
<td><code>compression</code></td>
<td>None</td>
Member

In this case, when a default value doesn't exist, you can follow the convention in https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration and write (none).
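
The two review comments above come down to how the Default column is filled in. As a rough sketch, the following Python snippet renders table rows so that a missing default is shown as `(none)` rather than Python's `None`. The helper `render_option_row` and the placeholder descriptions are hypothetical; the defaults and scopes simply mirror what the reviewer asked for here.

```python
from html import escape
from typing import Optional


def render_option_row(name: str, default: Optional[str], meaning: str, scope: str) -> str:
    """Render one row of the options table (hypothetical helper)."""
    # Show "(none)" when no default exists, instead of leaking Python's None.
    shown_default = "(none)" if default is None else default
    cells = [
        f"<code>{escape(name)}</code>",
        escape(shown_default),
        escape(meaning),
        escape(scope),
    ]
    return "<tr>\n" + "\n".join(f"  <td>{cell}</td>" for cell in cells) + "\n</tr>"


rows = [
    # mergeSchema has a real default, so it is written out explicitly.
    render_option_row("mergeSchema", "false", "Placeholder description.", "read"),
    # compression is treated as having no default, per the review comment.
    render_option_row("compression", None, "Placeholder description.", "write"),
]
print("\n".join(rows))
```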

* `DataFrameReader`
* `DataFrameWriter`
* `DataStreamReader`
* `DataStreamWriter`
Member

@HyukjinKwon HyukjinKwon May 25, 2021

also mention:

* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
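
For context, the `OPTIONS` clause mentioned above might be used roughly as follows. This is a sketch only: the helper `create_orc_table_sql`, the table name, and the column list are made up for illustration, and the snippet just composes the SQL text without running it.

```python
from typing import Dict


def create_orc_table_sql(table: str, columns: str, options: Dict[str, str]) -> str:
    """Compose a CREATE TABLE ... USING ORC statement (hypothetical helper)."""
    # Keep the sketch simple: refuse text that would need SQL string escaping.
    for text in (*options.keys(), *options.values()):
        if "'" in text:
            raise ValueError(f"unsupported character in option text: {text!r}")
    # Each data source option becomes a quoted key/value pair in OPTIONS (...).
    opts = ", ".join(f"'{key}' = '{value}'" for key, value in options.items())
    return f"CREATE TABLE {table} ({columns}) USING ORC OPTIONS ({opts})"


sql = create_orc_table_sql("events", "id INT, name STRING", {"compression": "zlib"})
print(sql)
```

With an active `SparkSession` bound to a variable such as `spark`, the resulting string could be passed to `spark.sql(sql)`.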

4 participants