Skip to content

[SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation #29491

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 4 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 3 additions & 2 deletions .github/workflows/build_and_test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -226,7 +226,7 @@ jobs:
run: |
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme
pip3 install flake8 'sphinx<3.1.0' numpy pydata_sphinx_theme ipython nbsphinx
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a requirements.txt somewhere instead?

- name: Install R 4.0
run: |
sudo sh -c "echo 'deb https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/' >> /etc/apt/sources.list"
Expand All @@ -245,10 +245,11 @@ jobs:
ruby-version: 2.7
- name: Install dependencies for documentation generation
run: |
# pandoc is required to generate PySpark APIs as well in nbsphinx.
sudo apt-get install -y libcurl4-openssl-dev pandoc
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme
pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx
gem install jekyll jekyll-redirect-from rouge
sudo Rscript -e "install.packages(c('devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2'), repos='https://cloud.r-project.org/')"
- name: Scala linter
Expand Down
1 change: 1 addition & 0 deletions binder/apt.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1 @@
openjdk-8-jre
24 changes: 24 additions & 0 deletions binder/postBuild
Original file line number Diff line number Diff line change
@@ -0,0 +1,24 @@
#!/bin/bash

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

# This file is used for Binder integration to install PySpark available in
# Jupyter notebook.

VERSION=$(python -c "exec(open('python/pyspark/version.py').read()); print(__version__)")
pip install "pyspark[sql,ml,mllib]<=$VERSION"
3 changes: 2 additions & 1 deletion dev/create-release/spark-rm/Dockerfile
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ ARG APT_INSTALL="apt-get install --no-install-recommends -y"
# TODO(SPARK-32407): Sphinx 3.1+ does not correctly index nested classes.
# See also https://github.com/sphinx-doc/sphinx/issues/7551.
# We should use the latest Sphinx version once this is fixed.
ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1"
ARG PIP_PKGS="sphinx==3.0.4 mkdocs==1.0.4 numpy==1.18.1 pydata_sphinx_theme==0.3.1 ipython==7.16.1 nbsphinx==0.7.1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes me think we should have a shared requirements.txt somewhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We tried that in #27928 but couldn't get buy-in from a release manager.

Also discussed briefly here: #29491 (comment)

Copy link
Member Author

@HyukjinKwon HyukjinKwon Aug 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's run it separately - there's already a JIRA for it SPARK-31167 as @nchammas pointed out. It's a bit complicated than I thought.

ARG GEM_PKGS="jekyll:4.0.0 jekyll-redirect-from:0.16.0 rouge:3.15.0"

# Install extra needed repos and refresh.
Expand Down Expand Up @@ -75,6 +75,7 @@ RUN apt-get clean && apt-get update && $APT_INSTALL gnupg ca-certificates && \
pip3 install $PIP_PKGS && \
# Install R packages and dependencies used when building.
# R depends on pandoc*, libssl (which are installed above).
# Note that PySpark doc generation also needs pandoc due to nbsphinx
$APT_INSTALL r-base r-base-dev && \
$APT_INSTALL texlive-latex-base texlive texlive-fonts-extra texinfo qpdf && \
Rscript -e "install.packages(c('curl', 'xml2', 'httr', 'devtools', 'testthat', 'knitr', 'rmarkdown', 'roxygen2', 'e1071', 'survival'), repos='https://cloud.r-project.org/')" && \
Expand Down
16 changes: 16 additions & 0 deletions dev/lint-python
Original file line number Diff line number Diff line change
Expand Up @@ -196,6 +196,22 @@ function sphinx_test {
return
fi

# TODO(SPARK-32666): Install nbsphinx in Jenkins machines
PYTHON_HAS_NBSPHINX=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("nbsphinx") is not None)')
if [[ "$PYTHON_HAS_NBSPHINX" == "False" ]]; then
echo "$PYTHON_EXECUTABLE does not have nbsphinx installed. Skipping Sphinx build for now."
echo
return
fi

# TODO(SPARK-32666): Install ipython in Jenkins machines
PYTHON_HAS_IPYTHON=$("$PYTHON_EXECUTABLE" -c 'import importlib.util; print(importlib.util.find_spec("ipython") is not None)')
if [[ "$PYTHON_HAS_IPYTHON" == "False" ]]; then
echo "$PYTHON_EXECUTABLE does not have ipython installed. Skipping Sphinx build for now."
echo
return
fi

echo "starting $SPHINX_BUILD tests..."
pushd python/docs &> /dev/null
make clean &> /dev/null
Expand Down
2 changes: 2 additions & 0 deletions dev/requirements.txt
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,5 @@ PyGithub==1.26.0
Unidecode==0.04.19
sphinx
pydata_sphinx_theme
ipython
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other implications to this, like, our python packaging now requires ipython when installing pyspark? (sorry ignorant question)

Copy link
Member Author

@HyukjinKwon HyukjinKwon Aug 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, it means nothing special and this file is used nowhere within Spark. It's just for some dev people but I doubt if this is actually often used. We should leverage this file and standardize the dependencies somehow like @nchammas and @dongjoon-hyun tried before but .. I currently don't have a good idea about how to handle it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The packages that are being installed, when installing pyspark are defined in the setup.py: https://github.com/apache/spark/blob/master/python/setup.py#L206

The defacto way is to add an extras_require called dev where you add the development requirements. Preferably with a version attached to it, so you know that Jenkins and the local dev environments are the same.

Happy to migrate this to the setup.py if you like.

nbsphinx
2 changes: 1 addition & 1 deletion docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ See also https://github.com/sphinx-doc/sphinx/issues/7551.
-->

```sh
$ sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme
$ sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx
```

## Generating the Documentation HTML
Expand Down
14 changes: 13 additions & 1 deletion python/docs/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -45,8 +45,20 @@
'sphinx.ext.viewcode',
'sphinx.ext.mathjax',
'sphinx.ext.autosummary',
'nbsphinx', # Converts Jupyter Notebook to reStructuredText files for Sphinx.
# For ipython directive in reStructuredText files. It is generated by the notebook.
'IPython.sphinxext.ipython_console_highlighting'
]

# Links used globally in the RST files.
# These are defined here to allow link substitutions dynamically.
rst_epilog = """
.. |binder| replace:: Live Notebook
.. _binder: https://mybinder.org/v2/gh/apache/spark/{0}?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb
.. |examples| replace:: Examples
.. _examples: https://github.com/apache/spark/tree/{0}/examples/src/main/python
""".format(os.environ.get("RELEASE_TAG", "master"))

# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']

Expand Down Expand Up @@ -84,7 +96,7 @@

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['_build']
exclude_patterns = ['_build', '.DS_Store', '**.ipynb_checkpoints']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to include the .DS_Store in here? Feels like nasty to add OS-specific files here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an excluding pattern :D.


# The reST default role (used for this markup: `text`) to use for all
# documents.
Expand Down
4 changes: 4 additions & 0 deletions python/docs/source/getting_started/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,3 +20,7 @@
Getting Started
===============

.. toctree::
:maxdepth: 2

quickstart
Loading