[SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option #31028

10 changes: 5 additions & 5 deletions python/docs/source/getting_started/install.rst
@@ -48,27 +48,27 @@ If you want to install extra dependencies for a specific component, you can inst

     pip install pyspark[sql]

-For PySpark with/without a specific Hadoop version, you can install it by using the ``HADOOP_VERSION`` environment variable as below:
+For PySpark with/without a specific Hadoop version, you can install it by using the ``PYSPARK_HADOOP_VERSION`` environment variable as below:

 .. code-block:: bash

-    HADOOP_VERSION=2.7 pip install pyspark
+    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark

 The default distribution uses Hadoop 3.2 and Hive 2.3. If users specify different versions of Hadoop, the pip installation automatically
 downloads a different version and uses it in PySpark. Downloading it can take a while depending on
 the network and the mirror chosen. ``PYSPARK_RELEASE_MIRROR`` can be set to manually choose the mirror for faster downloading.

 .. code-block:: bash

-    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org HADOOP_VERSION=2.7 pip install pyspark
+    PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=2.7 pip install pyspark

 It is recommended to use the ``-v`` option in ``pip`` to track the installation and download status.

 .. code-block:: bash

-    HADOOP_VERSION=2.7 pip install pyspark -v
+    PYSPARK_HADOOP_VERSION=2.7 pip install pyspark -v

-Supported values in ``HADOOP_VERSION`` are:
+Supported values in ``PYSPARK_HADOOP_VERSION`` are:

 - ``without``: Spark pre-built with user-provided Apache Hadoop
 - ``2.7``: Spark pre-built for Apache Hadoop 2.7
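As a rough illustration of how a value of the renamed variable relates to the supported set above, here is a minimal Python sketch. It is an assumption for demonstration only, not Spark's actual validation code; the ``"3.2"`` entry is inferred from the documented default and the list is truncated in this diff, so the real set may differ.

    # Hypothetical check of PYSPARK_HADOOP_VERSION against the documented values.
    import os

    SUPPORTED_HADOOP_VERSIONS = {"without", "2.7", "3.2"}  # "3.2" assumed from the default

    hadoop_version = os.environ.get("PYSPARK_HADOOP_VERSION", "3.2").lower()
    if hadoop_version not in SUPPORTED_HADOOP_VERSIONS:
        raise RuntimeError(
            "Unsupported PYSPARK_HADOOP_VERSION: %s. Use one of: %s"
            % (hadoop_version, ", ".join(sorted(SUPPORTED_HADOOP_VERSIONS))))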
2 changes: 1 addition & 1 deletion python/pyspark/find_spark_home.py
@@ -36,7 +36,7 @@ def is_spark_home(path):
            (os.path.isdir(os.path.join(path, "jars")) or
             os.path.isdir(os.path.join(path, "assembly"))))

-# Spark distribution can be downloaded when the HADOOP_VERSION environment variable is set.
+# Spark distribution can be downloaded when the PYSPARK_HADOOP_VERSION environment variable is set.
 # We should look up this directory first, see also SPARK-32017.
 spark_dist_dir = "spark-distribution"
 paths = [
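To show why the lookup order matters, here is a hedged sketch of the idea behind this hunk. It is not the real ``find_spark_home.py``; the helper name ``candidate_spark_homes`` is hypothetical, but the ``spark-distribution`` directory name and its priority come from the diff above.

    # Hypothetical helper: the pip-downloaded Spark (triggered by
    # PYSPARK_HADOOP_VERSION) lands in "spark-distribution", so that
    # directory is consulted before the bundled package path itself.
    import os

    def candidate_spark_homes(module_home):
        spark_dist_dir = "spark-distribution"
        return [
            os.path.join(module_home, spark_dist_dir),
            module_home,
        ]

    print(candidate_spark_homes("/usr/local/lib/python3.8/site-packages/pyspark"))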
14 changes: 7 additions & 7 deletions python/setup.py
@@ -125,16 +125,16 @@ def run(self):
         spark_dist = os.path.join(self.install_lib, "pyspark", "spark-distribution")
         rmtree(spark_dist, ignore_errors=True)

-        if ("HADOOP_VERSION" in os.environ) or ("HIVE_VERSION" in os.environ):
-            # Note that the SPARK_VERSION environment variable is just for testing purposes.
-            # The HIVE_VERSION environment variable is also internal for now in case
+        if ("PYSPARK_HADOOP_VERSION" in os.environ) or ("PYSPARK_HIVE_VERSION" in os.environ):
+            # Note that the PYSPARK_VERSION environment variable is just for testing purposes.
+            # The PYSPARK_HIVE_VERSION environment variable is also internal for now in case
             # we support another version of Hive in the future.
             spark_version, hadoop_version, hive_version = install_module.checked_versions(
-                os.environ.get("SPARK_VERSION", VERSION).lower(),
-                os.environ.get("HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
-                os.environ.get("HIVE_VERSION", install_module.DEFAULT_HIVE).lower())
+                os.environ.get("PYSPARK_VERSION", VERSION).lower(),
+                os.environ.get("PYSPARK_HADOOP_VERSION", install_module.DEFAULT_HADOOP).lower(),
+                os.environ.get("PYSPARK_HIVE_VERSION", install_module.DEFAULT_HIVE).lower())

-            if ("SPARK_VERSION" not in os.environ and
+            if ("PYSPARK_VERSION" not in os.environ and
                     ((install_module.DEFAULT_HADOOP, install_module.DEFAULT_HIVE) ==
                      (hadoop_version, hive_version))):
                 # Do not download and install if they are the same as the default.
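For readers who want the resolution logic in one self-contained piece, here is a hedged sketch of the version-resolution step shown in the hunk above. ``DEFAULT_HADOOP``, ``DEFAULT_HIVE``, and ``VERSION`` are stand-in values (assumptions), not the actual constants from ``pyspark.install`` or ``setup.py``.

    # Minimal sketch of resolving the renamed PYSPARK_* variables with fallbacks.
    import os

    DEFAULT_HADOOP = "hadoop3.2"   # assumed default, matching the documented Hadoop 3.2
    DEFAULT_HIVE = "hive2.3"       # assumed default, matching the documented Hive 2.3
    VERSION = "3.1.0"              # assumed package version under build

    def resolve_versions():
        # Fall back to the defaults when the renamed variables are unset.
        spark_version = os.environ.get("PYSPARK_VERSION", VERSION).lower()
        hadoop_version = os.environ.get("PYSPARK_HADOOP_VERSION", DEFAULT_HADOOP).lower()
        hive_version = os.environ.get("PYSPARK_HIVE_VERSION", DEFAULT_HIVE).lower()
        return spark_version, hadoop_version, hive_version

    # Mirror the "same as default" short-circuit from the diff: skip the
    # download entirely when nothing differs from the default distribution.
    spark_version, hadoop_version, hive_version = resolve_versions()
    skip_download = ("PYSPARK_VERSION" not in os.environ and
                     (hadoop_version, hive_version) == (DEFAULT_HADOOP, DEFAULT_HIVE))
    print(spark_version, hadoop_version, hive_version, skip_download)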