Description
Opened on Apr 20, 2022
Solution to issue cannot be found in the documentation.
- [x] I checked the documentation.
Issue
I have been unable to get conda-forge pyspark working out of the box on Windows, and have spent a couple of days figuring out what is going wrong. I am not versed enough to make a PR myself, nor confident enough that this problem affects everyone to justify one. Regardless, I hope the information here is useful to the devs, or at least to people like me who are having trouble getting it working.
My process for installing pyspark locally:
- install miniconda (local user)
- open the Miniconda prompt and run:
- `conda create -n pyspark_env`
- `conda activate pyspark_env`
- `conda install -c conda-forge pyspark openjdk`
- `conda install findspark`
- then follow the steps below to fix the resulting suite of errors
There are four main issues:
- Java is not listed as a dependency of pyspark, which results in a "Java not found" error when launching pyspark.
  - Running `conda install openjdk` before or after installing pyspark does the trick; a quick sanity check is sketched after this list.
- winutils.exe is missing from SPARK_HOME (C:\Users\XXXXXXXX\Miniconda3\envs\pyspark_env\Lib\site-packages\pyspark\bin). This produces a warning when pyspark is run in the shell: "Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries"
  - Given the version of Hadoop installed with pyspark, I downloaded winutils.exe from https://github.com/cdarlint/winutils/blob/master/hadoop-2.7.3/bin/winutils.exe and put it in that directory.
  - A system environment variable (not a user one, despite my Miniconda being installed for the local user) called HADOOP_HOME had to be created and set to the same value as SPARK_HOME (obviously this won't survive switching virtual environments, but you get the idea; a per-session alternative is sketched after this list).
- findspark is required to link Python to Spark. Without first running `import findspark; findspark.init()`, some pyspark commands throw "Python worker failed to connect back" (a minimal working session follows this list).
- Spark 2.4 (installed with pyspark) has a bug that surfaces on Windows: some pyspark commands fail with a ModuleNotFoundError for `resource`.
  - the following changes need to be applied: https://github.com/apache/spark/pull/23055/files#diff-17ed18489a956f326ec0fe4040850c5bc9261d4631fb42da4c52891d74a59180
  - apply them to worker.py in SPARK_HOME, and inside SPARK_HOME/python/lib/pyspark.zip (the gist of the change is sketched below)
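For issue 1, here is a quick sanity check (my own, not from any docs) that the Java being picked up is the one from the activated environment:

```python
import shutil
import subprocess

# After `conda install openjdk`, java should resolve inside the active env.
java = shutil.which("java")
assert java is not None, "java not on PATH -- is openjdk installed in this env?"
print("java found at:", java)  # expect a path under ...\envs\pyspark_env\...
subprocess.run([java, "-version"], check=True)
```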
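For issue 2, instead of a system-wide HADOOP_HOME, the variable can be set per session before Spark starts. A minimal sketch of my workaround, assuming winutils.exe has already been copied into the pyspark bin directory as described above:

```python
import os
import pyspark

# For a conda install, SPARK_HOME is the pyspark package directory itself.
spark_home = os.path.dirname(pyspark.__file__)
os.environ["HADOOP_HOME"] = spark_home
# winutils.exe must be reachable via %HADOOP_HOME%\bin, so extend PATH too.
os.environ["PATH"] = os.path.join(spark_home, "bin") + os.pathsep + os.environ["PATH"]
```

This has to run before the first SparkSession/SparkContext is created, since the JVM reads the variable at startup.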
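For issue 3, the minimal session that works for me once findspark is in place (local mode; the `master` setting is my own choice):

```python
import findspark
findspark.init()  # must come before any pyspark import

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
print(spark.range(5).count())  # one of the commands that previously failed
spark.stop()
```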
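For issue 4, the gist of the linked PR, as I read it, is to guard the Unix-only `resource` import so Windows skips it. An excerpt-style sketch (not a verbatim copy of the patch):

```python
# resource is a Unix-only stdlib module; importing it on Windows raises
# ModuleNotFoundError, so worker.py records whether it is available.
has_resource_module = True
try:
    import resource
except ImportError:
    has_resource_module = False

# ...later, the memory-limit logic only runs when the module exists:
if has_resource_module:
    # the resource.getrlimit/setrlimit calls from the original worker.py go here
    pass
```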
I'm happy to elaborate or provide clearer errors/steps as needed.
Installed packages
# packages in environment at C:\Users\XXXXXXXX\Miniconda3\envs\pyspark_env:
#
# Name Version Build Channel
argon2-cffi 20.1.0 py36h2bbff1b_1
async_generator 1.10 py36h28b3542_0
attrs 21.4.0 pyhd3eb1b0_0
backcall 0.2.0 pyhd3eb1b0_0
bleach 4.1.0 pyhd3eb1b0_0
ca-certificates 2022.3.29 haa95532_0
certifi 2021.5.30 py36haa95532_0
cffi 1.14.6 py36h2bbff1b_0
colorama 0.4.4 pyhd3eb1b0_0
decorator 5.1.1 pyhd3eb1b0_0
defusedxml 0.7.1 pyhd3eb1b0_0
entrypoints 0.3 py36_0
findspark 2.0.1 pyhd8ed1ab_0 conda-forge
icu 58.2 ha925a31_3
intel-openmp 2022.0.0 h57928b3_3663 conda-forge
ipykernel 5.3.4 py36h5ca1d4c_0
ipython 7.16.1 py36h5ca1d4c_0
ipython_genutils 0.2.0 pyhd3eb1b0_1
ipywidgets 7.6.5 pyhd3eb1b0_1
jedi 0.17.0 py36_0
jinja2 3.0.3 pyhd3eb1b0_0
jpeg 9d h2bbff1b_0
jsonschema 3.0.2 py36_0
jupyter 1.0.0 py36_7
jupyter_client 7.1.2 pyhd3eb1b0_0
jupyter_console 6.4.3 pyhd3eb1b0_0
jupyter_core 4.8.1 py36haa95532_0
jupyterlab_pygments 0.1.2 py_0
jupyterlab_widgets 1.0.0 pyhd3eb1b0_1
libblas 3.9.0 14_win64_mkl conda-forge
libcblas 3.9.0 14_win64_mkl conda-forge
liblapack 3.9.0 14_win64_mkl conda-forge
libpng 1.6.37 h2a8f88b_0
m2w64-gcc-libgfortran 5.3.0 6
m2w64-gcc-libs 5.3.0 7
m2w64-gcc-libs-core 5.3.0 7
m2w64-gmp 6.1.0 2
m2w64-libwinpthread-git 5.0.0.4634.697f757 2
markupsafe 2.0.1 py36h2bbff1b_0
mistune 0.8.4 py36he774522_0
mkl 2022.0.0 h0e2418a_796 conda-forge
msys2-conda-epoch 20160418 1
nbclient 0.5.3 pyhd3eb1b0_0
nbconvert 6.0.7 py36_0
nbformat 5.1.3 pyhd3eb1b0_0
nest-asyncio 1.5.1 pyhd3eb1b0_0
notebook 6.4.3 py36haa95532_0
numpy 1.19.5 py36h4b40d73_2 conda-forge
openjdk 11.0.13 h2bbff1b_0
openssl 1.1.1n h2bbff1b_0
packaging 21.3 pyhd3eb1b0_0
pandas 0.25.3 py36he350917_0 conda-forge
pandoc 2.12 haa95532_0
pandocfilters 1.5.0 pyhd3eb1b0_0
parso 0.8.3 pyhd3eb1b0_0
pickleshare 0.7.5 pyhd3eb1b0_1003
pip 20.0.2 py36_1 conda-forge
prometheus_client 0.13.1 pyhd3eb1b0_0
prompt-toolkit 3.0.20 pyhd3eb1b0_0
prompt_toolkit 3.0.20 hd3eb1b0_0
py4j 0.10.8.1 py36_0
pycparser 2.21 pyhd3eb1b0_0
pygments 2.11.2 pyhd3eb1b0_0
pyparsing 3.0.4 pyhd3eb1b0_0
pyqt 5.9.2 py36h6538335_2
pyrsistent 0.17.3 py36he774522_0
pyspark 2.4.0 py36_1000 conda-forge
python 3.6.15 h39d44d4_0_cpython conda-forge
python-dateutil 2.8.2 pyhd8ed1ab_0 conda-forge
python_abi 3.6 2_cp36m conda-forge
pytz 2022.1 pyhd8ed1ab_0 conda-forge
pywin32 228 py36hbaba5e8_1
pywinpty 0.5.7 py36_0
pyzmq 22.2.1 py36hd77b12b_1
qt 5.9.7 vc14h73c81de_0
qtconsole 5.2.2 pyhd3eb1b0_0
qtpy 2.0.1 pyhd3eb1b0_0
send2trash 1.8.0 pyhd3eb1b0_1
setuptools 49.6.0 py36ha15d459_3 conda-forge
sip 4.19.8 py36h6538335_0
six 1.16.0 pyh6c4a22f_0 conda-forge
sqlite 3.38.2 h2bbff1b_0
tbb 2021.5.0 h2d74725_1 conda-forge
terminado 0.9.4 py36haa95532_0
testpath 0.5.0 pyhd3eb1b0_0
tornado 6.1 py36h2bbff1b_0
traitlets 4.3.3 py36haa95532_0
ucrt 10.0.20348.0 h57928b3_0 conda-forge
vc 14.2 hb210afc_6 conda-forge
vs2015_runtime 14.29.30037 h902a5da_6 conda-forge
wcwidth 0.2.5 pyhd3eb1b0_0
webencodings 0.5.1 py36_1
wheel 0.37.1 pyhd8ed1ab_0 conda-forge
widgetsnbextension 3.5.1 py36_0
wincertstore 0.2 py36ha15d459_1006 conda-forge
winpty 0.4.3 4
zlib 1.2.12 h8cc25b3_1
Environment info
active environment : pyspark_env
active env location : C:\Users\XXXXXXXX\Miniconda3\envs\pyspark_env
shell level : 2
user config file : C:\Users\XXXXXXXX\.condarc
populated config files :
conda version : 4.12.0
conda-build version : not installed
python version : 3.9.7.final.0
virtual packages : __win=0=0
__archspec=1=x86_64
base environment : C:\Users\XXXXXXXX\Miniconda3 (writable)
conda av data dir : C:\Users\XXXXXXXX\Miniconda3\etc\conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/win-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/win-64
https://repo.anaconda.com/pkgs/r/noarch
https://repo.anaconda.com/pkgs/msys2/win-64
https://repo.anaconda.com/pkgs/msys2/noarch
package cache : C:\Users\XXXXXXXX\Miniconda3\pkgs
C:\Users\XXXXXXXX\.conda\pkgs
C:\Users\XXXXXXXX\AppData\Local\conda\conda\pkgs
envs directories : C:\Users\XXXXXXXX\Miniconda3\envs
C:\Users\XXXXXXXX\.conda\envs
C:\Users\XXXXXXXX\AppData\Local\conda\conda\envs
platform : win-64
user-agent : conda/4.12.0 requests/2.27.1 CPython/3.9.7 Windows/10 Windows/10.0.19043
administrator : False
netrc file : None
offline mode : False