suggested improvements to resolve java, hadoop, python and resource errors #34

Open

Description

Solution to issue cannot be found in the documentation.

  • I checked the documentation.

Issue

I have been unable to get conda-forge pyspark working out of the box, and have spent a couple of days figuring out what is going wrong. I am not versed enough to make a PR myself, nor confident that the problem is widespread enough to merit one. Regardless, I hope the info I put here is useful to the devs, or at least to people like me who are having trouble getting it working.

My process for installing pyspark locally:

  1. Install Miniconda (local user).
  2. Open the Miniconda prompt and run:
  • conda create -n pyspark_env
  • conda activate pyspark_env
  • conda install -c conda-forge pyspark openjdk
  • conda install findspark
  3. Then follow the steps below, which are required to fix the suite of errors.
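
A quick way to see which pieces the environment can actually find after these steps is a small standard-library check. This is only a sketch: the function name `check_pyspark_prerequisites` is hypothetical, and treating "JAVA_HOME is set or `java` is on PATH" as "Java is available" is an assumption about how the launcher locates Java.

```python
import importlib.util
import os
import shutil


def check_pyspark_prerequisites():
    """Report whether the pieces installed in the steps above are visible.

    Uses only the standard library, so it runs even if pyspark itself is broken.
    """
    return {
        # Spark starts a JVM; look for JAVA_HOME or a `java` executable on PATH.
        "java_found": bool(os.environ.get("JAVA_HOME")) or shutil.which("java") is not None,
        # find_spec returns None when a top-level module cannot be located.
        "pyspark_found": importlib.util.find_spec("pyspark") is not None,
        "findspark_found": importlib.util.find_spec("findspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_pyspark_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'NO'}")
```

Run it from inside the activated pyspark_env so that it inspects the same interpreter and PATH that pyspark will use.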

There are four main issues:

  1. Java is not listed as a dependency of pyspark, which results in a "java not found" error when launching pyspark.
  • Running "conda install openjdk" before or after installing pyspark does the trick.
  2. winutils.exe is missing from SPARK_HOME (C:\Users\XXXXXXXX\Miniconda3\envs\pyspark_env\Lib\site-packages\pyspark\bin). This results in a WARNING when pyspark is run in a shell ("Failed to locate the winutils binary in the hadoop binary path java.io.IOException: Could not locate executable null\bin\winutils.exe in the Hadoop binaries").
  • Given the version of Hadoop installed with pyspark, I downloaded winutils.exe from here and put it in that directory: https://github.com/cdarlint/winutils/blob/master/hadoop-2.7.3/bin/winutils.exe
  • A system environment variable called HADOOP_HOME had to be created (system, not user, despite my Miniconda being installed for the local user) and set to the same value as SPARK_HOME. Obviously, this won't work when switching virtual environments, but you get the idea.
  3. findspark is required to link Python to Spark. Without first running import findspark; findspark.init(), some pyspark commands fail with the error "Python worker failed to connect back".
  4. Spark version 2.4 (installed with pyspark) has a bug that makes it fail on Windows: some pyspark commands raise a ModuleNotFoundError for "resource".
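
The winutils and "resource" items above could be checked, and the environment variables scoped to a single process instead of the whole system, with something like the following sketch. The names `build_spark_env` and `apply_spark_env` are hypothetical. It assumes `spark_home` is the directory that contains `bin`, and setting PYSPARK_PYTHON (Spark's variable for choosing the Python executable) is an assumption that does not come from the steps above.

```python
import importlib.util
import os
import sys
from pathlib import Path


def build_spark_env(spark_home, python_exe=sys.executable):
    """Return (env, problems) for the workarounds listed above; changes nothing."""
    spark_home = Path(spark_home)
    env = {
        "SPARK_HOME": str(spark_home),
        # Same value as SPARK_HOME, as in the winutils workaround.
        "HADOOP_HOME": str(spark_home),
        # Assumption: pin workers to the interpreter running this script.
        "PYSPARK_PYTHON": str(python_exe),
    }
    problems = []
    # Hadoop looks for winutils.exe in the bin directory under HADOOP_HOME.
    if not (spark_home / "bin" / "winutils.exe").is_file():
        problems.append("winutils.exe not found in " + str(spark_home / "bin"))
    # `resource` is a Unix-only standard-library module.
    if importlib.util.find_spec("resource") is None:
        problems.append("the 'resource' module is unavailable on this platform")
    return env, problems


def apply_spark_env(env):
    """Set the variables for this process and any JVM it launches."""
    os.environ.update(env)
```

Calling `apply_spark_env(env)` before starting Spark would keep HADOOP_HOME tied to the active environment, which may avoid the system-wide variable. This has not been verified against the setup in this report.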

I'm happy to elaborate or provide clearer errors/steps as needed.

Installed packages

# packages in environment at C:\Users\XXXXXXXX\Miniconda3\envs\pyspark_env:
#
# Name                    Version                   Build  Channel
argon2-cffi               20.1.0           py36h2bbff1b_1
async_generator           1.10             py36h28b3542_0
attrs                     21.4.0             pyhd3eb1b0_0
backcall                  0.2.0              pyhd3eb1b0_0
bleach                    4.1.0              pyhd3eb1b0_0
ca-certificates           2022.3.29            haa95532_0
certifi                   2021.5.30        py36haa95532_0
cffi                      1.14.6           py36h2bbff1b_0
colorama                  0.4.4              pyhd3eb1b0_0
decorator                 5.1.1              pyhd3eb1b0_0
defusedxml                0.7.1              pyhd3eb1b0_0
entrypoints               0.3                      py36_0
findspark                 2.0.1              pyhd8ed1ab_0    conda-forge
icu                       58.2                 ha925a31_3
intel-openmp              2022.0.0          h57928b3_3663    conda-forge
ipykernel                 5.3.4            py36h5ca1d4c_0
ipython                   7.16.1           py36h5ca1d4c_0
ipython_genutils          0.2.0              pyhd3eb1b0_1
ipywidgets                7.6.5              pyhd3eb1b0_1
jedi                      0.17.0                   py36_0
jinja2                    3.0.3              pyhd3eb1b0_0
jpeg                      9d                   h2bbff1b_0
jsonschema                3.0.2                    py36_0
jupyter                   1.0.0                    py36_7
jupyter_client            7.1.2              pyhd3eb1b0_0
jupyter_console           6.4.3              pyhd3eb1b0_0
jupyter_core              4.8.1            py36haa95532_0
jupyterlab_pygments       0.1.2                      py_0
jupyterlab_widgets        1.0.0              pyhd3eb1b0_1
libblas                   3.9.0              14_win64_mkl    conda-forge
libcblas                  3.9.0              14_win64_mkl    conda-forge
liblapack                 3.9.0              14_win64_mkl    conda-forge
libpng                    1.6.37               h2a8f88b_0
m2w64-gcc-libgfortran     5.3.0                         6
m2w64-gcc-libs            5.3.0                         7
m2w64-gcc-libs-core       5.3.0                         7
m2w64-gmp                 6.1.0                         2
m2w64-libwinpthread-git   5.0.0.4634.697f757               2
markupsafe                2.0.1            py36h2bbff1b_0
mistune                   0.8.4            py36he774522_0
mkl                       2022.0.0           h0e2418a_796    conda-forge
msys2-conda-epoch         20160418                      1
nbclient                  0.5.3              pyhd3eb1b0_0
nbconvert                 6.0.7                    py36_0
nbformat                  5.1.3              pyhd3eb1b0_0
nest-asyncio              1.5.1              pyhd3eb1b0_0
notebook                  6.4.3            py36haa95532_0
numpy                     1.19.5           py36h4b40d73_2    conda-forge
openjdk                   11.0.13              h2bbff1b_0
openssl                   1.1.1n               h2bbff1b_0
packaging                 21.3               pyhd3eb1b0_0
pandas                    0.25.3           py36he350917_0    conda-forge
pandoc                    2.12                 haa95532_0
pandocfilters             1.5.0              pyhd3eb1b0_0
parso                     0.8.3              pyhd3eb1b0_0
pickleshare               0.7.5           pyhd3eb1b0_1003
pip                       20.0.2                   py36_1    conda-forge
prometheus_client         0.13.1             pyhd3eb1b0_0
prompt-toolkit            3.0.20             pyhd3eb1b0_0
prompt_toolkit            3.0.20               hd3eb1b0_0
py4j                      0.10.8.1                 py36_0
pycparser                 2.21               pyhd3eb1b0_0
pygments                  2.11.2             pyhd3eb1b0_0
pyparsing                 3.0.4              pyhd3eb1b0_0
pyqt                      5.9.2            py36h6538335_2
pyrsistent                0.17.3           py36he774522_0
pyspark                   2.4.0                 py36_1000    conda-forge
python                    3.6.15          h39d44d4_0_cpython    conda-forge
python-dateutil           2.8.2              pyhd8ed1ab_0    conda-forge
python_abi                3.6                     2_cp36m    conda-forge
pytz                      2022.1             pyhd8ed1ab_0    conda-forge
pywin32                   228              py36hbaba5e8_1
pywinpty                  0.5.7                    py36_0
pyzmq                     22.2.1           py36hd77b12b_1
qt                        5.9.7            vc14h73c81de_0
qtconsole                 5.2.2              pyhd3eb1b0_0
qtpy                      2.0.1              pyhd3eb1b0_0
send2trash                1.8.0              pyhd3eb1b0_1
setuptools                49.6.0           py36ha15d459_3    conda-forge
sip                       4.19.8           py36h6538335_0
six                       1.16.0             pyh6c4a22f_0    conda-forge
sqlite                    3.38.2               h2bbff1b_0
tbb                       2021.5.0             h2d74725_1    conda-forge
terminado                 0.9.4            py36haa95532_0
testpath                  0.5.0              pyhd3eb1b0_0
tornado                   6.1              py36h2bbff1b_0
traitlets                 4.3.3            py36haa95532_0
ucrt                      10.0.20348.0         h57928b3_0    conda-forge
vc                        14.2                 hb210afc_6    conda-forge
vs2015_runtime            14.29.30037          h902a5da_6    conda-forge
wcwidth                   0.2.5              pyhd3eb1b0_0
webencodings              0.5.1                    py36_1
wheel                     0.37.1             pyhd8ed1ab_0    conda-forge
widgetsnbextension        3.5.1                    py36_0
wincertstore              0.2             py36ha15d459_1006    conda-forge
winpty                    0.4.3                         4
zlib                      1.2.12               h8cc25b3_1

Environment info

     active environment : pyspark_env
    active env location : C:\Users\XXXXXXXX\Miniconda3\envs\pyspark_env
            shell level : 2
       user config file : C:\Users\XXXXXXXX\.condarc
 populated config files :
          conda version : 4.12.0
    conda-build version : not installed
         python version : 3.9.7.final.0
       virtual packages : __win=0=0
                          __archspec=1=x86_64
       base environment : C:\Users\XXXXXXXX\Miniconda3  (writable)
      conda av data dir : C:\Users\XXXXXXXX\Miniconda3\etc\conda
  conda av metadata url : None
           channel URLs : https://repo.anaconda.com/pkgs/main/win-64
                          https://repo.anaconda.com/pkgs/main/noarch
                          https://repo.anaconda.com/pkgs/r/win-64
                          https://repo.anaconda.com/pkgs/r/noarch
                          https://repo.anaconda.com/pkgs/msys2/win-64
                          https://repo.anaconda.com/pkgs/msys2/noarch
          package cache : C:\Users\XXXXXXXX\Miniconda3\pkgs
                          C:\Users\XXXXXXXX\.conda\pkgs
                          C:\Users\XXXXXXXX\AppData\Local\conda\conda\pkgs
       envs directories : C:\Users\XXXXXXXX\Miniconda3\envs
                          C:\Users\XXXXXXXX\.conda\envs
                          C:\Users\XXXXXXXX\AppData\Local\conda\conda\envs
               platform : win-64
             user-agent : conda/4.12.0 requests/2.27.1 CPython/3.9.7 Windows/10 Windows/10.0.19043
          administrator : False
             netrc file : None
           offline mode : False