This notebook is a Spark and Python learner's attempt at performing data analysis on a real-world data set.
In this notebook, I use Spark, Pandas, Matplotlib, and Seaborn somewhat capriciously, without any strict division of purpose. The point is to:
- Perform data reading, transformation, and querying using Apache Spark.
- Visualize using existing Python libraries. Matplotlib alone will remain once I know how to do with it everything I currently use Seaborn for.
- Where interoperation between Spark and Matplotlib is a hindrance, fall back on Pandas and NumPy (a minimal sketch follows this list).
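To give a taste of that workflow, here is a minimal sketch. The CSV file name is a placeholder for whatever the Kaggle archive extracts to, and "Primary Type" is a column from the published dataset schema; adjust both to your local copy.

```python
from pyspark.sql import SparkSession
import seaborn as sns
import matplotlib.pyplot as plt

spark = SparkSession.builder.appName("chicago-crime").getOrCreate()

# Read the extracted CSV with Spark; a header row plus schema inference
# is good enough for exploration.
crimes = spark.read.csv("Chicago_Crimes_2012_to_2017.csv",  # placeholder file name
                        header=True, inferSchema=True)

# Query with Spark: the ten most frequent primary crime types.
top_types = (crimes.groupBy("Primary Type")
                   .count()
                   .orderBy("count", ascending=False)
                   .limit(10))

# Hand only the small aggregated result to Pandas, then plot with Seaborn.
pdf = top_types.toPandas()
sns.barplot(x="count", y="Primary Type", data=pdf)
plt.show()
```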
This will be evolutionary, and I hope that a few weeks from now it will look much better.
The dataset can be found either at the original source, https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e, or on Kaggle at https://www.kaggle.com/djonafegnem/chicago-crime-data-analysis (I used the latter, as it comes as a compressed archive).
I wrote this on Apache Spark 2.3.0. The entire notebook was executed without problems on a single machine using the pyspark shell.
Spark can be downloaded from https://spark.apache.org/downloads.html. To run this notebook, you will also need Pandas, NumPy, Matplotlib, and the other libraries used here. For me, all of that was sorted out by Anaconda: https://www.anaconda.com/download/
Here are some important parameters:
- Executor count: 4
- Executor memory: 4G
- Driver memory: 8G
This much memory may not be necessary, but some data frames are cached, and performance degrades markedly once the percentage of RDD partitions that stay cached drops.
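For reference, the caching pattern in question is the usual one; this is a sketch, with `crimes` standing in for any DataFrame that several later queries reuse:

```python
# Mark the DataFrame for caching; Spark materializes the cache lazily,
# so trigger it with an action such as count().
crimes.cache()
crimes.count()

# If the cached partitions stop fitting in memory, Spark evicts and
# recomputes them on access, which is exactly the slowdown described
# above. Release the memory explicitly once the frame is no longer needed.
crimes.unpersist()
```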
So, the command with which the notebook was launched is:
`pyspark --driver-memory 8g --executor-memory 4g --master local[4]`
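Equivalently, if you run the analysis as a standalone script rather than inside the pyspark shell, the same settings can be passed when the session is built. This is only a sketch; note that spark.driver.memory takes effect only if set before the driver JVM starts, which is why passing it on the pyspark command line is the safer route.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")                    # four local worker threads
         .config("spark.driver.memory", "8g")   # only effective before the JVM starts
         .config("spark.executor.memory", "4g")
         .appName("chicago-crime-analysis")
         .getOrCreate())
```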