Chicago crime dataset analysis


This notebook is a Spark and Python learner's attempt at performing data analysis on a real-world data set.

In this notebook, I am capriciously mixing Spark, Pandas, Matplotlib, and Seaborn without any strict division of purpose. The goals are:

  • Read, transform, and query the data using Apache Spark (a minimal sketch follows this list)
  • Visualize using existing Python libraries. Seaborn does most of the plotting for now; plain Matplotlib will take over once I know how to do with it everything I currently use Seaborn for.
  • Where interoperation between Spark and Matplotlib is a hindrance, fall back on Pandas and NumPy.
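To make that concrete, here is a minimal sketch of the pattern the notebook follows. The file name chicago_crimes.csv and the "Primary Type" column are assumptions based on the Chicago crime data; adjust them to the actual files from the archive.

from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import seaborn as sns

spark = SparkSession.builder.getOrCreate()

# Read and query with Spark (file name and column name are assumptions).
crimes = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("chicago_crimes.csv"))

top_types = (crimes.groupBy("Primary Type")
             .count()
             .orderBy("count", ascending=False)
             .limit(10))

# Hand the (small) aggregated result over to Pandas for plotting.
top_types_pd = top_types.toPandas()
sns.barplot(x="count", y="Primary Type", data=top_types_pd)
plt.tight_layout()
plt.show()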

This will evolve, and I hope that a few weeks from now it will look much better.


Where to find the data?

This dataset can be found either at the original dataset website, https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e, or on Kaggle at https://www.kaggle.com/djonafegnem/chicago-crime-data-analysis (I used the latter, as it comes as a compressed archive).


How to run this and what to run it on?


I wrote this on Apache Spark 2.3.0. The entire notebook was executed on a single machine using the pyspark shell without problems.

Spark can be downloaded from https://spark.apache.org/downloads.html. To run this notebook, you also need Pandas, NumPy, Matplotlib, and a few other libraries installed. For me, all of that was sorted out by Anaconda: https://www.anaconda.com/download/

Here are some important parameters:

  • Worker threads: 4 (--master local[4])
  • Executor memory: 4G (--executor-memory 4g)
  • Driver memory: 8G (--driver-memory 8g)

This much memory may not be strictly necessary, but several data frames are cached, and performance degrades markedly once Spark starts evicting cached partitions for lack of memory.
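For reference, the caching follows the standard PySpark pattern; a minimal sketch, where crimes is the hypothetical DataFrame from the read shown earlier:

crimes.cache()   # mark the DataFrame for storage in executor memory
crimes.count()   # run an action so the cache is actually materialized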

So, the command with which the notebook was launched is:

pyspark --driver-memory 8g --executor-memory 4g --master local[4]
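If you prefer launching from a plain Python script rather than the pyspark shell, the same settings can, as far as I know, be passed when building the session. Note that spark.driver.memory only takes effect if it is set before the JVM starts, i.e. before any other session exists in the process.

from pyspark.sql import SparkSession

# A sketch of the equivalent programmatic configuration; this must run
# before any other SparkSession (and hence the JVM) has been started.
spark = (SparkSession.builder
         .master("local[4]")
         .config("spark.driver.memory", "8g")
         .config("spark.executor.memory", "4g")
         .getOrCreate())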