Using SparkR
Problem Statement
Big data analytics allows you to analyse data at scale. It has applications in almost every industry in the world. Let’s consider an unconventional application that you wouldn’t ordinarily encounter.
New York City is a thriving metropolis. Just like most other metros of that size, one of the biggest problems its citizens face is parking. The classic combination of a large number of cars and a cramped geography is the exact recipe for a huge number of parking tickets.
In an attempt to scientifically analyse this phenomenon, the NYC Police Department has collected data on parking tickets. Of these, the data files from 2014 to 2017 are publicly available on Kaggle. We will try to perform some exploratory analysis on this data. Spark will allow us to analyse the full files at high speed, as opposed to taking a series of random samples that would only approximate the population.
For the scope of this analysis, we wish to compare parking-ticket trends across three different years - 2015, 2016 and 2017. All the analysis steps mentioned below should be performed for each of the three years, and each metric you derive should be compared across them. Use the fiscal years as given in the files.
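The year-by-year workflow above can be sketched in SparkR as follows. This is a minimal sketch, not the prescribed solution: the file names follow the Kaggle dataset's naming convention but are assumptions, and the session settings should be adapted to your own environment.

```r
# Minimal SparkR sketch: load each fiscal-year file and derive one example
# metric (total ticket count) to compare across years.
library(SparkR)

# Initialise a Spark session (use your cluster's master URL in practice)
sparkR.session(appName = "NYCParkingTickets")

years <- c(2015, 2016, 2017)
for (year in years) {
  # Assumed file-naming convention from the Kaggle dataset; adjust as needed
  path <- paste0("Parking_Violations_Issued_-_Fiscal_Year_", year, ".csv")
  tickets <- read.df(path, source = "csv",
                     header = "true", inferSchema = "true")

  # Example metric: total number of tickets issued in the fiscal year.
  # Repeat this pattern for every metric you derive, then compare the
  # results across the three years.
  cat(year, ":", count(tickets), "tickets\n")
}

sparkR.stop()
```

Reading the full files through `read.df` keeps the data distributed across the cluster, which is what makes analysing the complete population feasible rather than resorting to samples.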
Note: Although the broad goal of any analysis of this type would indeed be better parking and fewer tickets, we are not looking for recommendations on how to reduce the number of parking tickets - there are no specific points reserved for this.
The purpose of this case study is to conduct an exploratory data analysis that helps you understand the data. Since the dataset is large, your queries will take some time to run, so aim to identify the correct queries quickly.