Using SparkR
Problem Statement
Big data analytics allows you to analyse data at scale. It has applications in almost every industry in the world. Let’s consider an unconventional application that you wouldn’t ordinarily encounter.
New York City is a thriving metropolis. Just like most other metros of that size, one of the biggest problems its citizens face is parking. The classic combination of a large number of cars and a cramped geography is the exact recipe for a huge number of parking tickets.
In an attempt to scientifically analyse this phenomenon, the NYC Police Department has collected data on parking tickets. Of these, the data files from 2014 to 2017 are publicly available on Kaggle. We will try to perform some exploratory analysis on this data. Spark will allow us to analyse the full files at high speed, as opposed to taking a series of random samples that would only approximate the population.
For the scope of this analysis, we wish to compare parking-ticket trends across three different years - 2015, 2016 and 2017. All the analysis steps mentioned below should be performed for each of the three years, and each metric you derive should be compared across them. Use the fiscal years as given in the files.
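The year-by-year workflow above can be sketched in SparkR as follows. This is a minimal sketch, not the prescribed solution: the file names follow the Kaggle dataset's naming convention but are assumptions, and the session settings should be adapted to your own environment.

```r
# Minimal SparkR sketch: load each fiscal-year file and derive one example
# metric (total ticket count) to compare across years.
library(SparkR)

# Initialise a Spark session (use your cluster's master URL in practice)
sparkR.session(appName = "NYCParkingTickets")

years <- c(2015, 2016, 2017)
for (year in years) {
  # Assumed file-naming convention from the Kaggle dataset; adjust as needed
  path <- paste0("Parking_Violations_Issued_-_Fiscal_Year_", year, ".csv")
  tickets <- read.df(path, source = "csv",
                     header = "true", inferSchema = "true")

  # Example metric: total number of tickets issued in the fiscal year.
  # Repeat this pattern for every metric you derive, then compare the
  # results across the three years.
  cat(year, ":", count(tickets), "tickets\n")
}

sparkR.stop()
```

Reading the full files through `read.df` keeps the data distributed across the cluster, which is what makes analysing the complete population feasible rather than resorting to samples.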
Note: Although the broad goal of any analysis of this type would indeed be better parking and fewer tickets, we are not looking for recommendations on how to reduce the number of parking tickets - there are no specific points reserved for this.
The purpose of this case study is to conduct an exploratory data analysis that helps you understand the data. Since the dataset is large, your queries will take some time to run, so aim to identify the correct queries quickly.