Crime Prediction in Vancouver

Contributors: Ramiro Francisco Mejia, Jasmine Ortega, Thomas Siu, Shi Yan Wang

This is the data analysis project of group 24 (Cohort 6, 2022) for DSCI 522 (Data Science Workflows), a course in the Master of Data Science program at the University of British Columbia.

Our Motivation

The well-known science fiction story The Minority Report depicts a future world in which police use technology to predict crimes and arrest criminals before the crimes happen. Setting aside the ethical debates around such arrests, we believe it is still valuable to identify and predict crimes for good purposes. For example, can we predict the kinds of crime that are likely to happen in a given neighbourhood at a given time? Police would then be able to strengthen specific skill sets to respond to such offences. Local government officials could also use the predictions to adjust related policies.

We are going to build a classification model to predict the types of crime that happen in Vancouver, based on the location and the time of day. The data set used in the project originates from the Vancouver Police Department (VPD) [@Data] and is called GEODASH OPEN DATA. The data can be found here. The data set records the types of crime reported in different areas of Vancouver at particular times from 2003 to 2021. Since the VPD updates the data every week, we cut the data off at December 2020 to ensure that our analysis and model are reproducible.

The prediction model uses date features (YEAR, MONTH, DAY, HOUR, MINUTE) and location features (HUNDRED_BLOCK, NEIGHBOURHOOD, X, Y) to predict the types of crime that occur in Vancouver.
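As a rough illustration of the December 2020 cut-off and the feature groups listed above, the following sketch filters records by YEAR and separates the inputs from the label. It is not the project's code: the target column name `TYPE`, the helper names `apply_cutoff` and `split_features_target`, and the sample records are all assumptions made for the example.

```python
DATE_FEATURES = ["YEAR", "MONTH", "DAY", "HOUR", "MINUTE"]
LOCATION_FEATURES = ["HUNDRED_BLOCK", "NEIGHBOURHOOD", "X", "Y"]
TARGET = "TYPE"  # assumption: name of the crime-type column


def apply_cutoff(rows, last_year=2020):
    """Keep records up to and including December of `last_year`."""
    return [row for row in rows if int(row["YEAR"]) <= last_year]


def split_features_target(rows):
    """Separate the model inputs from the crime-type label."""
    names = DATE_FEATURES + LOCATION_FEATURES
    features = [{name: row[name] for name in names} for row in rows]
    target = [row[TARGET] for row in rows]
    return features, target


# Made-up records for illustration only.
sample_rows = [
    {"TYPE": "Theft", "YEAR": 2019, "MONTH": 5, "DAY": 12, "HOUR": 14,
     "MINUTE": 30, "HUNDRED_BLOCK": "10XX EXAMPLE ST",
     "NEIGHBOURHOOD": "Example A", "X": 0.0, "Y": 0.0},
    {"TYPE": "Mischief", "YEAR": 2020, "MONTH": 12, "DAY": 31, "HOUR": 23,
     "MINUTE": 59, "HUNDRED_BLOCK": "20XX EXAMPLE AVE",
     "NEIGHBOURHOOD": "Example B", "X": 1.0, "Y": 1.0},
    {"TYPE": "Theft", "YEAR": 2021, "MONTH": 1, "DAY": 1, "HOUR": 0,
     "MINUTE": 0, "HUNDRED_BLOCK": "30XX EXAMPLE RD",
     "NEIGHBOURHOOD": "Example A", "X": 2.0, "Y": 2.0},
]

kept = apply_cutoff(sample_rows)
features, target = split_features_target(kept)
print(len(kept), target)
```

Filtering on a fixed year keeps the input stable even though the upstream data set continues to grow each week.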

To construct a meaningful prediction model, we examine the association between crime types and areas in Vancouver. We also study the trend in the number of crimes committed over the years in order to improve the model's predictive capability.

Analysis and Prediction

First, we downloaded the raw data from the Vancouver Police Department, then cleaned and normalized it. In particular, we normalized the HOUR feature, which contained an extremely high number of examples at hour 00:00, by spreading those examples across the 24 hours. We then split the data into training data (80%) and test data (20%). After that, we performed an initial EDA on the training data. For example, we summarised the number of crimes committed by location over the years in a table. We also produced a correlation chart of the data features against the crime types. A detailed EDA report that includes the other EDA results was generated. It can be found here.
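The HOUR normalization and the 80/20 split can be sketched as follows. The text does not say exactly how the 00:00 examples were spread, so this sketch assumes one simple approach: each record at hour 0 is reassigned to a uniformly random hour. The function names and the seed value are hypothetical, and the project itself may use a different method.

```python
import random


def spread_midnight_hours(hours, seed=522):
    """Reassign every hour-0 entry to a uniformly random hour in 0..23.

    Assumption: uniform redistribution; other hours are left untouched.
    """
    rng = random.Random(seed)
    return [rng.randrange(24) if hour == 0 else hour for hour in hours]


def train_test_split_80_20(rows, seed=522):
    """Shuffle a copy of `rows` and split it 80% train / 20% test."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.8)
    return shuffled[:cut], shuffled[cut:]


# Made-up hours with an artificial spike at 00:00.
hours = [0] * 100 + [5, 13]
spread = spread_midnight_hours(hours)

rows = list(range(10))
train, test = train_test_split_80_20(rows)
print(len(train), len(test))
```

Fixing the seed keeps both steps repeatable, which matters when the EDA is meant to look only at the training portion.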

After the EDA, we applied supervised machine learning methods for the prediction. First, we created a column transformer object, which transformed the data into a format that the models could process. Since our prediction task is multi-class classification, we fit the data to four models: Dummy Classifier, Logistic Regression, Random Forest Classifier and Ridge Classifier. We selected the F1 score as our benchmark metric because it balances false positive and false negative errors. After fitting all the models, we selected Logistic Regression as the best-performing one and tuned its hyperparameters. The resulting best model was then used to predict and score on the test data. Results were collected as a confusion matrix and a classification report to assess the performance on each target class.
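To make the choice of metric concrete, here is a small from-scratch sketch of a multi-class F1 score. The text does not state which averaging scheme was used, so macro averaging (the unweighted mean of per-class F1 scores) is an assumption here, and the labels are made up. In practice a library routine such as scikit-learn's `f1_score` would normally be used instead of hand-written code.

```python
def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1 score."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        # F1 is the harmonic mean of precision and recall.
        scores[label] = 2 * precision * recall / total if total else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Made-up labels for illustration only.
y_true = ["Theft", "Theft", "Mischief", "Mischief"]
y_pred = ["Theft", "Mischief", "Mischief", "Mischief"]
print(per_class_f1(y_true, y_pred), macro_f1(y_true, y_pred))
```

Because precision penalizes false positives and recall penalizes false negatives, a class scores well on F1 only when both error types are low.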

The following flow chart illustrates the overall steps:

Figure 1. Flow chart of the analysis process

Report

We publish the report in HTML and Markdown formats. It includes detailed analysis results supported by figures and tables. The final report can be found here.

Usage

To replicate the analysis and run the predictor, first fork the repository to your personal account and clone it to your local environment. Then follow one of the sets of instructions below:

Docker

Note for Mac M1 users: There are known compatibility issues when running Docker on Mac M1. The usual commands can be run if you increase the resources allocated in Docker Desktop, but doing so severely impacts the machine’s performance. To run, add the argument --platform linux/amd64 to the docker run commands.

Run analysis and render report

docker run --rm -it -v "$(pwd)":/home/jovyan/work hktomy/crime_predictor:latest make all

Clean all files

docker run --rm -it -v "$(pwd)":/home/jovyan/work hktomy/crime_predictor:latest make clean

Run jupyter lab

docker run --rm -p 8888:8888 -v "$(pwd)":/home/jovyan/work hktomy/crime_predictor:latest

Conda environment

To run without Docker, use the following commands:

conda env create -f crime_predictor.yaml
conda activate crime_predictor

The conda environment file is here for reference.

R

Download the latest version of R at https://cran.r-project.org and follow the installer instructions.
The make command in the next step may fail with the following pandoc error:

Error: pandoc version 1.12.3 or higher is required and was not found (see the help page ?rmarkdown::pandoc_available).
Execution halted

Add an environment variable RSTUDIO_PANDOC to your .bash_profile that points to the pandoc directory. For example: /Applications/RStudio.app/Contents/MacOS/pandoc

Alternatively, install pandoc on your system using the command conda install pandoc
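Before running make, it can help to check whether a pandoc executable is discoverable at all. The sketch below looks first in the directory named by RSTUDIO_PANDOC and then on PATH. The helper name `locate_pandoc` is hypothetical, and this only mirrors the lookup described above in a simplified way; it is not how rmarkdown itself searches for pandoc.

```python
import os
import shutil


def locate_pandoc(environ=None):
    """Return the path of a pandoc executable, or None if none is found.

    Checks the RSTUDIO_PANDOC directory first, then the PATH entries.
    """
    if environ is None:
        environ = os.environ
    rstudio_dir = environ.get("RSTUDIO_PANDOC")
    if rstudio_dir:
        candidate = shutil.which("pandoc", path=rstudio_dir)
        if candidate:
            return candidate
    return shutil.which("pandoc", path=environ.get("PATH"))


found = locate_pandoc()
print(found if found else "pandoc not found; set RSTUDIO_PANDOC or install pandoc")
```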

Analysis execution

Execute the data analysis pipeline of the Crime Vancouver data set by running the following command in terminal from the root directory of this project:

make all

To reset the repository to a clean state, with no intermediate or results files, execute the following command in terminal from the root directory of this project:

make clean

Dependencies

To replicate the analysis without conda, install the following dependencies:

Python 3.9 and Python packages:
- docopt=0.6.2
- ipykernel
- ipython=7.29.0
- vega_datasets
- altair_saver
- matplotlib>=3.2.2
- requests>=2.24.0
- scikit-learn>=1.0
- pandas>=1.3.*
- pip
- rpy2
- dataframe-image

R version 4.1.1 and R packages:
- knitr
- tidyverse

Other packages:
- pandoc

Dependency Diagram of Makefile

A dependency diagram of the Makefile is attached here:

You can also find the link here.

Mac M1 specific considerations

Because the default installation of R and RStudio targets arm64, it is not compatible with the Python package rpy2 when executing R scripts together with Python in a Jupyter notebook. To resolve this, refer to the steps in this issue.

Legal Disclaimer (Data set)

See here for the legal disclaimer on using the data set.
