Project: Identify Fraud from Enron Email
Machine learning is a branch of artificial intelligence (AI) that gives computers the ability to learn without being explicitly programmed. It focuses on developing programs that adapt when exposed to new data, and it brings together computer science and statistics to build predictive models.
The goal of this project is to use the given Enron data to build a predictive model that identifies an individual as a "Person of Interest" (POI). Machine learning helps to learn the emailing habits of POIs and non-POIs, find patterns in their emails, and test whether the predictive model correctly identifies an individual as a POI or not.
The complete report is available in the file identifyfraudfromenronemail_finalproject.pdf.
Dataset used:
In 2000, Enron was one of the largest companies in the United States. By 2002, it had collapsed into bankruptcy due to widespread corporate fraud. In the resulting Federal investigation, a significant amount of typically confidential information entered into the public record, including tens of thousands of emails and detailed financial data for top executives. In this project, you will play detective, and put your new skills to use by building a person of interest identifier based on financial and email data made public as a result of the Enron scandal. To assist you in your detective work, we've combined this data with a hand-generated list of persons of interest in the fraud case, which means individuals who were indicted, reached a settlement or plea deal with the government, or testified in exchange for prosecution immunity.
The Enron dataset contains 146 records with 21 features in total: 1 label (POI), 14 financial features and 6 email features. The POI value (True/False) indicates whether the individual is a POI or a non-POI. There are 18 POIs and 128 non-POIs. Some features have missing values, which are represented as 'NaN'. The POI label is never 'NaN'.
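The counts above can be checked with a short script. The following is a minimal sketch, assuming the data is loaded as a dictionary that maps each person's name to a dictionary of features, with the label stored under a key named "poi" and missing values stored as the string 'NaN'. That layout, the key names, the `summarize` helper and the three toy records are all assumptions made for illustration; they are not taken from the real dataset.

```python
# Toy records in the assumed layout: {person_name: {feature_name: value}}.
# Names and values are invented for illustration; they are not Enron data.
sample_data = {
    "PERSON A": {"poi": True, "salary": 250000, "bonus": "NaN", "to_messages": 1200},
    "PERSON B": {"poi": False, "salary": "NaN", "bonus": "NaN", "to_messages": "NaN"},
    "PERSON C": {"poi": False, "salary": 90000, "bonus": 15000, "to_messages": 300},
}


def summarize(data):
    """Return (record count, POI count, missing-value count per feature)."""
    n_poi = sum(1 for features in data.values() if features["poi"] is True)
    missing = {}
    for features in data.values():
        for name, value in features.items():
            # Missing values are assumed to be the literal string 'NaN'.
            if value == "NaN":
                missing[name] = missing.get(name, 0) + 1
    return len(data), n_poi, missing


n_records, n_poi, missing = summarize(sample_data)
print(n_records, "records,", n_poi, "POI,", n_records - n_poi, "non-POI")
for name, count in sorted(missing.items()):
    print(name, "missing in", count, "records")
```

Run against the full dataset in the same layout, this kind of summary should reproduce the 146 / 18 / 128 split and show which features are most affected by missing values.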
This project helped me to:
- Deal with an imperfect, real-world dataset
- Validate a machine learning result using test data
- Evaluate a machine learning result using quantitative metrics
- Create, select and transform features
- Compare the performance of machine learning algorithms
- Tune machine learning algorithms for maximum performance
- Communicate machine learning results clearly
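To illustrate why quantitative metrics beyond accuracy matter for this dataset, here is a minimal sketch that uses only the class balance stated above (18 POIs, 128 non-POIs). It scores a trivial classifier that labels everyone as a non-POI. The helper names `confusion_counts` and `scores` are hypothetical, and returning 0.0 for precision or recall when the denominator is zero is a convention chosen for this example. It is not necessarily how the project itself was evaluated.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    tn = sum(1 for t, p in pairs if not t and not p)
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Return (accuracy, precision, recall) for boolean labels."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: report 0.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Class balance from the dataset description: 18 POIs and 128 non-POIs.
y_true = [True] * 18 + [False] * 128
always_non_poi = [False] * len(y_true)

accuracy, precision, recall = scores(y_true, always_non_poi)
print(round(accuracy, 3), precision, recall)  # prints: 0.877 0.0 0.0
```

A model that never flags anyone still reaches about 88% accuracy here, yet finds no POIs at all. Precision and recall expose that failure, which is why they are the more informative metrics on an imbalanced dataset like this one.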