GitHub - KellybLieu/UC_Berkeley_Module17_ComparingClassifiers: Practical Application 3. Assignment 17.1. Goal is to compare the performance of the classifiers, K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines utilizing a dataset related to marketing bank products over the telephone.

UC Berkeley Module17 Comparing Classifiers By Kelly Lieu

OVERVIEW--- In this practical application, our goal is to compare the performance of the classifiers we encountered in class: K Nearest Neighbor (KNN), Logistic Regression, Decision Trees, and Support Vector Machines (SVM). We will use a dataset related to marketing bank products over the telephone.

DATA--- Our dataset comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/dataset/222/bank+marketing. The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns. The article accompanying the dataset provides more information on the data and features.

BUSINESS OBJECTIVE--- Our business objective is to leverage Business Intelligence and Data Mining tools to uncover attributes or patterns that can increase effectiveness of marketing strategies and campaigns for better key performance indicators (KPIs).

METHODOLOGY--- In this exercise, our data scientists will split the project into multiple phases. We will start with feature engineering, prepare training and test data, then build a baseline model to benchmark results. Next, we will build a simple model and use pipelines to include Logistic Regression, KNN, Decision Tree, and SVM models. We will then compare the performance of the classifiers using different metrics. Finally, our team will visualize the results and provide a recommendation to marketing managers.

OUTLINE--- This assignment consists of this README, the dataset, and a Jupyter notebook that contains the step-by-step methodology, implemented in Python, used to arrive at our findings. The notebook covers the following steps:

Reading the data
Understanding features
Understanding the tasks
Defining the business objectives
Feature engineering and preprocessing data
Setting up training and test data
Starting with a baseline model
Building a simple model
Building multiple models, comparing them by various metrics, and visualizing the results
Improving models by hyperparameter tuning and grid search
Finding the best features and providing a recommendation in the output
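The model-building and comparison steps above can be sketched as follows. This is a minimal illustration, not the notebook's code: it assumes scikit-learn, uses a synthetic imbalanced dataset from `make_classification` as a stand-in for the bank data, and keeps every model at its default settings. The step names `scale` and `model` and the column names in `results` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (assumption: not the real dataset).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.88, 0.12], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# The four classifiers named in this README, all with default hyperparameters.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
}

rows = []
for name, model in models.items():
    # Scaling lives inside the pipeline so it is fitted on training data only.
    pipe = Pipeline([("scale", StandardScaler()), ("model", model)])
    pipe.fit(X_train, y_train)
    rows.append(
        {
            "model": name,
            "train_accuracy": pipe.score(X_train, y_train),
            "test_accuracy": pipe.score(X_test, y_test),
        }
    )

# Rank the models by test accuracy, highest first.
results = pd.DataFrame(rows).sort_values("test_accuracy", ascending=False)
print(results.to_string(index=False))
```

Wrapping each classifier in the same pipeline keeps the preprocessing identical across models, so differences in accuracy come from the classifier rather than from how the data was prepared.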

INPUT VARIABLES---

//bank client data:
1 - age (numeric)
2 - job: type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital: marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')

//related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

//other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

//social and economic context attributes:
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)
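Two notes in the variable list translate directly into preprocessing choices: `duration` should be dropped for a realistic predictive model, and the value 999 in `pdays` is a sentinel rather than a real day count. The sketch below shows one possible way to handle both with pandas. The three rows are made up for illustration, only a subset of the columns is shown, and the `previously_contacted` column name is hypothetical; the notebook may prepare the data differently.

```python
import pandas as pd

# Made-up rows using a subset of the columns described above (not real records).
df = pd.DataFrame(
    {
        "age": [56, 41, 29],
        "job": ["housemaid", "admin.", "student"],
        "duration": [261, 1042, 95],
        "pdays": [999, 3, 999],
        "poutcome": ["nonexistent", "success", "nonexistent"],
        "y": ["no", "yes", "no"],
    }
)

# 999 means "not previously contacted", so expose that as an explicit flag.
df["previously_contacted"] = (df["pdays"] != 999).astype(int)

# Encode the target as 1 for 'yes' and 0 for 'no'.
y = (df["y"] == "yes").astype(int)

# Drop the target and 'duration', which is unknown before a call is made.
X = df.drop(columns=["duration", "y"])

# One-hot encode the categorical columns.
X = pd.get_dummies(X, columns=["job", "poutcome"])

print(X.columns.tolist())
```

Keeping `duration` is still reasonable for benchmark runs, as the dataset note says; the drop only matters when the model is meant to score customers before they are called.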

OUTPUT VARIABLE--- [desired target] 21 - y - has the client subscribed a term deposit? (binary: 'yes','no')

SUMMARY OF INITIAL RESULTS--- Comparing the performance of 5 classifiers, including the baseline model, the results show that the Logistic Regression classifier performed the best. This determination was based on which model had the highest Test Accuracy. The bar chart in the notebook shows the models in descending order of performance from left to right (Logistic Regression, SVM, KNN, Baseline, Decision Tree).
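This README does not describe how the baseline model was built. One common choice, shown here only as an assumption, is a majority-class predictor such as scikit-learn's `DummyClassifier`. The labels below are made up to mimic an imbalanced target and are not taken from the dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up imbalanced labels: 0 = 'no', 1 = 'yes' (illustrative only).
y_train = np.array([0] * 88 + [1] * 12)
y_test = np.array([0] * 22 + [1] * 3)

# A majority-class baseline ignores the features, so placeholders are enough.
X_train = np.zeros((len(y_train), 1))
X_test = np.zeros((len(y_test), 1))

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

# Test accuracy equals the share of the majority class in the test labels.
baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline test accuracy: {baseline_accuracy:.2%}")
```

A baseline of this kind sets the accuracy that any real classifier has to beat. On an imbalanced target that bar is already high, which is why the gaps between models in an accuracy ranking can look small.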

HYPERPARAMETER TUNING RESULTS--- After tuning the hyperparameters for each model, the models ranked differently than they did before tuning. Interestingly, the worst-performing classifier in the initial results became the best-performing classifier. Post-tuning, the models ranked by Test Accuracy in descending order are:

Decision Tree with 91.51%
Support Vector Machine with 91.18%
Logistic Regression with 91.14%
K Nearest Neighbors with 90.42%
Baseline model with 88.65%

This exercise shows the importance of tuning hyperparameters to improve the performance of machine learning models.
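The tuning step can be sketched with scikit-learn's `GridSearchCV`. The example below tunes a Decision Tree on synthetic data; the parameter grid, the 5-fold cross-validation, and the dataset are all assumptions chosen for illustration and do not reproduce the grids or accuracies reported above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (assumption: not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=8, weights=[0.88, 0.12], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical grid: limiting depth and leaf size curbs overfitting in a tree.
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# The best estimator is refitted on the full training set before scoring.
tuned_test_accuracy = search.score(X_test, y_test)
print("Best parameters:", search.best_params_)
print(f"Tuned test accuracy: {tuned_test_accuracy:.2%}")
```

An unconstrained decision tree tends to memorize its training data, so restricting its depth or leaf size is a plausible reason for a tree to move up a ranking after tuning.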

CUSTOMER RECOMMENDATION--- With Decision Tree as the best-performing classifier, we rebuilt the individual decision tree model and extracted the most important features. The top 2 features with the highest Importance Scores were:

Duration ("duration" in seconds)
Number of Employees ("nr.employed" quarterly)

The Decision Tree classifier assigned duration an importance score of 44.87%, which suggests that longer contact with a customer is associated with a higher likelihood that the customer will buy. The second feature, number of employees, received an importance score of 24.85%, which suggests that a higher number of employees where a customer is employed is also associated with a customer potentially buying. In other words, to improve the outcome of the marketing campaigns, the Decision Tree classifier recommends:

Focusing on increasing the duration of contact with a potential buying customer
Focusing on customers who work for companies with a larger number of employees

The decision tree algorithm will be provided to the customer for further analysis.
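Extracting importance scores from a fitted tree can be sketched as below, assuming scikit-learn's `feature_importances_` attribute. The data is randomly generated and the target is deliberately driven by `duration`, so the ranking here is built in by construction and says nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 500

# Randomly generated stand-in features (illustrative values, not real records).
X = pd.DataFrame(
    {
        "duration": rng.integers(0, 1000, n),
        "nr.employed": rng.normal(5100, 70, n),
        "age": rng.integers(18, 90, n),
    }
)

# Made-up target that depends only on duration, so the outcome is predictable.
y = (X["duration"] > 500).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Pair each score with its column name and sort from most to least important.
importances = pd.Series(tree.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In scikit-learn these scores sum to 1 and measure each feature's share of the impurity reduction across the tree's splits. They rank features by how much the tree relies on them; they do not give the direction of an effect or a change in purchase probability.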
