
StartrexII/ClassificationStudy


Study of classification algorithms

Table of contents

  1. Project Description
  2. Data description
  3. Libraries
  4. Project Installation
  5. Using the project
  6. Authors
  7. Conclusions

Project Description

A bank has asked for help: it wants to develop a loyalty campaign to retain customers. To do this, the bank needs to predict the probability of customer churn and determine whether a given customer will leave in the near future.

Task: build a classifier that identifies customers who are about to leave the bank early enough to act on it.

The aim of this project is to build such a model. It includes:

  • data cleaning (for more information about some of the cleaning methods used, see here)
  • feature engineering
  • feature transformation
  • building a logistic regression model and determining the optimal parameters of the model
  • building decision tree and random forest models and determining optimal model parameters
  • choosing the best algorithm
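The logistic regression step above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is a synthetic stand-in generated with `make_classification`, and the pipeline step names, the grid of `C` values, and the 3-fold cross-validation are assumptions chosen for the example. The degree-3 polynomial features and the F1 metric come from the Conclusions section.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic, imbalanced stand-in for the cleaned bank data (assumption:
# roughly 20% of customers belong to the positive "Exited" class).
X, y = make_classification(
    n_samples=600, n_features=5, n_informative=4, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Polynomial features of degree 3, scaling, then logistic regression.
pipe = Pipeline([
    ("poly", PolynomialFeatures(degree=3, include_bias=False)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000)),
])

# Hypothetical grid over the inverse regularization strength C.
c_values = [0.01, 0.1, 1.0, 10.0]
search = GridSearchCV(pipe, {"clf__C": c_values}, scoring="f1", cv=3)
search.fit(X_train, y_train)

best_c = search.best_params_["clf__C"]
test_f1 = f1_score(y_test, search.predict(X_test))
print(f"best C: {best_c}, test F1: {test_f1:.2f}")
```

The same `GridSearchCV` pattern applies to `DecisionTreeClassifier` and `RandomForestClassifier` by swapping the last pipeline step and the parameter grid.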

Project structure:

  • data - the folder with the original tabular data
  • plotly - a folder with charts for viewing them in the browser
  • dataResearch.ipynb - Jupyter notebook containing an analysis of categorical features
  • ML-3.Practice.Classification.ipynb - Jupyter notebook containing the main project code, data processing and algorithm training
  • requirements.txt - a file with the versions of the modules used, for reproducibility of the code

⬆️To the table of contents

Data description

The dataset contains various information about the bank's customers, including their status: whether they are still a customer or have already stopped using the bank's services (the target).

Dataset structure

  • RowNumber — table row number;
  • CustomerId — client ID;
  • Surname — client's last name;
  • CreditScore — the client's credit rating (the higher it is, the more loans the client has taken out and repaid);
  • Geography — client's country of residence (international bank);
  • Gender — client's gender;
  • Age — client's age;
  • Tenure — how many years the client has been using the bank;
  • Balance — how much money the client has in bank accounts;
  • NumOfProduct — number of bank services used by the client;
  • HasCrCard — whether the client has a credit card (1 — yes, 0 — no);
  • IsActiveMember — whether the client has the status of "active client" (1 — yes, 0 — no);
  • EstimatedSalary — estimated salary of the client;
  • Exited — churn status (1 is a departed client, 0 is a loyal client).
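As a rough illustration of how a table with this structure can be turned into model inputs, the sketch below builds a tiny hand-made frame with the listed columns, drops the identifier columns, and one-hot encodes the two text columns. All row values are made up for the example, and the choice to drop `RowNumber`, `CustomerId` and `Surname` is an assumption; the notebook may handle them differently.

```python
import pandas as pd

# Hand-made rows that follow the column list above (all values illustrative).
df = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4],
    "CustomerId": [101, 102, 103, 104],
    "Surname": ["A", "B", "C", "D"],
    "CreditScore": [620, 610, 500, 700],
    "Geography": ["France", "Spain", "France", "Germany"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Age": [42, 41, 42, 39],
    "Tenure": [2, 1, 8, 1],
    "Balance": [0.0, 83000.0, 159000.0, 0.0],
    "NumOfProduct": [1, 1, 3, 2],
    "HasCrCard": [1, 0, 1, 0],
    "IsActiveMember": [1, 1, 0, 0],
    "EstimatedSalary": [101000.0, 112000.0, 113000.0, 93000.0],
    "Exited": [1, 0, 1, 0],
})

# Assumption: identifier-like columns carry no predictive signal, so drop them
# together with the target before encoding.
features = df.drop(columns=["RowNumber", "CustomerId", "Surname", "Exited"])

# One-hot encode the categorical text columns; numeric columns pass through.
X = pd.get_dummies(features, columns=["Geography", "Gender"], dtype=int)
y = df["Exited"]

print(X.columns.tolist())
```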

⬆️To the table of contents

Libraries

⬆️To the table of contents

Project Installation

    git clone https://github.com/StartrexII/DataScienceProjects

⬆️To the table of contents

Using the project

Relationships between categorical features are explored in the Jupyter notebook dataResearch.ipynb. Everything else, including the distributions of numerical features, is in the Jupyter notebook ML-3.Practice.Classification.ipynb. If the charts are not displayed on GitHub, you can open them in a browser; they are in the plotly/ folder.

⬆️To the table of contents

Authors

⬆️To the table of contents

Conclusions

The best F1 score achieved was 0.68. Logistic regression was tried and improved by tuning the regularization parameters, adding polynomial features of degree 3, and selecting the probability threshold used to assign an object to a class. The tree-based algorithms turned out to be the most effective: the random forest performed better on the training set, while its results on the test set were comparable to those of the decision tree. Selecting the decision threshold improved the final metric (F1). It is worth noting that at the threshold where the best balance between recall and precision is reached, the random forest scores noticeably better than the decision tree (0.68 for the random forest, 0.64 for the decision tree).
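Threshold selection of the kind described above can be sketched in plain Python. This is only an illustration: the labels and predicted probabilities are made-up values, the helper `f1_at_threshold` is a hypothetical name, and the grid of candidate thresholds (0.05 to 0.95 in steps of 0.05) is an assumption. In practice the probabilities would come from a fitted model's `predict_proba`, and the threshold should be chosen on validation data rather than on the test set.

```python
def f1_at_threshold(y_true, proba, threshold):
    """F1 score when every probability >= threshold is predicted as class 1."""
    pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels (1 = departed client) and predicted probabilities of class 1.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.05, 0.10, 0.20, 0.35, 0.45, 0.55, 0.30, 0.40, 0.60, 0.90]

# Candidate thresholds from 0.05 to 0.95.
thresholds = [i / 20 for i in range(1, 20)]
best_threshold = max(thresholds, key=lambda t: f1_at_threshold(y_true, proba, t))

default_f1 = f1_at_threshold(y_true, proba, 0.5)
best_f1 = f1_at_threshold(y_true, proba, best_threshold)
print(f"F1 at 0.5: {default_f1:.3f}, best threshold: {best_threshold}, F1: {best_f1:.3f}")
```

On this toy data, lowering the threshold below 0.5 trades some precision for recall and raises F1, which is the effect the comparison of the two tree-based models relies on.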

⬆️To the table of contents

If this project seems interesting or useful to you, I would be very grateful if you starred the repository and profile ⭐️⭐️⭐️ :)
