Notebook: Customer Churn Analysis Using ML Classification Models.ipynb
This notebook demonstrates a full supervised learning workflow to predict customer churn (whether a customer will leave a service) using classical machine learning classifiers and simple feature selection. The goal is to identify customers at risk of churning so business teams can prioritize retention actions.
The workflow covers exploratory data analysis (EDA), preprocessing, feature selection, training of multiple classifiers, cross-validation, hyperparameter tuning (GridSearchCV), and evaluation with several metrics (accuracy, F1, confusion matrix, ROC/AUC).
The notebook reads the dataset directly from GitHub:
https://raw.githubusercontent.com/iamnaveen1401/Datasets/refs/heads/main/customer_churn.csv
This dataset includes (example) columns: customerID, Gender, SeniorCitizen, Partner, Dependents, Tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn.
If you prefer to store the CSV locally, place it in a data/ folder and update the notebook cell that loads the CSV.
- Open PowerShell and change to the project folder:

  ```powershell
  Set-Location -Path "C:\Users\iamna\Desktop\My Projects"
  ```

- Create and activate a virtual environment:

  ```powershell
  python -m venv .venv
  # PowerShell activation
  .\.venv\Scripts\Activate.ps1
  ```

  If activation fails (Exit Code 1), you may need to change the PowerShell execution policy (only if you trust the environment):

  ```powershell
  # Run as Administrator if required
  Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
  # Then try activation again
  .\.venv\Scripts\Activate.ps1
  ```

- Install the common dependencies used by the notebook:

  ```powershell
  python -m pip install --upgrade pip
  pip install pandas numpy matplotlib seaborn scikit-learn xgboost jupyterlab notebook
  ```

- Start Jupyter and open the notebook:

  ```powershell
  jupyter lab
  # or
  jupyter notebook
  ```
The notebook walks through the following stages:

**Project header & objective**

- Short description: predict customer churn to support retention strategies.
**Data loading & initial inspection**

- Loads the CSV from the GitHub URL and uses `df.head()` and `df.info()` to inspect the shape, dtypes, and missing values.
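A minimal sketch of this step (the URL is the one listed above; use a local path under `data/` instead if you stored the CSV locally):

```python
import pandas as pd

# Load the churn dataset straight from GitHub
URL = "https://raw.githubusercontent.com/iamnaveen1401/Datasets/refs/heads/main/customer_churn.csv"
df = pd.read_csv(URL)

# First rows, dtypes/non-null counts, and a per-column missing-value tally
print(df.head())
df.info()
print(df.isna().sum())
```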
**Data cleaning (important detail)**

- The `TotalCharges` column can contain blank strings representing missing values. The notebook replaces blanks (`" "`) with `0` and converts the column to `float`:

  ```python
  df['TotalCharges'] = df['TotalCharges'].replace(" ", 0)
  df['TotalCharges'] = df['TotalCharges'].astype(float)
  ```

- This cleaning step is important because the blank strings otherwise prevent numeric operations and the conversion to `float`.
**Exploratory Data Analysis (EDA)**

- Categorical variables are plotted with countplots (`hue='Churn'`) to inspect their distribution by churn.
- Numerical columns are visualized (KDEs, boxplots, scatterplots) and correlations are shown with a heatmap.
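A hedged sketch of the kinds of plots described above, assuming `df` is the cleaned dataframe; `Contract` and `MonthlyCharges` are example columns from the schema listed earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Categorical distribution split by churn status
sns.countplot(data=df, x="Contract", hue="Churn")
plt.title("Contract type by churn")
plt.show()

# Numeric distribution by churn
sns.kdeplot(data=df, x="MonthlyCharges", hue="Churn", fill=True)
plt.show()

# Correlation heatmap over numeric columns only
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```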
**Preprocessing**

- Categorical variables are label-encoded using `sklearn.preprocessing.LabelEncoder` (the notebook stores the fitted encoders in a dict).
- Feature and target split: `X = df.drop(columns=['customerID', 'Churn'])` and `y = df['Churn']`.
- Train/test split uses `test_size=0.2` and `random_state=42`.
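A minimal sketch of this step, assuming `df` is the cleaned dataframe from above (the dict and variable names are illustrative, not necessarily those used in the notebook):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Label-encode every remaining string column, keeping each fitted encoder
# so the mapping can be inspected or inverted later
label_encoders = {}
for col in df.select_dtypes(include="object").columns:
    if col == "customerID":
        continue  # identifier column is dropped below, no need to encode it
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Feature matrix and target vector
X = df.drop(columns=["customerID", "Churn"])
y = df["Churn"]

# 80/20 train-test split with the seed quoted above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```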
**Feature selection**

- Random Forest feature importances are computed with a `RandomForestClassifier` (example: `n_estimators=200, max_depth=4`).
- `SelectKBest` with `f_classif` is also used to choose the top-k features (k=10 in the notebook).
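A sketch of both selection approaches, assuming `X_train`/`y_train` come from the split above; the Random Forest settings mirror the example values quoted, everything else is illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Impurity-based feature importances from a Random Forest
rf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))

# Univariate selection of the top 10 features with the ANOVA F-statistic
selector = SelectKBest(score_func=f_classif, k=10)
X_train_k = selector.fit_transform(X_train, y_train)
X_test_k = selector.transform(X_test)
print("Selected features:", list(X_train.columns[selector.get_support()]))
```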
**Modeling**

- Numerical columns are scaled with `StandardScaler`.
- The notebook trains and compares multiple classifiers:
  - Logistic Regression
  - K-Nearest Neighbors
  - Decision Tree
  - Random Forest
  - Support Vector Classifier (SVC)
  - Gaussian Naive Bayes
  - XGBoost (`xgboost.XGBClassifier`)
- Each classifier is trained and evaluated on the test set; accuracy and F1 score are recorded.
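A hedged sketch of the scale-then-compare loop, assuming the train/test split above. The classifier settings are defaults rather than the notebook's exact parameters, and for simplicity the sketch scales the full (already label-encoded) feature matrix, whereas the notebook scales only the numerical columns:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier

# Scale features (fit on the training split only, to avoid leakage)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVC": SVC(probability=True, random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

# Fit each model and record test-set accuracy and F1
results = {}
for name, clf in models.items():
    clf.fit(X_train_s, y_train)
    preds = clf.predict(X_test_s)
    results[name] = {"accuracy": accuracy_score(y_test, preds),
                     "f1": f1_score(y_test, preds)}

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```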
**Evaluation & visualization**

- Confusion matrix (display and heatmap), classification report (precision/recall/F1), and ROC curves with AUC for each classifier.
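A sketch of the evaluation step for a single trained model, reusing names from the modeling sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, RocCurveDisplay,
                             classification_report, confusion_matrix,
                             roc_auc_score)

model = models["Logistic Regression"]
preds = model.predict(X_test_s)

# Confusion matrix and per-class precision/recall/F1
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
ConfusionMatrixDisplay.from_predictions(y_test, preds)
plt.show()

# ROC curve and AUC use the predicted probability of the positive class
probs = model.predict_proba(X_test_s)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
RocCurveDisplay.from_predictions(y_test, probs)
plt.show()
```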
**Cross-validation**

- 5-fold `KFold` cross-validation is applied using `cross_val_score` as an overall stability check.
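A minimal sketch of that check; the estimator and scoring metric here are placeholders, since the notebook may loop over all of the classifiers:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# 5-fold CV on the full feature matrix for a quick stability check
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv,
                         scoring="f1_weighted")
print("Fold scores:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```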
**Hyperparameter tuning**

- `GridSearchCV` is used (the example provides a grid for `RandomForestClassifier` over `n_estimators`, `max_depth`, and `min_samples_split`) with `scoring='f1_weighted'`.
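A sketch of that grid search; the grid values themselves are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [4, 6, 8, None],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1_weighted",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV f1_weighted:", grid.best_score_)
```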
**Final comparisons and conclusions**

- The notebook prints model comparisons and notes that Logistic Regression performs strongly on this dataset (per the provided results).
- Inputs: a CSV file with the columns listed above. The notebook expects a `Churn` column (Yes/No), and `TotalCharges` may contain blank strings.
- Outputs: evaluation tables and plots, plus trained model objects in memory. The notebook prints and plots metrics (accuracy, F1, confusion matrix, ROC/AUC) and displays feature importances.
- Use deterministic random seeds where possible: `random_state=42` is used for `train_test_split` and `KFold`.
- For full reproducibility across libraries, also set seeds for `numpy`, `random`, and any other frameworks used.
- If results must be kept, save models with `joblib.dump(model, 'models/model_name.joblib')`.
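A minimal sketch of both suggestions, assuming `model` is any fitted estimator; the file path is an example:

```python
import os
import random

import joblib
import numpy as np

# Fix seeds for Python and NumPy on top of the random_state arguments passed to sklearn
random.seed(42)
np.random.seed(42)

# Persist a trained model so results can be reloaded without retraining
os.makedirs("models", exist_ok=True)
joblib.dump(model, "models/model_name.joblib")

# Later, reload it for scoring or inspection
loaded_model = joblib.load("models/model_name.joblib")
```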
- Activation failing with Exit Code 1: adjust PowerShell execution policy as shown above.
- `TotalCharges` conversion error: check for empty strings; the notebook already replaces these with `0` before casting to float.
- `xgboost` import error: install it via `pip install xgboost` (it may require a compiler on some systems; prefer a prebuilt wheel or install via conda if needed).
- If plotting or display is not working in Jupyter, make sure you launched Jupyter from the environment where the packages are installed, and restart the kernel after installing.
You can create a requirements.txt with the following minimal packages (pin versions once you confirm working versions):
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- xgboost
- jupyterlab
- notebook
- Ensure your virtual environment is active (see setup above).
- Install dependencies.
- Launch Jupyter Lab/Notebook in this folder.
- Open `Customer Churn Analysis Using ML Classification Models.ipynb`.
- Run the cells from the top. Inspect the data-loading and preprocessing cells first (especially the `TotalCharges` cleaning) to confirm the dataset path and any local adjustments you need.
- Save final trained models to `models/` and outputs (reports, confusion matrices) to `results/` for reproducibility.
- Add a small sample of the dataset to `data/sample_customer_churn.csv` so others can quickly test the notebook without downloading the full dataset.
- Address class imbalance (if present) using class weights, SMOTE, or subsampling.
- Add an automated evaluation script (e.g., `evaluate.py`) to run evaluations without opening the notebook; a sketch follows this list.
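One possible shape for that script; the model path, test-set path, and column names below are assumptions, not files the project currently ships:

```python
# evaluate.py - hypothetical standalone evaluation script
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, f1_score


def main():
    # Assumed artifacts: a saved model and an already-preprocessed hold-out set
    model = joblib.load("models/model_name.joblib")
    test = pd.read_csv("data/test_preprocessed.csv")
    X_test = test.drop(columns=["Churn"])
    y_test = test["Churn"]

    preds = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, preds))
    print("F1:", f1_score(y_test, preds))
    print(classification_report(y_test, preds))


if __name__ == "__main__":
    main()
```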
- Add your preferred license (MIT, Apache 2.0, etc.) and author contact information here.
Possible follow-ups:
- Create a `requirements.txt` in this folder with pinned versions observed from a working run.
- Add a short `data/README.md` describing where to download the dataset and a small sample.
- Condense this README further or add a one-page summary for sharing.