
This project predicts customer churn using various machine learning classification models. It includes cross-validation, hyperparameter tuning with GridSearchCV, and model performance comparison to identify the best model for churn prediction and customer retention insights.


Customer Churn Analysis Using ML Classification Models

Notebook: Customer Churn Analysis Using ML Classification Models.ipynb


Project overview

This notebook demonstrates a full supervised learning workflow to predict customer churn (whether a customer will leave a service) using classical machine learning classifiers and simple feature selection. The goal is to identify customers at risk of churning so business teams can prioritize retention actions.

The notebook covers exploratory data analysis (EDA), preprocessing, feature selection, training of multiple classifiers, cross-validation, hyperparameter tuning (GridSearchCV), and evaluation with several metrics (accuracy, F1, confusion matrix, ROC/AUC).

Data source

The notebook reads the dataset directly from GitHub:

https://raw.githubusercontent.com/iamnaveen1401/Datasets/refs/heads/main/customer_churn.csv

This dataset includes (example) columns: customerID, Gender, SeniorCitizen, Partner, Dependents, Tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn.

If you prefer to store the CSV locally, place it in a data/ folder and update the notebook cell that loads the CSV.
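For reference, a minimal loading cell that supports either source might look like the following (the data/customer_churn.csv path is only an example; adjust it to wherever you store the file):

import pandas as pd

# Load directly from GitHub, as the notebook does by default
url = "https://raw.githubusercontent.com/iamnaveen1401/Datasets/refs/heads/main/customer_churn.csv"
df = pd.read_csv(url)

# Or load a local copy instead (example path; adjust as needed)
# df = pd.read_csv("data/customer_churn.csv")

print(df.shape)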

Quick setup (Windows PowerShell)

  1. Open PowerShell and change to the project folder:
Set-Location -Path "C:\Users\iamna\Desktop\My Projects"
  2. Create and activate a virtual environment:
python -m venv .venv
# PowerShell activation
.\.venv\Scripts\Activate.ps1

If activation fails (Exit Code 1), you may need to change the PowerShell execution policy (only if you trust the environment):

# Run as Administrator if required
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
# Then try activation again
.\.venv\Scripts\Activate.ps1
  3. Install common dependencies used by the notebook:
python -m pip install --upgrade pip
pip install pandas numpy matplotlib seaborn scikit-learn xgboost jupyterlab notebook
  4. Start Jupyter and open the notebook:
jupyter lab
# or
jupyter notebook

What the notebook does (section-by-section)

  1. Project header & objective

    • Short description: predict customer churn to support retention strategies.
  2. Data loading & initial inspection

    • Loads CSV from the GitHub URL and uses df.head() and df.info() to inspect shapes, dtypes and missing values.
  3. Data cleaning (important detail)

    • The TotalCharges column can contain blank strings representing missing values. The notebook replaces blanks (" ") with 0 and converts the column to float:
      • df['TotalCharges'] = df['TotalCharges'].replace(" ", 0)
      • df['TotalCharges'] = df['TotalCharges'].astype(float)
    • This cleaning step is important because blank strings otherwise prevent numeric ops and conversions.
  4. Exploratory Data Analysis (EDA)

    • Categorical variables are plotted with countplots (hue=Churn) to inspect distribution by churn.
    • Numerical columns are visualized (KDEs, boxplots, scatterplots) and correlations are printed with a heatmap.
  5. Preprocessing

    • Categorical variables are label-encoded using sklearn.preprocessing.LabelEncoder, and the fitted encoders are kept in a dictionary for later reuse (a condensed preprocessing sketch appears after this list).
    • Feature and target splits:
      • X = df.drop(columns=['customerID', 'Churn'])
      • y = df['Churn']
    • Train/test split uses test_size=0.2 and random_state=42.
  6. Feature selection

    • Random Forest feature importances are computed with a RandomForestClassifier (example: n_estimators=200, max_depth=4).
    • SelectKBest with f_classif is also used to choose top-k features (k=10 in the notebook).
  7. Modeling

    • Numerical columns are scaled with StandardScaler.
    • The notebook trains and compares multiple classifiers:
      • Logistic Regression
      • K-Nearest Neighbors
      • Decision Tree
      • Random Forest
      • Support Vector Classifier (SVC)
      • Gaussian Naive Bayes
      • XGBoost (xgboost.XGBClassifier)
    • Each classifier is trained and evaluated on the test set; accuracy and F1 score are recorded (a model-comparison sketch appears after this list).
  8. Evaluation & visualization

    • Confusion matrix (display and heatmap), classification report (precision/recall/F1), ROC curves and AUC for each classifier.
  9. Cross-validation

    • 5-fold KFold cross-validation is applied using cross_val_score for overall stability checks.
  10. Hyperparameter tuning

    • GridSearchCV is used with scoring='f1_weighted'; the notebook provides an example grid for RandomForestClassifier covering n_estimators, max_depth, and min_samples_split.
  11. Final comparisons and conclusions

    • The notebook prints a comparison of all models and notes that Logistic Regression performs strongly on this dataset (per the provided results).
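
A condensed, illustrative sketch of the cleaning, encoding, train/test split, and feature-selection steps described in items 3, 5 and 6 above (variable names and exact parameters approximate the notebook's cells rather than copying them verbatim):

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

df = pd.read_csv("https://raw.githubusercontent.com/iamnaveen1401/Datasets/refs/heads/main/customer_churn.csv")

# Clean TotalCharges: blank strings -> 0, then cast to float
df['TotalCharges'] = df['TotalCharges'].replace(" ", 0).astype(float)

# Label-encode every remaining categorical column and keep the fitted encoders
encoders = {}
for col in df.select_dtypes(include='object').columns.drop('customerID'):
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Feature/target split and an 80/20 train/test split
X = df.drop(columns=['customerID', 'Churn'])
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Random Forest feature importances
rf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=42)
rf.fit(X_train, y_train)
print(pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False))

# SelectKBest with the ANOVA F-test, keeping the top 10 features
selector = SelectKBest(score_func=f_classif, k=10)
X_train_k = selector.fit_transform(X_train, y_train)
X_test_k = selector.transform(X_test)
print(list(X.columns[selector.get_support()]))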
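
A matching sketch of the model-comparison, cross-validation, and GridSearchCV steps (items 7, 9 and 10). It continues from the variables defined in the preprocessing sketch above, and the grid values are illustrative:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

# X_train, X_test, y_train, y_test come from the preprocessing sketch above.
# The notebook scales only the numerical columns; everything is scaled here for brevity.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVC": SVC(probability=True),
    "Naive Bayes": GaussianNB(),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}

# Train each classifier and record accuracy and F1 on the test set
for name, model in models.items():
    model.fit(X_train_s, y_train)
    pred = model.predict(X_test_s)
    print(f"{name}: accuracy={accuracy_score(y_test, pred):.3f}, F1={f1_score(y_test, pred):.3f}")

# 5-fold cross-validation as a stability check
cv = KFold(n_splits=5, shuffle=True, random_state=42)
print(cross_val_score(LogisticRegression(max_iter=1000), X_train_s, y_train, cv=cv, scoring="f1_weighted"))

# Hyperparameter tuning for the Random Forest with GridSearchCV
param_grid = {"n_estimators": [100, 200], "max_depth": [4, 6, 8], "min_samples_split": [2, 5]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, scoring="f1_weighted", cv=5)
grid.fit(X_train_s, y_train)
print(grid.best_params_, grid.best_score_)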

Inputs / outputs contract

  • Inputs: a CSV file with the columns listed above. The notebook expects a Churn column with Yes/No values, and TotalCharges may contain blank strings.
  • Outputs: evaluation tables and plots, trained model objects in memory. The notebook prints and plots metrics (accuracy, F1, confusion matrix, ROC/AUC) and displays feature importance.

Reproducibility & recommended settings

  • Use deterministic random seeds where possible:
    • random_state=42 is used for train_test_split and KFold.
    • For full reproducibility across libraries, set seeds for numpy, random, and frameworks used.
  • If trained models need to be kept, save them with joblib.dump(model, 'models/model_name.joblib') and reload them later with joblib.load (see the sketch below).
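
For example (the variable name model and the file path are placeholders):

import os, random
import numpy as np
import joblib

# Set seeds up front for reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# Persist a fitted estimator ('model' is any trained model from the notebook)
os.makedirs("models", exist_ok=True)
joblib.dump(model, "models/model_name.joblib")

# Reload it later without retraining
model = joblib.load("models/model_name.joblib")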

Common pitfalls & troubleshooting

  • Activation failing with Exit Code 1: adjust PowerShell execution policy as shown above.
  • TotalCharges conversion error: check for empty strings; the notebook already replaces these with 0 before casting to float.
  • XGBoost import error: install xgboost via pip install xgboost (prebuilt wheels cover most platforms; if pip falls back to building from source and fails, consider installing via conda instead).
  • If plotting or display is not working in Jupyter, ensure you launched the environment where packages are installed and restart the kernel after installing.

Suggested dependency list

You can create a requirements.txt with the following minimal packages (pin versions once you confirm working versions):

  • pandas
  • numpy
  • matplotlib
  • seaborn
  • scikit-learn
  • xgboost
  • jupyterlab
  • notebook

A requirements.txt with pinned versions can be added once a known-good set of package versions has been confirmed from a working run.

How to run (step-by-step)

  1. Ensure your virtual environment is active (see setup above).
  2. Install dependencies.
  3. Launch Jupyter Lab/Notebook in this folder.
  4. Open Customer Churn Analysis Using ML Classification Models.ipynb.
  5. Run cells from the top. Inspect the data-loading and preprocessing cells first (especially the TotalCharges cleaning) to confirm the dataset path and make any local adjustments you need.

Extensions & next steps (optional improvements)

  • Save final trained models to models/ and outputs (reports, confusion matrices) to results/ for reproducibility.
  • Add a small sample of the dataset to data/sample_customer_churn.csv so others can quickly test the notebook without downloading the full dataset.
  • Address class imbalance (if present) using class weights, SMOTE, or subsampling (a brief sketch follows this list).
  • Add an automated evaluation script (e.g., evaluate.py) to run evaluations without opening the notebook.
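
For the class-imbalance item above, two common starting points look like this (illustrative only; SMOTE requires the imbalanced-learn package, which is not in the dependency list above):

# Option 1: class weights, no extra dependency
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000, class_weight="balanced")

# Option 2: SMOTE oversampling, applied to the training split only (pip install imbalanced-learn)
from imblearn.over_sampling import SMOTE
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)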

License & contact

  • Add your preferred license (MIT, Apache 2.0, etc.) and author contact information here.

Possible next steps:

  • Create a requirements.txt in this folder with pinned versions observed from a working run.
  • Add a short data/README.md describing where to download the dataset, plus a small sample file.
  • Condense this README further, or produce a one-page summary for sharing.
