Notebook: Customer Churn Analysis Using ML Classification Models.ipynb
This notebook demonstrates a full supervised learning workflow to predict customer churn (whether a customer will leave a service) using classical machine learning classifiers and simple feature selection. The goal is to identify customers at risk of churning so business teams can prioritize retention actions.
The workflow covers exploratory data analysis (EDA), preprocessing, feature selection, training of multiple classifiers, cross-validation, hyperparameter tuning (GridSearchCV), and evaluation with several metrics (accuracy, F1, confusion matrix, ROC/AUC).
The notebook reads the dataset directly from GitHub:
https://raw.githubusercontent.com/iamnaveen1401/Datasets/refs/heads/main/customer_churn.csv
This dataset includes (example) columns: customerID, Gender, SeniorCitizen, Partner, Dependents, Tenure, PhoneService, MultipleLines, InternetService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, Contract, PaperlessBilling, PaymentMethod, MonthlyCharges, TotalCharges, Churn.
If you prefer to store the CSV locally, place it in a data/ folder and update the notebook cell that loads the CSV.
- Open PowerShell and change to the project folder:

  ```powershell
  Set-Location -Path "C:\Users\iamna\Desktop\My Projects"
  ```

- Create and activate a virtual environment:

  ```powershell
  python -m venv .venv
  # PowerShell activation
  .\.venv\Scripts\Activate.ps1
  ```

  If activation fails (Exit Code 1), you may need to change the PowerShell execution policy (only if you trust the environment):

  ```powershell
  # Run as Administrator if required
  Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope CurrentUser
  # Then try activation again
  .\.venv\Scripts\Activate.ps1
  ```

- Install the common dependencies used by the notebook:

  ```powershell
  python -m pip install --upgrade pip
  pip install pandas numpy matplotlib seaborn scikit-learn xgboost jupyterlab notebook
  ```

- Start Jupyter and open the notebook:

  ```powershell
  jupyter lab
  # or
  jupyter notebook
  ```
The notebook walks through the following stages:

**Project header & objective**

- Short description: predict customer churn to support retention strategies.
**Data loading & initial inspection**

- Loads the CSV from the GitHub URL and uses `df.head()` and `df.info()` to inspect the shape, dtypes, and missing values.
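A minimal sketch of this step (the URL is the one listed above; use a local path under `data/` instead if you stored the CSV locally):

```python
import pandas as pd

# Load the churn dataset straight from GitHub
URL = "https://raw.githubusercontent.com/iamnaveen1401/Datasets/refs/heads/main/customer_churn.csv"
df = pd.read_csv(URL)

# First rows, dtypes/non-null counts, and a per-column missing-value tally
print(df.head())
df.info()
print(df.isna().sum())
```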
**Data cleaning (important detail)**

- The `TotalCharges` column can contain blank strings representing missing values. The notebook replaces blanks (`" "`) with `0` and converts the column to `float`:

  ```python
  df['TotalCharges'] = df['TotalCharges'].replace(" ", 0)
  df['TotalCharges'] = df['TotalCharges'].astype(float)
  ```

- This cleaning step is important because the blank strings otherwise prevent numeric operations and the conversion to `float`.
**Exploratory Data Analysis (EDA)**

- Categorical variables are plotted with countplots (`hue='Churn'`) to inspect their distribution by churn.
- Numerical columns are visualized (KDEs, boxplots, scatterplots) and correlations are shown with a heatmap.
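A hedged sketch of the kinds of plots described above, assuming `df` is the cleaned dataframe; `Contract` and `MonthlyCharges` are example columns from the schema listed earlier:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Categorical distribution split by churn status
sns.countplot(data=df, x="Contract", hue="Churn")
plt.title("Contract type by churn")
plt.show()

# Numeric distribution by churn
sns.kdeplot(data=df, x="MonthlyCharges", hue="Churn", fill=True)
plt.show()

# Correlation heatmap over numeric columns only
corr = df.select_dtypes(include="number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```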
**Preprocessing**

- Categorical variables are label-encoded using `sklearn.preprocessing.LabelEncoder` (the notebook stores the fitted encoders in a dict).
- Feature and target split: `X = df.drop(columns=['customerID', 'Churn'])` and `y = df['Churn']`.
- Train/test split uses `test_size=0.2` and `random_state=42`.
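A minimal sketch of this step, assuming `df` is the cleaned dataframe from above (the dict and variable names are illustrative, not necessarily those used in the notebook):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Label-encode every remaining string column, keeping each fitted encoder
# so the mapping can be inspected or inverted later
label_encoders = {}
for col in df.select_dtypes(include="object").columns:
    if col == "customerID":
        continue  # identifier column is dropped below, no need to encode it
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Feature matrix and target vector
X = df.drop(columns=["customerID", "Churn"])
y = df["Churn"]

# 80/20 train-test split with the seed quoted above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```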
**Feature selection**

- Random Forest feature importances are computed with a `RandomForestClassifier` (example: `n_estimators=200, max_depth=4`).
- `SelectKBest` with `f_classif` is also used to choose the top-k features (k=10 in the notebook).
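A sketch of both selection approaches, assuming `X_train`/`y_train` come from the split above; the Random Forest settings mirror the example values quoted, everything else is illustrative:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Impurity-based feature importances from a Random Forest
rf = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))

# Univariate selection of the top 10 features with the ANOVA F-statistic
selector = SelectKBest(score_func=f_classif, k=10)
X_train_k = selector.fit_transform(X_train, y_train)
X_test_k = selector.transform(X_test)
print("Selected features:", list(X_train.columns[selector.get_support()]))
```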
**Modeling**

- Numerical columns are scaled with `StandardScaler`.
- The notebook trains and compares multiple classifiers:
  - Logistic Regression
  - K-Nearest Neighbors
  - Decision Tree
  - Random Forest
  - Support Vector Classifier (SVC)
  - Gaussian Naive Bayes
  - XGBoost (`xgboost.XGBClassifier`)
- Each classifier is trained and evaluated on the test set; accuracy and F1 score are recorded.
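A hedged sketch of the scale-then-compare loop, assuming the train/test split above. The classifier settings are defaults rather than the notebook's exact parameters, and for simplicity the sketch scales the full (already label-encoded) feature matrix, whereas the notebook scales only the numerical columns:

```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, f1_score
from xgboost import XGBClassifier

# Scale features (fit on the training split only, to avoid leakage)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "SVC": SVC(probability=True, random_state=42),
    "Gaussian Naive Bayes": GaussianNB(),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

# Fit each model and record test-set accuracy and F1
results = {}
for name, clf in models.items():
    clf.fit(X_train_s, y_train)
    preds = clf.predict(X_test_s)
    results[name] = {"accuracy": accuracy_score(y_test, preds),
                     "f1": f1_score(y_test, preds)}

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```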
**Evaluation & visualization**

- Confusion matrix (display and heatmap), classification report (precision/recall/F1), and ROC curves with AUC for each classifier.
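A sketch of the evaluation step for a single trained model, reusing names from the modeling sketch above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (ConfusionMatrixDisplay, RocCurveDisplay,
                             classification_report, confusion_matrix,
                             roc_auc_score)

model = models["Logistic Regression"]
preds = model.predict(X_test_s)

# Confusion matrix and per-class precision/recall/F1
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))
ConfusionMatrixDisplay.from_predictions(y_test, preds)
plt.show()

# ROC curve and AUC use the predicted probability of the positive class
probs = model.predict_proba(X_test_s)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
RocCurveDisplay.from_predictions(y_test, probs)
plt.show()
```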
**Cross-validation**

- 5-fold `KFold` cross-validation is applied using `cross_val_score` as an overall stability check.
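A minimal sketch of that check; the estimator and scoring metric here are placeholders, since the notebook may loop over all of the classifiers:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# 5-fold CV on the full feature matrix for a quick stability check
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv,
                         scoring="f1_weighted")
print("Fold scores:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```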
**Hyperparameter tuning**

- `GridSearchCV` is used (the example provides a grid for `RandomForestClassifier` over `n_estimators`, `max_depth`, and `min_samples_split`) with `scoring='f1_weighted'`.
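A sketch of that grid search; the grid values themselves are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [4, 6, 8, None],
    "min_samples_split": [2, 5, 10],
}

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1_weighted",
    cv=5,
    n_jobs=-1,
)
grid.fit(X_train, y_train)
print("Best params:", grid.best_params_)
print("Best CV f1_weighted:", grid.best_score_)
```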
**Final comparisons and conclusions**

- The notebook prints model comparisons and notes that Logistic Regression performs strongly on this dataset (per the provided results).
- Inputs: a CSV file with the columns listed above. The notebook expects a `Churn` column (Yes/No), and `TotalCharges` may contain blank strings.
- Outputs: evaluation tables and plots, plus trained model objects in memory. The notebook prints and plots metrics (accuracy, F1, confusion matrix, ROC/AUC) and displays feature importances.
- Use deterministic random seeds where possible: `random_state=42` is used for `train_test_split` and `KFold`.
- For full reproducibility across libraries, also set seeds for `numpy`, `random`, and any other frameworks used.
- If results must be kept, save models with `joblib.dump(model, 'models/model_name.joblib')`.
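A minimal sketch of both suggestions, assuming `model` is any fitted estimator; the file path is an example:

```python
import os
import random

import joblib
import numpy as np

# Fix seeds for Python and NumPy on top of the random_state arguments passed to sklearn
random.seed(42)
np.random.seed(42)

# Persist a trained model so results can be reloaded without retraining
os.makedirs("models", exist_ok=True)
joblib.dump(model, "models/model_name.joblib")

# Later, reload it for scoring or inspection
loaded_model = joblib.load("models/model_name.joblib")
```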
- Activation failing with Exit Code 1: adjust PowerShell execution policy as shown above.
- `TotalCharges` conversion error: check for empty strings; the notebook already replaces these with `0` before casting to float.
- `xgboost` import error: install it via `pip install xgboost` (it may require a compiler on some systems; prefer a prebuilt wheel or install via conda if needed).
- If plotting or display is not working in Jupyter, make sure you launched Jupyter from the environment where the packages are installed, and restart the kernel after installing.
You can create a requirements.txt with the following minimal packages (pin versions once you confirm working versions):
- pandas
- numpy
- matplotlib
- seaborn
- scikit-learn
- xgboost
- jupyterlab
- notebook
- Ensure your virtual environment is active (see setup above).
- Install dependencies.
- Launch Jupyter Lab/Notebook in this folder.
- Open `Customer Churn Analysis Using ML Classification Models.ipynb`.
- Run the cells from the top. Inspect the data-loading and preprocessing cells first (especially the `TotalCharges` cleaning) to confirm the dataset path and any local adjustments you need.
- Save final trained models to `models/` and outputs (reports, confusion matrices) to `results/` for reproducibility.
- Add a small sample of the dataset to `data/sample_customer_churn.csv` so others can quickly test the notebook without downloading the full dataset.
- Address class imbalance (if present) using class weights, SMOTE, or subsampling.
- Add an automated evaluation script (e.g., `evaluate.py`) to run evaluations without opening the notebook; a sketch follows this list.
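One possible shape for that script; the model path, test-set path, and column names below are assumptions, not files the project currently ships:

```python
# evaluate.py - hypothetical standalone evaluation script
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report, f1_score


def main():
    # Assumed artifacts: a saved model and an already-preprocessed hold-out set
    model = joblib.load("models/model_name.joblib")
    test = pd.read_csv("data/test_preprocessed.csv")
    X_test = test.drop(columns=["Churn"])
    y_test = test["Churn"]

    preds = model.predict(X_test)
    print("Accuracy:", accuracy_score(y_test, preds))
    print("F1:", f1_score(y_test, preds))
    print(classification_report(y_test, preds))


if __name__ == "__main__":
    main()
```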
- Add your preferred license (MIT, Apache 2.0, etc.) and author contact information here.
Possible follow-ups:
- Create a `requirements.txt` in this folder with pinned versions observed from a working run.
- Add a short `data/README.md` describing where to download the dataset and a small sample.
- Condense this README further or add a one-page summary for sharing.