Spam emails are a persistent nuisance, but I found that machine learning offers a robust solution to filter them automatically. In this project, I built a spam detection model using Python and deployed it as an interactive web application with Streamlit.
- Project Overview
- Model Description
- Code Breakdown
- Example Usage
- Deployment Instructions
- Challenges and Experiences
- Key Takeaways
- License
- Acknowledgments
## Project Overview

- Dataset: SMS Spam Collection Dataset from Kaggle
- Packages:
  - `pandas`: data manipulation and analysis
  - `scikit-learn`: machine learning algorithms and utilities
  - `matplotlib`/`seaborn`: data visualization
  - `wordcloud`: word cloud generation for text analysis
  - `streamlit`: web app deployment
- Process:
- I preprocessed data with NLP techniques
- I performed text classification using Multinomial Naive Bayes (MNB)
- I evaluated the model with metrics and visualizations
- I deployed it as a web app using Streamlit
## Model Description

I used a Multinomial Naive Bayes (MNB) classifier, which is well suited to text classification tasks. An MNB model learns word frequencies from labeled examples and uses them to estimate whether a message is spam or ham (non-spam).
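To make that concrete, here is a minimal, self-contained sketch of the idea; the toy messages and labels below are illustrative, not drawn from my dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled data: 1 = spam, 0 = ham (illustrative only).
messages = [
    "Free tickets, claim your prize now",
    "Win cash instantly, click here",
    "Are we still meeting for lunch today?",
    "Can you send me the report by Friday?",
]
labels = [1, 1, 0, 0]

# CountVectorizer turns each message into a vector of word counts;
# MultinomialNB learns how often each word appears in spam vs. ham.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(messages)

model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["Free prize inside"])))  # -> [1] (spam)
```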
## Code Breakdown

- I imported libraries (`pandas`, `scikit-learn`, `matplotlib`, `seaborn`, `wordcloud`)
- I loaded the spam dataset using `pandas`
- I explored dataset characteristics (shape, statistics, missing values, duplicates)
- I visualized the spam/ham distribution with plots (e.g., bar charts)
- I created a binary target variable ("Spam") from the "Category" column
- I generated a WordCloud to visualize frequently used words in spam emails
- I cleaned the text data (e.g., removed punctuation, converted to lowercase)
- I used `CountVectorizer` to transform text messages into numerical features based on word frequency
- I explored alternative vectorization techniques (e.g., TF-IDF) for experimentation
- I split the data into training (80%) and testing (20%) sets (the first sketch after this list condenses the pipeline up to this point)
- I implemented an `evaluate_model` function (outlined in the second sketch after this list) that:
  - Trains my MNB model on the training data
  - Predicts labels for the training and testing sets
  - Generates evaluation metrics (accuracy, precision, recall, F1-score, confusion matrix, ROC curve)
- I created a `detect_spam` function that:
  - Accepts an email message as input
  - Uses my trained MNB model to classify it as spam or ham
- I built a Streamlit app with a user-friendly interface
- I included a text input box for users to enter email messages
- I displayed the spam/ham prediction upon submission
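Condensed, the data-preparation half of that pipeline looks roughly like this. The CSV file name and the assumption that the text column is named "Message" are mine; the raw Kaggle file may use different column names, so adjust to your copy:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the dataset (file name is illustrative) and drop duplicate rows.
df = pd.read_csv("spam.csv")
df = df.drop_duplicates()

# Binary target from the "Category" column: 1 = spam, 0 = ham.
df["Spam"] = (df["Category"] == "spam").astype(int)

# Basic cleaning: lowercase and strip punctuation.
df["Message"] = df["Message"].str.lower().str.replace(r"[^\w\s]", "", regex=True)

# Word-frequency features and an 80/20 train/test split.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["Message"])
X_train, X_test, y_train, y_test = train_test_split(
    X, df["Spam"], test_size=0.2, random_state=42
)

model = MultinomialNB()
```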
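And the `evaluate_model` function in outline; the exact signature and reporting in my notebook differ, so treat this as a sketch:

```python
from sklearn.metrics import (
    RocCurveDisplay,
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # Train on the training split.
    model.fit(X_train, y_train)

    # Predict on both splits to check for over/underfitting.
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # Report the metrics listed above.
    print("Train accuracy:", accuracy_score(y_train, train_pred))
    print("Test accuracy:", accuracy_score(y_test, test_pred))
    print("Precision:", precision_score(y_test, test_pred))
    print("Recall:", recall_score(y_test, test_pred))
    print("F1-score:", f1_score(y_test, test_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, test_pred))

    # ROC curve (renders via matplotlib).
    RocCurveDisplay.from_estimator(model, X_test, y_test)
```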
## Example Usage

```python
sample_email = "Free Tickets for IPL"
result = detect_spam(sample_email)
print(result)  # Output: "This is a Spam Email!"
```
The Streamlit app:

```python
import joblib
import streamlit as st

# Load my trained model. I saved the fitted vectorizer and MNB classifier
# together as a scikit-learn Pipeline with joblib, so the model can predict
# on raw text directly (the file name here is illustrative).
model = joblib.load("spam_model.joblib")

def detect_spam(email_text):
    prediction = model.predict([email_text])[0]
    return "This is a Ham Email!" if prediction == 0 else "This is a Spam Email!"

st.title("Spam Email Detection")
email_text = st.text_input("Enter an email message:", placeholder="Type your email here...")
if email_text:
    result = detect_spam(email_text)
    st.write(f"**Prediction**: {result}")
```
> **Note**: Make sure the trained model file is in place before running the app.
## Deployment Instructions

I installed the required packages:
```bash
pip install streamlit pandas scikit-learn matplotlib seaborn wordcloud
```
I saved the Streamlit app code in a file named `app.py`. I navigated to the directory containing `app.py` and ran:
```bash
streamlit run app.py
```
I opened http://localhost:8501 in my web browser to access the app.
## Challenges and Experiences

While designing this project, I encountered several challenges and learning experiences:
- Data Imbalance: I noticed the dataset had far more ham emails than spam, which initially skewed my model’s performance. I addressed this by experimenting with oversampling and by adjusting the class priors of my MNB model (scikit-learn’s `MultinomialNB` exposes priors rather than class weights); see the oversampling sketch after this list.
- Text Preprocessing: I found cleaning and normalizing text data (e.g., removing special characters, handling stopwords) to be critical but time-consuming. I learned to balance preprocessing rigor with computational efficiency.
- Feature Selection: I had to choose between `CountVectorizer` and TF-IDF, which required experimentation. While `CountVectorizer` was effective, exploring TF-IDF improved my understanding of feature weighting in text classification; see the TF-IDF sketch after this list.
- Model Evaluation: I deepened my insight into model performance trade-offs by interpreting metrics like precision, recall, and the ROC curve, especially for imbalanced datasets.
- Streamlit Deployment: I faced initial deployment issues, such as model serialization errors, which I resolved by using `joblib` to save and load my trained model; see the serialization sketch after this list.
- Learning Curve: I found gaining familiarity with NLP tools and Streamlit’s API challenging but rewarding, as it enabled me to build an interactive, user-friendly application.
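For the imbalance point, here is a minimal sketch of the kind of oversampling I mean, using `sklearn.utils.resample`; the `df` DataFrame and `Spam` column follow the code breakdown above:

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import resample

# Upsample the minority (spam) class to match the majority (ham) class.
# In practice, apply this to the training split only, to avoid leakage.
ham = df[df["Spam"] == 0]
spam = df[df["Spam"] == 1]
spam_upsampled = resample(spam, replace=True, n_samples=len(ham), random_state=42)
df_balanced = pd.concat([ham, spam_upsampled]).sample(frac=1, random_state=42)

# Alternative: keep the data as-is but fix equal class priors in MNB.
model = MultinomialNB(class_prior=[0.5, 0.5])
```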
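Swapping in TF-IDF is nearly a one-line change in scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Drop-in replacement for CountVectorizer: downweights words that are
# common across all messages instead of using raw counts.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["Message"])
```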
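And the `joblib` fix, sketched under the assumption that the vectorizer and classifier are bundled in a scikit-learn `Pipeline` so the saved model can predict on raw text (the file name matches the app code above but is illustrative):

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Bundle the vectorizer and classifier so a single object handles raw text.
pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(df["Message"], df["Spam"])

joblib.dump(pipeline, "spam_model.joblib")  # save once after training
model = joblib.load("spam_model.joblib")    # load inside the Streamlit app
```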
These experiences taught me the importance of iterative testing, robust preprocessing, and clear documentation for reproducible results.
## Key Takeaways

- I demonstrated a practical end-to-end pipeline for spam detection, from data preprocessing to web deployment.
- My Multinomial Naive Bayes model is a simple yet effective choice for text classification, with opportunities for further optimization (e.g., hyperparameter tuning, ensemble methods).
- I found that Streamlit simplifies the creation of interactive ML applications, making them accessible to non-technical users.
- I encourage experimenting with my code to explore advanced NLP techniques (e.g., word embeddings) or alternative models (e.g., SVM, deep learning) for improved performance.
## License

This project is licensed under the MIT License.
## Acknowledgments

- I thank Kaggle for providing the SMS Spam Collection Dataset.
- I thank Streamlit for an intuitive web app framework.
- I thank the open-source community for robust libraries like `scikit-learn` and `pandas`.
I welcome contributions! Feel free to submit issues or pull requests to enhance the project.