
This project demonstrates a practical approach to spam detection using machine learning and its deployment as a web application.


Enyaude/Email-Spam-Detection-App


Email Spam Detection

Project ID: 20945480 Python License: MIT

Spam emails are a persistent nuisance, but I found that machine learning offers a robust solution to filter them automatically. In this project, I built a spam detection model using Python and deployed it as an interactive web application with Streamlit.


Project Overview

  • Dataset: SMS Spam Collection Dataset from Kaggle
  • Packages:
    • pandas: I used this for data manipulation and analysis
    • scikit-learn: I used this for machine learning algorithms and utilities
    • matplotlib/seaborn: I used these for data visualization
    • wordcloud: I used this for word cloud generation for text analysis
    • streamlit: I used this for web app deployment
  • Process:
    • I preprocessed data with NLP techniques
    • I performed text classification using Multinomial Naive Bayes (MNB)
    • I evaluated the model with metrics and visualizations
    • I deployed it as a web app using Streamlit

Model Description

I used a Multinomial Naive Bayes (MNB) classifier, which is well suited to text classification. The model learns from labeled messages how often each word appears in spam versus ham (non-spam), then uses those per-class word frequencies to estimate which class a new message most likely belongs to.
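To make the word-frequency idea concrete, here is a minimal from-scratch sketch of Multinomial Naive Bayes scoring with Laplace smoothing. It is illustrative only and is not the code from this repository (which uses scikit-learn); the tiny training set and the helper names `fit` and `predict` are made up for the example.

```python
import math
from collections import Counter

# Tiny hypothetical training set (not taken from the project's dataset).
train = [
    ("win a free prize now", "spam"),
    ("free tickets claim now", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at lunch tomorrow", "ham"),
]

def fit(examples, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word likelihoods per class."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    log_prior = {c: math.log(n / len(examples)) for c, n in class_counts.items()}
    log_likelihood = {}
    for c, counts in word_counts.items():
        total = sum(counts.values()) + alpha * len(vocab)
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict(text, log_prior, log_likelihood):
    """Return the class with the highest log posterior; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in text.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

log_prior, log_likelihood = fit(train)
print(predict("free prize", log_prior, log_likelihood))      # spam
print(predict("lunch tomorrow", log_prior, log_likelihood))  # ham
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together, and the smoothing term `alpha` keeps a word that never appeared in one class from driving that class's probability to zero.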

Code Breakdown

1. Data Loading and Exploration

  • I imported libraries (pandas, scikit-learn, matplotlib, seaborn, wordcloud)
  • I loaded the spam dataset using pandas
  • I explored dataset characteristics (shape, statistics, missing values, duplicates)
  • I visualized spam/ham distribution with plots (e.g., bar charts)
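The exploration steps above could look roughly like the sketch below. It uses a small in-memory DataFrame as a stand-in for the Kaggle CSV, so the rows are invented; the `Message` column name and the decision to drop duplicates are assumptions, while `Category` is the column named in this README.

```python
import pandas as pd

# Stand-in for loading the Kaggle CSV with pd.read_csv; these rows are made up.
df = pd.DataFrame({
    "Category": ["ham", "spam", "ham", "ham", "spam", "ham"],
    "Message": [
        "See you at lunch",
        "WIN a FREE prize now!!!",
        "Call me later",
        "See you at lunch",          # exact duplicate of the first row
        "Free tickets, claim now",
        "Running late, sorry",
    ],
})

print(df.shape)                 # rows and columns
print(df.isnull().sum())        # missing values per column
print(df.duplicated().sum())    # number of fully duplicated rows

# Assumption: duplicates are dropped before modeling.
df = df.drop_duplicates().reset_index(drop=True)

# Class distribution; class_counts.plot(kind="bar") would chart it with matplotlib.
class_counts = df["Category"].value_counts()
print(class_counts)
```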

2. Data Preprocessing

  • I created a binary target variable ("Spam") from the "Category" column
  • I generated a WordCloud to visualize frequently used words in spam emails
  • I cleaned text data (e.g., removed punctuation, converted to lowercase)
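As a rough illustration of the target encoding and text cleaning described above, the following sketch builds the binary `Spam` column and applies a simple cleaning function. The `Message` and `Clean` column names and the `clean_text` helper are hypothetical, and the exact cleaning rules used in the project may differ.

```python
import string
import pandas as pd

# Made-up rows standing in for the real dataset.
df = pd.DataFrame({
    "Category": ["ham", "spam"],
    "Message": ["See you at lunch!", "WIN a FREE prize, now!!!"],
})

# Binary target: 1 for spam, 0 for ham.
df["Spam"] = (df["Category"] == "spam").astype(int)

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse repeated whitespace."""
    return " ".join(text.lower().translate(_PUNCT_TABLE).split())

df["Clean"] = df["Message"].apply(clean_text)
print(df[["Spam", "Clean"]])
```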

3. Feature Engineering

  • I used CountVectorizer to transform text messages into numerical features based on word frequency
  • I explored alternative vectorization techniques (e.g., TF-IDF) for experimentation
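The difference between the two vectorizers can be seen on a toy corpus. This is a small sketch with invented messages, not the project's feature pipeline: `CountVectorizer` stores raw word counts, while `TfidfVectorizer` down-weights words that occur in many messages and (by default) normalizes each row.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Invented messages for illustration.
messages = ["free prize free tickets", "lunch at noon", "free lunch"]

# Raw word counts: one column per vocabulary word.
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(messages)
free_col = count_vec.vocabulary_["free"]
print(X_counts.shape)
print(X_counts[0, free_col])  # "free" occurs twice in the first message

# TF-IDF: same vocabulary, but weighted by rarity and L2-normalized per row.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(messages)
print(X_tfidf.shape)
print(X_tfidf.toarray().round(2))
```

Either matrix can be passed to a Multinomial Naive Bayes classifier; which one performs better is an empirical question for the dataset at hand.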

4. Model Training and Evaluation

  • I split data into training (80%) and testing (20%) sets
  • I implemented an evaluate_model function that:
    • Trains my MNB model on the training data
    • Predicts labels for training and testing sets
    • Generates evaluation metrics (accuracy, precision, recall, F1-score, confusion matrix, ROC curve)
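One possible shape for such an `evaluate_model` function is sketched below. The signature, the returned dictionary, and the synthetic messages are assumptions, not the repository's actual implementation; the ROC curve is summarized here by its area (ROC AUC) rather than plotted, and the classifier is wrapped in a pipeline with `CountVectorizer` so it can accept raw text.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def evaluate_model(model, X_train, X_test, y_train, y_test):
    """Fit the model, then collect metrics on both splits (assumed signature)."""
    model.fit(X_train, y_train)
    report = {}
    for name, X, y in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X)
        spam_prob = model.predict_proba(X)[:, 1]  # probability of class 1 (spam)
        precision, recall, f1, _ = precision_recall_fscore_support(
            y, pred, average="binary", zero_division=0)
        report[name] = {
            "accuracy": accuracy_score(y, pred),
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "roc_auc": roc_auc_score(y, spam_prob),
            "confusion_matrix": confusion_matrix(y, pred),
        }
    return report

# Synthetic stand-in data: 10 spam and 10 ham messages.
spam = [f"free prize number {i} claim now" for i in range(10)]
ham = [f"meeting moved to room {i} see you there" for i in range(10)]
X = spam + ham
y = [1] * 10 + [0] * 10

# 80/20 split, stratified so both classes appear in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = make_pipeline(CountVectorizer(), MultinomialNB())
report = evaluate_model(model, X_train, X_test, y_train, y_test)
for split, metrics in report.items():
    print(split, {k: v for k, v in metrics.items() if k != "confusion_matrix"})
    print(metrics["confusion_matrix"])
```

The synthetic classes share no vocabulary, so the scores here are trivially perfect; on real data, comparing the train and test entries is a quick check for overfitting.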

5. Spam Detection Function

  • I created a detect_spam function that:
    • Accepts an email message as input
    • Uses my trained MNB model to classify it as spam or ham

6. Streamlit Deployment

  • I built a Streamlit app with a user-friendly interface
  • I included a text input box for users to enter email messages
  • I displayed the spam/ham prediction upon submission

Example Usage

Python Code

sample_email = "Free Tickets for IPL"
result = detect_spam(sample_email)
print(result)  # Output: "This is a Spam Email!"

Streamlit App Example

import streamlit as st

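# Load my trained model (replace with your model loading logic).
# detect_spam passes raw text to predict(), so the loaded object must bundle
# the fitted vectorizer with the classifier (e.g., a scikit-learn Pipeline).
model = ...  # Load my trained MNB model from disk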

def detect_spam(email_text):
    prediction = model.predict([email_text])[0]
    return "This is a Ham Email!" if prediction == 0 else "This is a Spam Email!"

st.title("Spam Email Detection")
email_text = st.text_input("Enter an email message:", placeholder="Type your email here...")
if email_text:
    result = detect_spam(email_text)
    st.write(f"**Prediction**: {result}")

Note: The trained model must be loaded successfully before the app can make predictions.
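One way to satisfy this is to persist the vectorizer and classifier together and reload them in the app. The sketch below uses joblib, which the Challenges section mentions for saving and loading the model; the toy training data, the `spam_model.joblib` file name, and the temporary directory are hypothetical choices for the example.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (1 = spam, 0 = ham); the real project trains on the Kaggle dataset.
texts = ["free prize claim now", "free tickets win now",
         "see you at lunch", "call me after the meeting"]
labels = [1, 1, 0, 0]

# Bundling the vectorizer with the classifier lets predict() accept raw text.
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(texts, labels)

# Hypothetical file name; a temp directory keeps the example self-contained.
model_path = os.path.join(tempfile.mkdtemp(), "spam_model.joblib")
joblib.dump(pipeline, model_path)

# In app.py, only this load step and detect_spam would be needed.
model = joblib.load(model_path)

def detect_spam(email_text):
    prediction = model.predict([email_text])[0]
    return "This is a Ham Email!" if prediction == 0 else "This is a Spam Email!"

print(detect_spam("Free Tickets for IPL"))
```

Saving the whole pipeline rather than the classifier alone avoids a common failure mode where the app loads a model but has no fitted vocabulary to transform incoming text.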

Deployment Instructions

I installed the required packages:

pip install streamlit pandas scikit-learn matplotlib seaborn wordcloud

I saved the Streamlit app code in a file named app.py. I navigated to the directory containing app.py and ran:

streamlit run app.py

I opened http://localhost:8501 in my web browser to access the app.

Challenges and Experiences

During my project design, I encountered several challenges and learning experiences:

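  • Data Imbalance: I noticed the dataset had more ham emails than spam, which initially skewed my model’s performance. I addressed this by experimenting with oversampling techniques and adjusting the class weighting in my MNB model (scikit-learn's MultinomialNB exposes this through class priors or sample weights rather than a class_weight parameter).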
  • Text Preprocessing: I found cleaning and normalizing text data (e.g., removing special characters, handling stopwords) to be critical but time-consuming. I learned to balance preprocessing rigor with computational efficiency.
  • Feature Selection: I had to choose between CountVectorizer and TF-IDF, which required experimentation. While CountVectorizer was effective, exploring TF-IDF improved my understanding of feature importance in text classification.
  • Model Evaluation: I deepened my insights into model performance trade-offs by interpreting metrics like precision, recall, and the ROC curve, especially for imbalanced datasets.
  • Streamlit Deployment: I faced initial deployment issues, such as model serialization errors, which I resolved by using joblib to save and load my trained model efficiently.
  • Learning Curve: I found gaining familiarity with NLP tools and Streamlit’s API challenging but rewarding, as it enabled me to create an interactive and user-friendly application.

These experiences taught me the importance of iterative testing, robust preprocessing, and clear documentation for reproducible results.
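For the data-imbalance point, random oversampling of the minority class is one simple option. The sketch below is an assumption about how it could be done with pandas, not a record of what the project did; the rows and the `Message`/`Spam` column names are invented.

```python
import pandas as pd

# Imbalanced toy data: 8 ham rows, 2 spam rows.
df = pd.DataFrame({
    "Message": [f"ham message {i}" for i in range(8)] + ["spam one", "spam two"],
    "Spam": [0] * 8 + [1] * 2,
})

majority = df[df["Spam"] == 0]
minority = df[df["Spam"] == 1]

# Sample the minority class with replacement until it matches the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=42)

# Combine and shuffle so the classes are interleaved.
balanced = (
    pd.concat([majority, minority_upsampled])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)
print(balanced["Spam"].value_counts())
```

Oversampling should be applied to the training split only; duplicating rows before the train/test split lets copies of the same message land on both sides and inflates test scores.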

Key Takeaways

  • I demonstrated a practical end-to-end pipeline for spam detection, from data preprocessing to web deployment.
  • My Multinomial Naive Bayes model is a simple yet effective choice for text classification, with opportunities for further optimization (e.g., hyperparameter tuning, ensemble methods).
  • I found that Streamlit simplifies the creation of interactive ML applications, making them accessible to non-technical users.
  • I encourage experimenting with my code to explore advanced NLP techniques (e.g., word embeddings) or alternative models (e.g., SVM, deep learning) for improved performance.

License

I licensed this project under the MIT License.

Acknowledgments

  • I thank Kaggle for providing the SMS Spam Collection Dataset.
  • I thank Streamlit for an intuitive web app framework.
  • I thank the open-source community for robust libraries like scikit-learn and pandas.

I welcome contributions! Feel free to submit issues or pull requests to enhance my project.
