Spam emails are a persistent nuisance, but I found that machine learning offers a robust solution to filter them automatically. In this project, I built a spam detection model using Python and deployed it as an interactive web application with Streamlit.
- Project Overview
- Model Description
- Code Breakdown
- Example Usage
- Deployment Instructions
- Challenges and Experiences
- Key Takeaways
- License
- Acknowledgments
## Project Overview

- Dataset: SMS Spam Collection Dataset from Kaggle
- Packages:
  - `pandas`: data manipulation and analysis
  - `scikit-learn`: machine learning algorithms and utilities
  - `matplotlib`/`seaborn`: data visualization
  - `wordcloud`: word cloud generation for text analysis
  - `streamlit`: web app deployment
- Process:
- I preprocessed data with NLP techniques
- I performed text classification using Multinomial Naive Bayes (MNB)
- I evaluated the model with metrics and visualizations
- I deployed it as a web app using Streamlit
## Model Description

I used a Multinomial Naive Bayes (MNB) classifier, which is well suited to text classification tasks. An MNB model learns word frequencies from labeled examples and uses them to estimate whether a message is spam or ham (non-spam).
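To make that concrete, here is a minimal, self-contained sketch of the idea; the toy messages and labels below are illustrative, not drawn from my dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled data: 1 = spam, 0 = ham (illustrative only).
messages = [
    "Free tickets, claim your prize now",
    "Win cash instantly, click here",
    "Are we still meeting for lunch today?",
    "Can you send me the report by Friday?",
]
labels = [1, 1, 0, 0]

# CountVectorizer turns each message into a vector of word counts;
# MultinomialNB learns how often each word appears in spam vs. ham.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(messages)

model = MultinomialNB()
model.fit(X, labels)

print(model.predict(vectorizer.transform(["Free prize inside"])))  # -> [1] (spam)
```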
## Code Breakdown

- I imported libraries (`pandas`, `scikit-learn`, `matplotlib`, `seaborn`, `wordcloud`)
- I loaded the spam dataset using `pandas`
- I explored dataset characteristics (shape, statistics, missing values, duplicates)
- I visualized the spam/ham distribution with plots (e.g., bar charts)
- I created a binary target variable ("Spam") from the "Category" column
- I generated a WordCloud to visualize frequently used words in spam emails
- I cleaned the text data (e.g., removed punctuation, converted to lowercase)
- I used `CountVectorizer` to transform text messages into numerical features based on word frequency
- I explored alternative vectorization techniques (e.g., TF-IDF) for experimentation
- I split the data into training (80%) and testing (20%) sets (the first sketch after this list condenses the pipeline up to this point)
- I implemented an `evaluate_model` function (outlined in the second sketch after this list) that:
  - Trains my MNB model on the training data
  - Predicts labels for the training and testing sets
  - Generates evaluation metrics (accuracy, precision, recall, F1-score, confusion matrix, ROC curve)
- I created a `detect_spam` function that:
  - Accepts an email message as input
  - Uses my trained MNB model to classify it as spam or ham
- I built a Streamlit app with a user-friendly interface
- I included a text input box for users to enter email messages
- I displayed the spam/ham prediction upon submission
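Condensed, the data-preparation half of that pipeline looks roughly like this. The CSV file name and the assumption that the text column is named "Message" are mine; the raw Kaggle file may use different column names, so adjust to your copy:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Load the dataset (file name is illustrative) and drop duplicate rows.
df = pd.read_csv("spam.csv")
df = df.drop_duplicates()

# Binary target from the "Category" column: 1 = spam, 0 = ham.
df["Spam"] = (df["Category"] == "spam").astype(int)

# Basic cleaning: lowercase and strip punctuation.
df["Message"] = df["Message"].str.lower().str.replace(r"[^\w\s]", "", regex=True)

# Word-frequency features and an 80/20 train/test split.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["Message"])
X_train, X_test, y_train, y_test = train_test_split(
    X, df["Spam"], test_size=0.2, random_state=42
)

model = MultinomialNB()
```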
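And the `evaluate_model` function in outline; the exact signature and reporting in my notebook differ, so treat this as a sketch:

```python
from sklearn.metrics import (
    RocCurveDisplay,
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

def evaluate_model(model, X_train, X_test, y_train, y_test):
    # Train on the training split.
    model.fit(X_train, y_train)

    # Predict on both splits to check for over/underfitting.
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)

    # Report the metrics listed above.
    print("Train accuracy:", accuracy_score(y_train, train_pred))
    print("Test accuracy:", accuracy_score(y_test, test_pred))
    print("Precision:", precision_score(y_test, test_pred))
    print("Recall:", recall_score(y_test, test_pred))
    print("F1-score:", f1_score(y_test, test_pred))
    print("Confusion matrix:\n", confusion_matrix(y_test, test_pred))

    # ROC curve (renders via matplotlib).
    RocCurveDisplay.from_estimator(model, X_test, y_test)
```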
## Example Usage

```python
sample_email = "Free Tickets for IPL"
result = detect_spam(sample_email)
print(result)  # Output: "This is a Spam Email!"
```
The Streamlit app:

```python
import joblib
import streamlit as st

# Load my trained model. I saved the fitted vectorizer and MNB classifier
# together as a scikit-learn Pipeline with joblib, so the model can predict
# on raw text directly (the file name here is illustrative).
model = joblib.load("spam_model.joblib")

def detect_spam(email_text):
    prediction = model.predict([email_text])[0]
    return "This is a Ham Email!" if prediction == 0 else "This is a Spam Email!"

st.title("Spam Email Detection")
email_text = st.text_input("Enter an email message:", placeholder="Type your email here...")
if email_text:
    result = detect_spam(email_text)
    st.write(f"**Prediction**: {result}")
```
> **Note**: Make sure the trained model file is in place before running the app.
## Deployment Instructions

I installed the required packages:
```bash
pip install streamlit pandas scikit-learn matplotlib seaborn wordcloud
```
I saved the Streamlit app code in a file named `app.py`. I navigated to the directory containing `app.py` and ran:
```bash
streamlit run app.py
```
I opened http://localhost:8501 in my web browser to access the app.
## Challenges and Experiences

While designing this project, I encountered several challenges and learning experiences:
- Data Imbalance: I noticed the dataset had far more ham emails than spam, which initially skewed my model’s performance. I addressed this by experimenting with oversampling and by adjusting the class priors of my MNB model (scikit-learn’s `MultinomialNB` exposes priors rather than class weights); see the oversampling sketch after this list.
- Text Preprocessing: I found cleaning and normalizing text data (e.g., removing special characters, handling stopwords) to be critical but time-consuming. I learned to balance preprocessing rigor with computational efficiency.
- Feature Selection: I had to choose between `CountVectorizer` and TF-IDF, which required experimentation. While `CountVectorizer` was effective, exploring TF-IDF improved my understanding of feature weighting in text classification; see the TF-IDF sketch after this list.
- Model Evaluation: I deepened my insight into model performance trade-offs by interpreting metrics like precision, recall, and the ROC curve, especially for imbalanced datasets.
- Streamlit Deployment: I faced initial deployment issues, such as model serialization errors, which I resolved by using `joblib` to save and load my trained model; see the serialization sketch after this list.
- Learning Curve: I found gaining familiarity with NLP tools and Streamlit’s API challenging but rewarding, as it enabled me to build an interactive, user-friendly application.
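For the imbalance point, here is a minimal sketch of the kind of oversampling I mean, using `sklearn.utils.resample`; the `df` DataFrame and `Spam` column follow the code breakdown above:

```python
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import resample

# Upsample the minority (spam) class to match the majority (ham) class.
# In practice, apply this to the training split only, to avoid leakage.
ham = df[df["Spam"] == 0]
spam = df[df["Spam"] == 1]
spam_upsampled = resample(spam, replace=True, n_samples=len(ham), random_state=42)
df_balanced = pd.concat([ham, spam_upsampled]).sample(frac=1, random_state=42)

# Alternative: keep the data as-is but fix equal class priors in MNB.
model = MultinomialNB(class_prior=[0.5, 0.5])
```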
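Swapping in TF-IDF is nearly a one-line change in scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Drop-in replacement for CountVectorizer: downweights words that are
# common across all messages instead of using raw counts.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(df["Message"])
```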
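And the `joblib` fix, sketched under the assumption that the vectorizer and classifier are bundled in a scikit-learn `Pipeline` so the saved model can predict on raw text (the file name matches the app code above but is illustrative):

```python
import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Bundle the vectorizer and classifier so a single object handles raw text.
pipeline = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
pipeline.fit(df["Message"], df["Spam"])

joblib.dump(pipeline, "spam_model.joblib")  # save once after training
model = joblib.load("spam_model.joblib")    # load inside the Streamlit app
```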
These experiences taught me the importance of iterative testing, robust preprocessing, and clear documentation for reproducible results.
## Key Takeaways

- I demonstrated a practical end-to-end pipeline for spam detection, from data preprocessing to web deployment.
- My Multinomial Naive Bayes model is a simple yet effective choice for text classification, with opportunities for further optimization (e.g., hyperparameter tuning, ensemble methods).
- I found that Streamlit simplifies the creation of interactive ML applications, making them accessible to non-technical users.
- I encourage experimenting with my code to explore advanced NLP techniques (e.g., word embeddings) or alternative models (e.g., SVM, deep learning) for improved performance.
## License

This project is licensed under the MIT License.
## Acknowledgments

- I thank Kaggle for providing the SMS Spam Collection Dataset.
- I thank Streamlit for an intuitive web app framework.
- I thank the open-source community for robust libraries like `scikit-learn` and `pandas`.
I welcome contributions! Feel free to submit issues or pull requests to enhance the project.