An end-to-end Machine Learning + NLP project that classifies messages as Spam or Ham. It covers data preprocessing, Bag of Words vectorization, training and comparing multiple ML models, ROC and confusion matrix plots, and Streamlit deployment.
- Classifies SMS/Email messages as spam or ham.
- Preprocessing includes text cleaning, stopword removal, and tokenization.
- Converts text into numerical features using Bag of Words vectorization.
- Trains multiple machine learning models and evaluates them using cross-validation.
- Selects the best-performing model automatically.
- Provides visualizations: Confusion Matrix, ROC Curve.
- Deployable via Streamlit for interactive prediction.
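
The preprocessing and Bag of Words steps listed above can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual code: the function names (`clean_text`, `build_vocabulary`, `vectorize`) and the tiny stopword list are assumptions made for illustration, and the real pipeline may well rely on a library vectorizer instead.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the project's list).
STOPWORDS = {"a", "an", "the", "is", "are", "to", "you", "for", "and", "of", "now"}


def clean_text(message):
    """Lowercase, replace non-letters with spaces, tokenize, drop stopwords."""
    lowered = message.lower()
    letters_only = re.sub(r"[^a-z\s]", " ", lowered)
    return [tok for tok in letters_only.split() if tok not in STOPWORDS]


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in alphabetical order."""
    distinct = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: idx for idx, tok in enumerate(distinct)}


def vectorize(tokens, vocabulary):
    """Return a count vector: one entry per vocabulary word."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokens).items():
        if tok in vocabulary:  # words unseen during fitting are ignored
            vector[vocabulary[tok]] = count
    return vector


messages = ["WIN a FREE prize now!!!", "Are you free for lunch?"]
token_lists = [clean_text(m) for m in messages]
vocabulary = build_vocabulary(token_lists)
vectors = [vectorize(tokens, vocabulary) for tokens in token_lists]
```

Each message becomes a fixed-length list of word counts, which is the numerical form a classifier can be trained on.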
- Load Dataset → preprocessing/load_data.py
- Clean Text → preprocessing/clean_text.py
- Vectorization (BOW) → preprocessing/vectorize.py
- Train Multiple Models → preprocessing/train.py
- Evaluate Models → preprocessing/evaluate.py
- Select Best Model & Save → models/best_spam_model.pkl + models/vectorizer.pkl
- Visualize Metrics → Confusion Matrix, ROC
- Deploy with Streamlit → app.py
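
As a rough sketch of the vectorize, train, evaluate, and select steps in this workflow, the snippet below compares two candidate classifiers by cross-validated accuracy with scikit-learn and keeps the higher-scoring one. The toy messages are invented for illustration. Wrapping the vectorizer and classifier in one pipeline is an assumption of this sketch rather than the project's confirmed design, since the repository saves the vectorizer separately.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only (1 = spam, 0 = ham).
texts = [
    "win a free prize today",
    "free cash offer claim your prize",
    "urgent claim your free reward",
    "cheap loans win cash today",
    "are we still meeting for lunch",
    "see you at the office tomorrow",
    "can you call me after the meeting",
    "lunch tomorrow at noon sounds good",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Candidate models; Gradient Boosting and AdaBoost could be added the same way.
candidates = {
    "Multinomial Naive Bayes": MultinomialNB(),
    # n_neighbors=1 because each training fold here holds only 4 messages.
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=1),
}

cv_scores = {}
for name, clf in candidates.items():
    # The vectorizer is refit inside every fold, so no test-fold text leaks in.
    pipe = make_pipeline(CountVectorizer(), clf)
    scores = cross_val_score(pipe, texts, labels, cv=2, scoring="accuracy")
    cv_scores[name] = scores.mean()

# Pick the model with the highest mean cross-validated accuracy and refit it
# on all of the data.
best_name = max(cv_scores, key=cv_scores.get)
best_model = make_pipeline(CountVectorizer(), candidates[best_name]).fit(texts, labels)
prediction = best_model.predict(["claim your free prize"])
```

With a dataset this small the scores themselves mean little. The point is only the shape of the loop: score every candidate the same way, then take the maximum.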
| Model | Cross-Validated Accuracy |
|---|---|
| Multinomial Naive Bayes | 0.9758 |
| K-Nearest Neighbors | 0.9026 |
| Gradient Boosting | 0.9589 |
| AdaBoost | 0.9147 |
Best model: Multinomial Naive Bayes
Saved as: models/best_spam_model.pkl
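
Saving and reloading the artifacts with `pickle` might look roughly like the sketch below. The helper names `save_artifacts` and `load_artifacts` are hypothetical, and plain dictionaries stand in for the fitted classifier and vectorizer so the example runs without any trained model. A temporary directory is used as the root so nothing is written into the repository.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifacts(model, vectorizer, root):
    """Pickle the model and vectorizer under <root>/models/."""
    models_dir = Path(root) / "models"
    models_dir.mkdir(parents=True, exist_ok=True)
    artifacts = (("best_spam_model.pkl", model), ("vectorizer.pkl", vectorizer))
    for filename, obj in artifacts:
        with open(models_dir / filename, "wb") as fh:
            pickle.dump(obj, fh)
    return models_dir


def load_artifacts(root):
    """Load the pickled model and vectorizer back from <root>/models/."""
    models_dir = Path(root) / "models"
    with open(models_dir / "best_spam_model.pkl", "rb") as fh:
        model = pickle.load(fh)
    with open(models_dir / "vectorizer.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    return model, vectorizer


# Stand-ins for the fitted objects (assumption: any picklable object works
# the same way, including fitted scikit-learn estimators).
stand_in_model = {"kind": "placeholder-model"}
stand_in_vectorizer = {"vocabulary": {"free": 0, "prize": 1}}

with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(stand_in_model, stand_in_vectorizer, tmp)
    loaded_model, loaded_vectorizer = load_artifacts(tmp)
```

Both files are needed at prediction time, because new messages must be transformed with the same vocabulary the model was trained on. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.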
Metrics calculated on the test data:
- Accuracy: 0.9749
- Precision: 0.8961
- Recall: 0.9200
- F1 Score: 0.9079
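
For reference, the sketch below shows how these four metrics follow from confusion-matrix counts, and checks that the reported F1 score is the harmonic mean of the reported precision and recall. The counts passed to `classification_metrics` are hypothetical and chosen only to keep the arithmetic easy to follow; the project's real confusion matrix is not given here.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # share of predicted spam that really is spam
    recall = tp / (tp + fn)     # share of real spam that was caught
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, for illustration only.
example = classification_metrics(tp=3, fp=1, fn=1, tn=5)

# Consistency check on the reported values: F1 = 2PR / (P + R).
reported_precision = 0.8961
reported_recall = 0.9200
f1_from_reported = round(
    2 * reported_precision * reported_recall / (reported_precision + reported_recall), 4
)
```

Here `f1_from_reported` evaluates to 0.9079, matching the F1 score listed above.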
Visualizations saved in reports/plots/:
- Confusion Matrix
- ROC Curve
# Clone the repo
git clone https://github.com/roshan-acharya/SpamClassifier
cd SpamClassifier
# Create virtual environment (optional)
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
# Install dependencies
pip install -r requirements.txt
- Run Training Pipeline: python pipeline/pipeline.py
- Run Streamlit App: streamlit run app.py
- Python
- Pandas
- NumPy
- Scikit-learn
- Matplotlib, Seaborn
- Streamlit
- Pickle (for saving models)
Roshan Acharya
AI/ML Enthusiast

