Author: Andrei Cristian
Version: 0.1
SSOSM (Simple Spam Or Sanitized Mail) is a Python-based spam filtering tool designed to classify text or file content using a Naive Bayes classifier trained on manually labeled data. It can detect spam from raw text files and HTML emails using a lightweight and customizable dataset.
This project is ideal for security research, email filtering prototypes, or educational purposes on NLP and machine learning fundamentals.
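To make the classification idea concrete, here is a minimal sketch of multinomial Naive Bayes written with only the standard library. It is not the code in main.py: the project's mention of CountVectorizer suggests it relies on scikit-learn, whereas the class TinyMultinomialNB, the tokenize helper, and the toy training sentences below are hypothetical and exist only to show the arithmetic (word counts per class, add-one smoothing, log-probability comparison). The cln and inf labels are borrowed from the scanner's output format.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        # The vocabulary is every word seen in any training document.
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for c in self.classes:
            # Start from the log prior of the class.
            score = math.log(self.doc_counts[c] / total_docs)
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            for tok in tokenize(text):
                if tok in self.vocab:  # words never seen in training are skipped
                    score += math.log((self.word_counts[c][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = c, score
        return best_label


# Toy, hand-labeled data (hypothetical; not the project's dataset).
train_texts = [
    "win a free gift now",
    "free money claim your prize",
    "meeting agenda for monday",
    "please review the attached report",
]
train_labels = ["inf", "inf", "cln", "cln"]

clf = TinyMultinomialNB().fit(train_texts, train_labels)
print(clf.predict("claim your free prize"))
print(clf.predict("agenda for the monday meeting"))
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together, and the smoothing term keeps a single unseen word from forcing a class probability to zero.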
.
├── main.py # Main script containing training and scanning logic
├── requirements.txt # Dependencies
├── .python-version # Python version (3.7.12)
- ✅ Detects spam content using Multinomial Naive Bayes
- ✅ Supports plain text and HTML content (auto-strips HTML)
- ✅ Trains on a CSV dataset (dataset.csv) generated from labeled folders
- ✅ Scans directories for potential spam files
- ✅ Outputs results to a log file
- ✅ Saves and reuses trained model and vectorizer with pickle
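The save-and-reuse behavior from the last feature can be sketched as follows. This is an assumption about the general pattern, not a copy of main.py: the helper names save_artifact, load_artifact, and load_or_build are hypothetical, and a plain dictionary stands in for the trained model so the example runs without any third-party packages.

```python
import pickle
import tempfile
from pathlib import Path


def save_artifact(obj, path):
    """Serialize any picklable object (model, vectorizer, ...) to disk."""
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_artifact(path):
    """Load a previously pickled object. Only use this on files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


def load_or_build(path, build):
    """Reuse the pickled object if it exists, otherwise build and save it."""
    path = Path(path)
    if path.exists():
        return load_artifact(path)
    obj = build()
    save_artifact(obj, path)
    return obj


# Demo in a temporary directory; the dict is a stand-in for a trained model.
build_calls = []


def build_model():
    build_calls.append(1)
    return {"vocabulary": ["free", "prize"]}


with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "naive_bayes_clf.pkl"
    first = load_or_build(model_path, build_model)   # builds and saves
    second = load_or_build(model_path, build_model)  # loads from disk

print(first == second, len(build_calls))
```

Unpickling can execute arbitrary code, so model files should only be loaded from sources you trust.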
Python 3.7.12
Install dependencies with:
pip install -r requirements.txt
python main.py -info info.txt
Writes project metadata to the file info.txt.
python main.py -scan <directory_path> <output_file>
Scans all files in the specified directory, classifies them, and writes results to the output file. Each line contains:
<filename>|cln # Clean
<filename>|inf # Infected (Spam)
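A scan loop that produces this output format might look like the sketch below. Several details are assumptions rather than facts about main.py: the function name scan_directory, the classify callback (any function returning True for spam), the non-recursive directory listing, and the keyword-based stand-in classifier used in the demo.

```python
import os
import tempfile


def scan_directory(directory, output_file, classify):
    """Classify each regular file in `directory` and write `<filename>|<verdict>` lines.

    `classify` is a hypothetical callable that takes text and returns True for spam.
    """
    results = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # this sketch does not descend into subdirectories
        with open(path, "r", encoding="utf-8", errors="ignore") as fh:
            text = fh.read()
        verdict = "inf" if classify(text) else "cln"
        results.append(f"{name}|{verdict}")
    with open(output_file, "w", encoding="utf-8") as out:
        out.write("\n".join(results) + ("\n" if results else ""))
    return results


def keyword_classifier(text):
    """Stand-in for the trained model: flags one hard-coded phrase."""
    return "free gift" in text.lower()


# Demo on throwaway files; the report is written outside the scanned folder.
with tempfile.TemporaryDirectory() as tmp:
    mail_dir = os.path.join(tmp, "mail")
    os.mkdir(mail_dir)
    with open(os.path.join(mail_dir, "offer.txt"), "w", encoding="utf-8") as fh:
        fh.write("Claim your FREE GIFT today")
    with open(os.path.join(mail_dir, "newsletter.html"), "w", encoding="utf-8") as fh:
        fh.write("<p>Monthly project update</p>")
    report = os.path.join(tmp, "report.log")
    lines = scan_directory(mail_dir, report, keyword_classifier)
    with open(report, encoding="utf-8") as fh:
        written = fh.read()

print(lines)
```

Passing the classifier in as a callback keeps the file handling separate from the model, so the same loop works with a real trained classifier or a test stub.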
Note: Training logic is currently commented out in main.py. You can enable it manually for re-training.
Directory structure for training:
<root>
├── Lot1/
│ ├── Clean/
│ └── Spam/
└── Lot2/
  ├── Clean/
  └── Spam/
Each subfolder should contain text files. After uncommenting the train() function and adjusting the logic:
# Enable and modify in main.py
# python main.py -train <path_to_data_root>
It generates a new dataset.csv, trains the classifier, and saves:
- naive_bayes_clf.pkl – the trained model
- naive_bayes_cv.pkl – the fitted CountVectorizer
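The dataset-generation step can be sketched as a walk over the Lot*/Clean and Lot*/Spam folders that writes one CSV row per file. The function name build_dataset, the column names text and label, and the 0/1 label encoding are assumptions made for illustration; the actual layout of dataset.csv in this project may differ.

```python
import csv
import tempfile
from pathlib import Path


def build_dataset(root, csv_path):
    """Collect files under <root>/<lot>/Clean and <root>/<lot>/Spam into a CSV.

    Assumed columns: text, label (0 = clean, 1 = spam).
    """
    rows = []
    root = Path(root)
    for lot in sorted(p for p in root.iterdir() if p.is_dir()):
        for folder, label in (("Clean", 0), ("Spam", 1)):
            class_dir = lot / folder
            if not class_dir.is_dir():
                continue
            for file in sorted(class_dir.iterdir()):
                if file.is_file():
                    text = file.read_text(encoding="utf-8", errors="ignore")
                    # Collapse whitespace so each sample stays on one CSV line.
                    rows.append((" ".join(text.split()), label))
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["text", "label"])
        writer.writerows(rows)
    return len(rows)


# Demo with a tiny throwaway tree (sample contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    samples = {
        "Lot1/Clean/a.txt": "Meeting moved to Monday",
        "Lot1/Spam/b.txt": "Win a free prize",
        "Lot2/Spam/c.txt": "Cheap loans,\napply now",
    }
    for rel, body in samples.items():
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(body, encoding="utf-8")
    csv_file = Path(tmp) / "dataset.csv"
    n_rows = build_dataset(root, csv_file)
    with open(csv_file, newline="", encoding="utf-8") as fh:
        table = list(csv.reader(fh))

print(n_rows, table[0])
```

Using the csv module rather than manual string joins means commas and quotes inside an email body are escaped correctly.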
- Only plain text and HTML are processed (malicious scripts in HTML are stripped).
- Binary files are ignored.
- This tool is not a full antivirus scanner — it is intended for text-based spam detection.
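The first two notes describe preprocessing that can be approximated with the standard library alone. In the sketch below, TextExtractor, html_to_text, and looks_binary are hypothetical names, and the NUL-byte check is only a common heuristic for spotting binary data; how main.py actually strips HTML or skips binary files is not shown in this README.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(markup):
    """Return the visible text of an HTML string with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def looks_binary(raw, sample_size=1024):
    """Heuristic: data with a NUL byte in its first bytes is treated as binary."""
    return b"\x00" in raw[:sample_size]


sample = (
    "<html><body><h1>Free gift</h1>"
    "<script>alert('x')</script>"
    "<p>Click &amp; win</p></body></html>"
)
print(html_to_text(sample))
print(looks_binary(b"\x89PNG\r\n\x1a\n\x00\x00"), looks_binary(b"plain text"))
```

Only the extracted text would then be passed to the vectorizer, so markup and script bodies do not pollute the word counts.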
offer.txt|inf
newsletter.html|cln
free-gift.msg|inf
- ✅ Enable training via CLI argument
- 📊 Add confusion matrix and classification reports
- 🤖 Extend to detect phishing keywords
- 🗂 Add support for .eml email formats
- 🌐 Web dashboard (Flask/FastAPI)
This project is currently not licensed. Contact the author for reuse or contributions.