🧼 SSOSM: Spam Filter

Author: Andrei Cristian
Version: 0.1

📌 Overview

SSOSM (Simple Spam Or Sanitized Mail) is a Python-based spam filtering tool designed to classify text or file content using a Naive Bayes classifier trained on manually labeled data. It can detect spam from raw text files and HTML emails using a lightweight and customizable dataset.

This project is ideal for security research, email filtering prototypes, or educational purposes on NLP and machine learning fundamentals.
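
For readers using the project to learn the fundamentals, here is a minimal from-scratch sketch of multinomial Naive Bayes with add-one smoothing. It is not the code in main.py (which, per the Features section, relies on a fitted CountVectorizer and a pickled model); the class name `TinyMultinomialNB`, the `tokenize` helper, the label strings, and the toy training sentences are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        # Number of training documents per class (used for the class prior).
        self.class_counts = Counter(labels)
        # Per-class word frequencies (used for the word likelihoods).
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def log_scores(self, text):
        """Return the unnormalized log-posterior for every class."""
        total_docs = sum(self.class_counts.values())
        tokens = [w for w in tokenize(text) if w in self.vocab]  # skip unseen words
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokens:
                score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy data, only to exercise the sketch.
train_texts = [
    "win a free prize now",
    "claim your free gift",
    "meeting moved to monday",
    "please review the attached report",
]
train_labels = ["spam", "spam", "clean", "clean"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("free prize inside"))
print(model.predict("please review the report monday"))
```

A library classifier is the better choice for real use; the sketch only shows where the class prior and the smoothed word likelihoods enter the decision.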

📁 Project Structure

.
├── main.py              # Main script containing training and scanning logic
├── requirements.txt     # Dependencies
└── .python-version      # Python version (3.7.12)

🛠 Features

✅ Detects spam content using Multinomial Naive Bayes
✅ Supports plain text and HTML content (auto-strips HTML)
✅ Trains on a CSV dataset (dataset.csv) generated from labeled folders
✅ Scans directories for potential spam files
✅ Outputs results to a log file
✅ Saves and reuses trained model and vectorizer with pickle
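
The HTML-stripping feature listed above can be approximated with the standard library alone. The following is a sketch under assumptions: how main.py actually strips HTML is not documented here, and the names `strip_html` and `_TextExtractor` are hypothetical. It keeps visible text and drops the contents of script and style elements.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only keep text that is not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse all runs of whitespace into single spaces.
        return " ".join(" ".join(self._chunks).split())


def strip_html(markup):
    """Return the visible text of an HTML (or plain text) string."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


sample = "<html><body><h1>Free gift</h1><script>alert('x')</script><p>Click &amp; win</p></body></html>"
print(strip_html(sample))
```

Plain text passes through unchanged apart from whitespace normalization, so the same function can be applied to both input types before vectorizing.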

🧪 Requirements

Python 3.7.12
Install dependencies with:

pip install -r requirements.txt

⚙️ Usage

1. Print Project Info

python main.py -info info.txt

Writes project metadata to the given file (info.txt in this example).

2. Scan a Folder for Spam

python main.py -scan <directory_path> <output_file>

Scans all files in the specified directory, classifies each one, and writes one result per line to the output file. Each line has one of these forms:

<filename>|cln  # Clean
<filename>|inf  # Infected (Spam)
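
The scan step and its output format can be sketched as follows. This is an assumption-laden illustration, not the logic in main.py: the function name `scan_directory` and the `is_spam` callback are hypothetical, and the sketch does not recurse into subdirectories because the original behavior in that respect is not stated.

```python
import os


def scan_directory(directory, output_file, is_spam):
    """Classify each regular file in `directory` and log `<filename>|cln` or `<filename>|inf`.

    `is_spam` is a caller-supplied function that takes the file's text
    and returns True for spam (a stand-in for the trained classifier).
    """
    with open(output_file, "w", encoding="utf-8") as out:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if not os.path.isfile(path):
                continue  # skip subdirectories and other non-files
            with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                content = handle.read()
            verdict = "inf" if is_spam(content) else "cln"
            out.write("{}|{}\n".format(name, verdict))
```

With a placeholder rule such as `lambda text: "free" in text.lower()`, a call like `scan_directory("mail", "results.log", rule)` writes one verdict line per file. Keeping the output file outside the scanned directory avoids classifying the log itself.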

3. (Optional) Train Model from Labeled Dataset

Note: Training logic is currently commented out in main.py. You can enable it manually for re-training.

Directory structure for training:

<root>
├── Lot1/
│   ├── Clean/
│   └── Spam/
└── Lot2/
    ├── Clean/
    └── Spam/

Each Clean/ and Spam/ subfolder should contain text files. After uncommenting the train() function in main.py and adjusting the logic, training can be wired to a command such as:

# Enable and modify in main.py
# python main.py -train <path_to_data_root>

Training generates a new dataset.csv, fits the classifier, and saves:

naive_bayes_clf.pkl – the trained model
naive_bayes_cv.pkl – the fitted CountVectorizer
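
The dataset-generation step could look roughly like the sketch below, which walks the Lot*/Clean and Lot*/Spam layout shown earlier and writes one CSV row per file. The column names (`text`, `label`), the numeric label encoding, and the function name `build_dataset` are assumptions; the actual layout of dataset.csv is not specified in this README.

```python
import csv
import os

# Assumed encoding: 0 for clean mail, 1 for spam.
LABELS = {"Clean": 0, "Spam": 1}


def build_dataset(root, csv_path):
    """Write a (text, label) CSV from <root>/<lot>/Clean and <root>/<lot>/Spam.

    Returns the number of data rows written.
    """
    rows = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["text", "label"])
        for lot in sorted(os.listdir(root)):
            for folder, label in LABELS.items():
                class_dir = os.path.join(root, lot, folder)
                if not os.path.isdir(class_dir):
                    continue  # lot is missing this class, or is not a folder
                for name in sorted(os.listdir(class_dir)):
                    path = os.path.join(class_dir, name)
                    if not os.path.isfile(path):
                        continue
                    with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                        writer.writerow([handle.read(), label])
                    rows += 1
    return rows
```

The csv module quotes embedded newlines and commas, so multi-line email bodies survive the round trip when the file is read back with a CSV reader.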

🔒 Security Notes

Only plain text and HTML content is processed; HTML markup, including any embedded scripts, is stripped before classification.
Binary files are ignored.
This tool is not a full antivirus scanner; it is intended for text-based spam detection only.
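
One common way to decide that a file is binary, and therefore should be skipped, is to inspect its first bytes. The heuristic below is an assumption, not the check used in main.py, and `looks_binary` is a hypothetical name: a file is treated as binary if the sample contains a NUL byte or is not valid UTF-8.

```python
import codecs


def looks_binary(path, sample_size=1024):
    """Heuristically decide whether a file is binary by sampling its first bytes."""
    with open(path, "rb") as handle:
        sample = handle.read(sample_size)
    if b"\x00" in sample:
        return True  # NUL bytes almost never appear in text files
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False tolerates a multi-byte character cut off at the sample edge.
        decoder.decode(sample, final=False)
    except UnicodeDecodeError:
        return True
    return False
```

Text in legacy encodings such as Latin-1 may be flagged as binary by this check, so a real scanner might fall back to a more permissive decoder before giving up on a file.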

📄 Output Example

offer.txt|inf
newsletter.html|cln
free-gift.msg|inf
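
Downstream tooling can read this log back with a few lines of code. The sketch below assumes the `<filename>|cln` / `<filename>|inf` format shown above; the function name `parse_results` is hypothetical.

```python
def parse_results(path):
    """Parse a scan log into a dict mapping filename -> True if flagged as spam."""
    results = {}
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            # Split on the last "|" so filenames containing "|" still parse.
            name, _, verdict = line.rpartition("|")
            if verdict not in ("cln", "inf"):
                raise ValueError("unexpected log line: {!r}".format(line))
            results[name] = verdict == "inf"
    return results
```

Applied to the example log above, the result maps offer.txt and free-gift.msg to True and newsletter.html to False.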

🧹 Future Improvements

✅ Enable training via CLI argument
📊 Add confusion matrix and classification reports
🤖 Extend to detect phishing keywords
🗂 Add support for .eml email formats
🌐 Web dashboard (Flask/FastAPI)

📜 License

This project is currently not licensed. Contact the author for reuse or contributions.

ac999/python-spam-filter
