# Phishing Website Detector

Machine learning powered phishing website detection using URL & HTML analysis.
Built with Python • scikit-learn • Flask • BeautifulSoup.

A modular, production-ready Python project for detecting phishing websites. URL and HTML features are extracted with BeautifulSoup and fed to a Random Forest model from scikit-learn. Includes data processing, training, prediction (CLI + API), evaluation, testing, Docker support, and CI/CD.
## Features

### Data Processing
- Load URLs/labels from CSV
- Extract 10 features:
  - URL features (5): length, dot count, `@` symbol, HTTPS, IP address
  - HTML features (5): forms, password fields, iframes, links, scripts
- Save processed features as Parquet
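The five URL-level features can be sketched as follows. The function name and the exact IP regex are illustrative assumptions, not necessarily what `src/features.py` implements:

```python
import re
from urllib.parse import urlparse

def url_features(url: str) -> dict:
    """Five URL-level features: length, dot count, '@' symbol, HTTPS, raw IP host."""
    host = urlparse(url).netloc
    return {
        "length": len(url),
        "num_dots": url.count("."),
        "has_at": int("@" in url),
        "is_https": int(url.startswith("https://")),
        # Matches hosts like 192.168.1.5 or 192.168.1.5:8080
        "has_ip": int(bool(re.match(r"^\d{1,3}(\.\d{1,3}){3}(:\d+)?$", host))),
    }
```

The HTML-side features are counted analogously from the parsed page, e.g. `len(soup.find_all("form"))` with BeautifulSoup.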
### Model
- `RandomForestClassifier` trained on the extracted features
- Optional 11th feature: screenshot hash (via Selenium)
### Training & Evaluation
- `train.py`: trains the model (falls back to synthetic data if no CSV is present)
- `evaluate.py`: computes accuracy, precision, recall, F1, ROC-AUC
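The train/evaluate loop can be sketched with scikit-learn. The synthetic data below is only a stand-in for the project's fallback generator; shapes and distributions are assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic fallback: 10 features per URL, phishing rows shifted upward.
X = np.vstack([rng.normal(0.0, 1.0, (500, 10)),   # legit
               rng.normal(1.0, 1.0, (500, 10))])  # phishing
y = np.array([0] * 500 + [1] * 500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

acc = accuracy_score(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

The real `train.py` additionally persists the fitted model with joblib (`models/baseline.joblib`).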
### Prediction
- CLI (`predict.py`) → single-URL prediction
- REST API (`api.py`) → `/scan` endpoint for predictions
### API
- Flask-based, returns JSON:

```json
{
  "url": "https://example.com",
  "score": 0.12,
  "label": "legit",
  "explanation": "Contains HTTPS, no suspicious patterns"
}
```
### Extras
- Logging & configuration via `.env`
- Screenshot stub (optional Selenium support)
- EDA notebook for experiments
### Testing
- Pytest suite for data, features, and API (network calls mocked)
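Network-dependent tests can inject a stub fetcher instead of hitting the web. The helper below is illustrative, not the project's actual test code:

```python
from unittest import mock

def html_feature_count(url: str, fetch) -> int:
    """Count <form> tags in a page; `fetch` is injectable so tests stay offline."""
    html = fetch(url)
    return html.lower().count("<form")

# In a pytest test, the network call is replaced with a Mock:
stub = mock.Mock(return_value="<html><form></form><form></form></html>")
forms = html_feature_count("https://example.com", fetch=stub)
stub.assert_called_once_with("https://example.com")
```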
### Deployment
- Docker & Docker Compose support
- GitHub Actions for CI/CD (lint, train, test)
### Dependencies
- Minimal core (scikit-learn, Flask, BeautifulSoup4)
- Optional: Selenium for screenshots
💡 Uses synthetic data by default. Replace `data/sample_urls.csv` with real labeled data (`url,label`) for production use.
## Installation

```bash
git clone https://github.com/mantrapatil03/phishing-detector phishing-detection
cd phishing-detection
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# cp .env.example .env
# (edit paths/config in .env if needed)
```
If you want to use real URLs, create a CSV:

```csv
url,label
https://example.com,0
http://phishingsite.ru/login,1
https://google.com,0
http://fakebank.com@realbank.com,1
https://paypal.com,0
http://192.168.1.5:8080,1
```

Save it as `data/sample_urls.csv`.
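Loading that CSV needs nothing beyond the standard library; a sketch (the in-memory sample stands in for reading `data/sample_urls.csv` from disk):

```python
import csv
import io

sample = """url,label
https://example.com,0
http://phishingsite.ru/login,1
"""

# csv.DictReader uses the header row as keys, matching the url,label schema.
rows = list(csv.DictReader(io.StringIO(sample)))
urls = [row["url"] for row in rows]
labels = [int(row["label"]) for row in rows]
```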
## Usage

### Train

```bash
python -m src.train
```

Generates:

- `models/baseline.joblib`
- `data/processed.parquet`

Output includes accuracy (≈0.95 on synthetic data).
### Evaluate

```bash
python -m src.evaluate
```

Prints detailed metrics & a classification report.
### Predict (CLI)

```bash
python -m src.predict --url https://example.com
```

Example output:

```
Score: 0.12
Label: legit
Explanation: Contains HTTPS and no suspicious symbols
```

### Run the API

```bash
python -m src.api
```

Test with curl:
```bash
curl -X POST http://localhost:5000/scan \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

Health check:

```bash
curl http://localhost:5000/
```

### Run Tests

```bash
pytest tests/ -v
```

### Notebook

Open `notebooks/EDA_and_experiments.ipynb` and run it to explore:
- Data distribution
- Correlations
- Model tuning experiments
## Configuration

- Logging → configured via `src/logging_config.py` (`LOG_LEVEL` in `.env`)
- Config → paths & secrets via `src/config.py`
- Screenshot feature → optional:
  - Enable by setting `include_screenshot=True` in `src/model.py`
  - Requires Selenium + ChromeDriver (included in the Dockerfile)
  - Retrain the model after enabling
## Extending the Project

- Add new data sources → `data_processing.py`
- Improve the model → `ml_helpers.py`
- Add features → `features.py`
- Tune models → GridSearchCV, XGBoost, LSTM, etc.
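Hyperparameter tuning with GridSearchCV could start from a sketch like this; the grid values and the synthetic data are placeholders, not recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
# Placeholder data with the project's 10-feature shape.
X = np.vstack([rng.normal(0, 1, (100, 10)), rng.normal(1, 1, (100, 10))])
y = np.array([0] * 100 + [1] * 100)

param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)  # best_estimator_ can then replace the baseline model
```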
## Code Style

```bash
black .  # line-length = 88 (pyproject.toml)
```

## Docker
```bash
docker build -t phishing-detection .
docker run -p 5000:5000 phishing-detection
```

- Includes ChromeDriver
- Trains the model during build

### Docker Compose

```bash
docker-compose up
```

- Mounts `./data` and `./models` for persistence
## Production Tips

- Use Gunicorn/Waitress for Flask
- Cache HTML to avoid rate limits
- Use Celery for batch scanning
- Add Prometheus for monitoring
- Implement auth & rate limiting for the API
## CI/CD

GitHub Actions (`.github/workflows/ci.yml`):

- Runs on push/PR to `main`
- Steps:
- Lint (black)
- Install dependencies
- Generate synthetic data
- Train, evaluate, and test model
- Verify model/parquet generation
## Project Structure

```
phishing-detection/
├── README.md
├── VERIFICATION.md
├── SECURITY.md
├── requirements.txt
├── pyproject.toml
├── .gitignore
├── .env.example
├── Dockerfile
├── docker-compose.yml
│
├── data/
│   ├── sample_urls.csv
│   └── processed.parquet
│
├── src/
│   ├── __init__.py
│   ├── config.py
│   ├── logging_config.py
│   ├── utils.py
│   ├── data_processing.py
│   ├── features.py
│   ├── screenshot.py
│   ├── model.py
│   ├── ml_helpers.py
│   ├── train.py
│   ├── evaluate.py
│   ├── predict.py
│   └── api.py
│
├── notebooks/
│   └── EDA_and_experiments.ipynb
│
├── tests/
│   ├── test_features.py
│   ├── test_data_processing.py
│   └── test_api.py
│
├── .github/
│   └── workflows/
│       └── ci.yml
│
├── models/
│   └── baseline.joblib
│
└── LICENSE
```
## Roadmap

| Area | Current | Next Steps |
|---|---|---|
| Data | Synthetic | Use real datasets (e.g., PhishTank, OpenPhish) |
| Features | 10 basic | Add JS, WHOIS, SSL, domain age |
| Model | RandomForest | Try XGBoost, LSTM, BERT |
| Security | No auth/rate-limit | Add JWT, API key, and rate limiting |
| Performance | Synchronous fetch | Switch to async (aiohttp) |
## Contributing

Contributions are welcome!

1. Fork the repo
2. Create your feature branch
3. Commit your changes
4. Run tests & lint locally
5. Open a Pull Request
## Credits

- Author: Mantra Patil
- Maintained by Shree Organization
- Built by CodeM03

If you find this project useful, please ⭐ star this repository and share it with others!

Built with ❤️ by the CodeM03 Company. Stay safe online 🕵️♂️
