This project focuses on detecting phishing websites by analyzing URLs using machine learning.
The workflow includes dataset cleaning, feature engineering, training multiple classification models, selecting the best-performing one, and allowing real-time predictions for any user-entered URL.
- Clean and prepare the provided phishing dataset.
- Extract important numerical features from the URL dataset.
- Train three ML models:
  - Logistic Regression
  - Random Forest
  - XGBoost
- Evaluate each model using accuracy, precision, recall, and F1-score.
- Automatically select and save the best model.
- Build a hybrid URL feature extractor to analyze new URLs.
- Predict whether a given URL is Phishing or Legitimate.
- File used: provided_dataset.csv, which contains:
  - URL-based features
  - Host/lexical properties
  - Labels indicating phishing or legitimate
- The column status is converted into a binary label: 1 = Phishing, 0 = Legitimate
- Python
- Pandas, NumPy: data handling
- Matplotlib, Seaborn: basic analysis/visualization
- Scikit-learn: ML models, preprocessing, evaluation
- XGBoost: gradient boosting classifier
- WHOIS, socket, requests: real-time URL feature extraction
- Joblib: saving the trained model and feature columns
The dataset goes through the following cleaning steps:
- Removing constant and high-missing columns
- Converting object-type numeric values to actual numbers
- Dropping columns that cannot be converted
- Filling NaN and inf values with column medians
- Preparing feature matrix X and label vector y
- Splitting into 70% train and 30% test
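
The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function name `clean_dataset`, the 50% missing-value threshold, and the assumption that the status column holds the strings "phishing" and "legitimate" are all hypothetical.

```python
import numpy as np
import pandas as pd


def clean_dataset(df, label_col="status", max_missing=0.5):
    """Return (X, y) after the cleaning steps listed above (illustrative sketch)."""
    df = df.copy()

    # Binary label: 1 = phishing, 0 = legitimate (assumes string labels).
    y = (df.pop(label_col).astype(str).str.lower() == "phishing").astype(int)

    # Convert non-numeric columns to numbers; unparseable cells become NaN.
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], errors="coerce")

    # Treat +/-inf as missing so the median fill covers them too.
    df = df.replace([np.inf, -np.inf], np.nan)

    # Drop constant columns and columns that are mostly missing. Columns that
    # could not be converted are all NaN at this point, so they go too.
    keep = [
        c for c in df.columns
        if df[c].isna().mean() <= max_missing and df[c].nunique(dropna=True) > 1
    ]
    X = df[keep]

    # Fill the remaining gaps with each column's median.
    X = X.fillna(X.median())
    return X, y
```

The resulting X and y can then go through a 70/30 split, for example with scikit-learn's `train_test_split(X, y, test_size=0.3)`.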
The project trains and evaluates the following classifiers:
- Logistic Regression
- Random Forest Classifier
- XGBoost Classifier
Each model is evaluated using:
- Accuracy
- Precision
- Recall
- F1-score
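
For reference, the four metrics reduce to simple counts over the test predictions. The sketch below computes them in plain Python, treating 1 (Phishing) as the positive class; the function name `classification_metrics` is hypothetical, and in practice scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` give the same quantities.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # phishing caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # legitimate passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # phishing missed

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```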
The model with the highest F1-score is chosen as the final model.
The best model and its feature columns are saved automatically as:
- BestModelName.joblib (named after the selected model)
- feature_columns.joblib
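
The selection and saving step might look roughly like this. It is a sketch under assumptions: the helper names `pick_best` and `save_best`, the shape of the `results` dictionary, and the `<name>.joblib` file naming are hypothetical, and the scores are placeholders rather than results from this project. `joblib.dump` is the only library call used.

```python
def pick_best(results):
    """Return (model_name, filename) for the entry with the highest F1-score.

    `results` maps a model name to a dict of metrics containing an "f1" key.
    """
    best_name = max(results, key=lambda name: results[name]["f1"])
    return best_name, f"{best_name}.joblib"


def save_best(model, feature_columns, filename):
    """Persist the fitted model and the training column order."""
    import joblib  # imported here so pick_best has no third-party dependency

    joblib.dump(model, filename)
    joblib.dump(list(feature_columns), "feature_columns.joblib")


# Placeholder scores for illustration only, not results from this project.
example_results = {
    "LogisticRegression": {"f1": 0.80},
    "RandomForest": {"f1": 0.90},
    "XGBoost": {"f1": 0.85},
}
best_name, filename = pick_best(example_results)
```

Saving the column order next to the model lets the real-time extractor rebuild its feature vector in the same order the model was trained on.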
For real-time URL prediction, the project extracts features such as:
- URL length
- Hostname length
- Count of dots, hyphens, slashes, special characters
- Digit count & digit-to-length ratio
- WHOIS information (domain age, registration length)
- DNS record validation
- Simple SSL status
- Redirect count
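
The purely lexical features in this list need only the standard library. The sketch below is illustrative: the function name `lexical_features`, the dictionary keys, and the set of characters counted as "special" are assumptions, and the network-dependent features (WHOIS, DNS, SSL, redirects) are left out.

```python
from urllib.parse import urlparse

# Assumption: which characters count as "special" is not specified in the project.
SPECIAL_CHARS = "@?&=%_~"


def lexical_features(url):
    """Extract simple string-based features from a URL (illustrative sketch)."""
    hostname = urlparse(url).hostname or ""
    digit_count = sum(ch.isdigit() for ch in url)
    return {
        "url_length": len(url),
        "hostname_length": len(hostname),
        "dot_count": url.count("."),
        "hyphen_count": url.count("-"),
        "slash_count": url.count("/"),
        "special_char_count": sum(url.count(ch) for ch in SPECIAL_CHARS),
        "digit_count": digit_count,
        # Guard against an empty string to avoid division by zero.
        "digit_ratio": digit_count / len(url) if url else 0.0,
    }
```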
Unavailable or unsupported features are assigned a default value (-1) to maintain column alignment.
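
One way to apply this default is to build the model input by walking the saved training columns and substituting -1 wherever the extractor produced nothing. The names `align_features` and `MISSING` below are hypothetical, and the column names in the usage example are made up for illustration.

```python
MISSING = -1  # default for features that could not be extracted


def align_features(extracted, feature_columns):
    """Return values ordered like feature_columns, using -1 for gaps.

    Keys in `extracted` that the model was not trained on are ignored.
    """
    row = []
    for col in feature_columns:
        value = extracted.get(col)
        row.append(MISSING if value is None else value)
    return row


# Hypothetical column names: "domain_age" failed to resolve, "extra" is unused.
columns = ["url_length", "domain_age", "dot_count"]
row = align_features({"url_length": 18, "dot_count": 1, "extra": 5}, columns)
```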
The user enters any URL at the console prompt "Enter the url:".
The system outputs:
- Extracted feature vector
- Prediction: Phishing or Legitimate
- Probability scores (if supported by the model)
Example:
URL: http://example.com
Prediction: Legitimate
Probabilities: [0.93, 0.07]
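
A prediction step with optional probabilities could be sketched as below. The function name `predict_url` and the result layout are assumptions; the model is expected to follow the scikit-learn interface, where `predict_proba` columns follow `model.classes_`, so with labels 0 and 1 the first value is the Legitimate probability.

```python
def predict_url(model, feature_row):
    """Classify one aligned feature row and attach probabilities when available."""
    label = int(model.predict([feature_row])[0])
    result = {
        "prediction": "Phishing" if label == 1 else "Legitimate",
        "probabilities": None,
    }
    # Not every classifier exposes predict_proba, so check before calling it.
    if hasattr(model, "predict_proba"):
        result["probabilities"] = [
            float(p) for p in model.predict_proba([feature_row])[0]
        ]
    return result
```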
Ritesh Kumar Pandit
B.Tech CSE, IILM University