This is an AI phishing website detector that predicts whether an input URL is phishing or legitimate. Although the model is trained on 80+ features, it can also predict from the limited set of features that can be extracted from the URL itself.

ritesh-begin/Phishing_website_prediction

🛡️ Phishing Website Detector

📌 Introduction

This project focuses on detecting phishing websites by analyzing URLs using machine learning.
The workflow includes dataset cleaning, feature engineering, training multiple classification models, selecting the best-performing one, and allowing real-time predictions for any user-entered URL.


🎯 Objectives

  • Clean and prepare the provided phishing dataset.
  • Extract important numerical features from the URL dataset.
  • Train three ML models:
    • Logistic Regression
    • Random Forest
    • XGBoost
  • Evaluate each model using accuracy, precision, recall, and F1-score.
  • Automatically select and save the best model.
  • Build a hybrid URL feature extractor to analyze new URLs.
  • Predict whether a given URL is Phishing or Legitimate.

📂 Dataset

  • File Used: provided_dataset.csv
  • Contains:
    • URL-based features
    • Host/lexical properties
    • Labels indicating phishing or legitimate
  • The column status is converted into a binary label:
    • 1 → Phishing
    • 0 → Legitimate
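
The label conversion can be sketched as below. This is a minimal illustration, assuming the status column holds the strings "phishing" and "legitimate" (the source does not state the raw values); `STATUS_TO_LABEL` and `encode_status` are hypothetical names.

```python
# Assumption: raw status values are the strings "phishing" / "legitimate".
STATUS_TO_LABEL = {"phishing": 1, "legitimate": 0}

def encode_status(status):
    """Map a raw status string to the binary label (1 = phishing, 0 = legitimate)."""
    # An unexpected value raises KeyError instead of being silently mislabeled.
    return STATUS_TO_LABEL[status.strip().lower()]

labels = [encode_status(s) for s in ["phishing", "legitimate", "Phishing "]]
print(labels)
```

Raising on unknown values, rather than defaulting to 0, keeps a malformed row from being counted as legitimate.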

🛠️ Tools & Libraries

  • Python
  • Pandas, NumPy – data handling
  • Matplotlib, Seaborn – basic analysis/visualization
  • Scikit-learn – ML models, preprocessing, evaluation
  • XGBoost – gradient boosting classifier
  • WHOIS, socket, requests – real-time URL feature extraction
  • Joblib – saving the trained model and feature columns

🔎 Data Preprocessing

The dataset goes through the following cleaning steps:

  • Removing constant and high-missing columns
  • Converting object-type numeric values to actual numbers
  • Dropping columns that cannot be converted
  • Filling NaN and inf values using median values
  • Preparing feature matrix X and label vector y
  • Splitting into 70% train and 30% test
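
The median fill and the 70/30 split from the steps above can be sketched in plain Python. This is a dependency-free illustration, not the project's code (which uses Pandas and Scikit-learn); the helper names and the seed value are assumptions.

```python
import math
import random
from statistics import median

def fill_with_median(column):
    """Replace NaN and +/-inf entries with the median of the finite values."""
    finite = [v for v in column if math.isfinite(v)]
    fill = median(finite) if finite else 0.0  # assumption: 0.0 if nothing is finite
    return [v if math.isfinite(v) else fill for v in column]

def split_70_30(rows, seed=42):
    """Shuffle the rows reproducibly and split them 70% train / 30% test."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    cut = len(rows) * 7 // 10  # integer arithmetic avoids float rounding
    train = [rows[i] for i in indices[:cut]]
    test = [rows[i] for i in indices[cut:]]
    return train, test

cleaned = fill_with_median([3.0, float("nan"), 5.0, float("inf"), 1.0])
train_rows, test_rows = split_70_30(list(range(10)))
print(cleaned, len(train_rows), len(test_rows))
```

The median is computed from finite values only, so a stray inf cannot distort the fill value.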

🤖 Machine Learning Models

The project trains and evaluates the following classifiers:

  1. Logistic Regression
  2. Random Forest Classifier
  3. XGBoost Classifier

Each model is evaluated using:

  • Accuracy
  • Precision
  • Recall
  • F1-score

The model with the highest F1-score is chosen as the final model.
The best model is saved automatically as BestModelName.joblib, together with feature_columns.joblib, which stores the feature columns.
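
The four metrics and the F1-based selection rule can be sketched as follows. The predictions here are invented toy values for illustration only, not results from this project, and `binary_metrics` is a hypothetical helper; the project itself relies on Scikit-learn for evaluation.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 with phishing (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels and predictions (made up for this sketch).
y_true = [1, 0, 1, 1, 0, 0]
predictions = {
    "LogisticRegression": [1, 0, 0, 0, 0, 1],
    "RandomForest": [1, 0, 1, 1, 0, 1],
    "XGBoost": [1, 0, 1, 0, 0, 0],
}

scores = {name: binary_metrics(y_true, y_pred) for name, y_pred in predictions.items()}
# Selection rule from the text: keep the model with the highest F1-score.
best_name = max(scores, key=lambda name: scores[name]["f1"])
print(best_name, scores[best_name])
```

F1 balances precision and recall, which matters here because both missed phishing URLs and false alarms carry a cost.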


🧪 Hybrid URL Feature Extraction

For real-time URL prediction, the project extracts features such as:

  • URL length
  • Hostname length
  • Count of dots, hyphens, slashes, special characters
  • Digit count & digit-to-length ratio
  • WHOIS information (domain age, registration length)
  • DNS record validation
  • Simple SSL status
  • Redirect count
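
The purely lexical features in this list need no network access and can be sketched with the standard library. The feature names below are hypothetical; the actual column names come from the training dataset and are not given here. WHOIS, DNS, SSL and redirect features are omitted because they require live lookups.

```python
from urllib.parse import urlparse

def lexical_features(url):
    """Extract simple character-count features from the URL string itself."""
    hostname = urlparse(url).hostname or ""  # None when the URL has no scheme/host
    digits = sum(ch.isdigit() for ch in url)
    return {
        # Hypothetical feature names, chosen for illustration.
        "length_url": len(url),
        "length_hostname": len(hostname),
        "nb_dots": url.count("."),
        "nb_hyphens": url.count("-"),
        "nb_slash": url.count("/"),
        "nb_digits": digits,
        "ratio_digits_url": digits / len(url) if url else 0.0,
    }

features = lexical_features("http://example.com/login-page/index2.html")
print(features)
```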

Unavailable or unsupported features are assigned a default value (-1) to maintain column alignment.
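
One way to apply the -1 default is to build the vector by walking the saved column order, as sketched below. The column names and the `align_features` helper are hypothetical; in the project the column list would presumably be loaded from feature_columns.joblib.

```python
def align_features(extracted, feature_columns, default=-1):
    """Return values in training-column order, using the default for missing features."""
    return [extracted.get(column, default) for column in feature_columns]

# Hypothetical column order and extracted values, for illustration only.
feature_columns = ["length_url", "nb_dots", "domain_age", "page_rank"]
extracted = {"length_url": 41, "nb_dots": 2}

vector = align_features(extracted, feature_columns)
print(vector)
```

Iterating over the saved columns, rather than over the extracted dictionary, guarantees the vector has the same length and order the model saw during training.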


🚀 URL Prediction

The user can enter any URL at the console prompt "Enter the url:".

The system outputs:

  • Extracted feature vector
  • Prediction → Phishing or Legitimate
  • Probability scores (if supported by the model)

Example:

URL: http://example.com
Prediction: Legitimate
Probabilities: [0.93, 0.07]
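
The relationship between the probability list and the printed label can be sketched as follows. This assumes the probabilities are ordered by class label, [P(Legitimate), P(Phishing)] for labels 0 and 1, which matches the example output and the way Scikit-learn orders `predict_proba` columns; `describe_prediction` is a hypothetical helper.

```python
LABELS = {0: "Legitimate", 1: "Phishing"}

def describe_prediction(url, probabilities):
    """Pick the most probable class and format the console report."""
    # Assumption: index i of the list is the probability of class label i.
    predicted = max(range(len(probabilities)), key=probabilities.__getitem__)
    return (
        f"URL: {url}\n"
        f"Prediction: {LABELS[predicted]}\n"
        f"Probabilities: {probabilities}"
    )

report = describe_prediction("http://example.com", [0.93, 0.07])
print(report)
```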

Author

Ritesh Kumar Pandit

B.Tech CSE, IILM University
