Built an ML-powered phishing detection tool using Python, scikit-learn, and NLP to classify emails, achieving 90%+ accuracy in detecting malicious content and blocking phishing attempts in real time.

Vikashupadhyay01/Phishing-Email-Detection

Phishing-Email-Detection

Introduction

Phishing attacks continue to be a dominant cybersecurity threat, exploiting deceptive emails to trick recipients into divulging sensitive data or executing malicious actions. These attacks often bypass basic email defenses by mimicking legitimate senders, making robust detection mechanisms essential.

This project implements an automated phishing email detection tool designed to analyze raw email files (.eml) and identify phishing indicators through a multi-layered approach:

  • Signature-based detection using YARA: YARA is a powerful pattern matching engine widely used in malware research and threat hunting. It enables the creation of custom rules to detect specific phishing keywords, suspicious content patterns, or behaviors within email text and headers.

  • Email Authentication Verification (SPF, DKIM, and DMARC): Modern email systems use SPF (Sender Policy Framework), DKIM (DomainKeys Identified Mail), and DMARC (Domain-based Message Authentication, Reporting & Conformance) to validate sender authenticity and counter email spoofing, a common phishing tactic.

    • SPF checks if the sender’s IP address is authorized to send emails on behalf of the domain.
    • DKIM adds a cryptographic signature to emails, allowing receivers to verify that the signed content was not altered in transit and that the signing domain vouches for the message.
    • DMARC builds on SPF and DKIM, instructing receiving servers how to handle messages failing authentication and providing reporting capabilities.
  • URL Extraction and Analysis: Phishing emails frequently contain links to fraudulent websites designed to steal credentials. This tool extracts URLs from email content for further examination or blacklisting.

  • Malware Scanning of Attachments via ClamAV: Attachments in phishing emails often carry malware payloads. Integrating ClamAV antivirus scanning detects known malware signatures within attachments, providing additional protection.

By integrating YARA for content-based detection, authentication protocol verification (SPF, DKIM, DMARC) to assess sender legitimacy, URL analysis, and malware scanning, this tool offers a comprehensive framework for automated phishing email detection. It assists cybersecurity professionals in rapidly identifying and mitigating phishing threats, reducing manual overhead and improving email security hygiene.


Executive Summary

Phishing emails represent a critical attack vector in cybersecurity, leveraging social engineering and email spoofing to compromise user credentials and deploy malware. This project delivers an automated phishing detection system that integrates multiple analytic layers to identify phishing attempts accurately and efficiently.

The tool combines:

  • YARA-based signature detection to identify known phishing patterns and suspicious content within email bodies and headers.
  • Verification of email authentication protocols (SPF, DKIM, DMARC) to detect spoofed or unauthorized senders, strengthening sender validation.
  • URL extraction from emails to pinpoint potentially malicious links for further scrutiny.
  • ClamAV antivirus scanning of attachments to detect embedded malware or malicious payloads.

Designed for ease of deployment and extensibility, the tool supports scanning individual email files or bulk directories. It outputs detailed detection reports highlighting phishing indicators, enabling rapid incident response and improved email security posture.

By automating complex phishing detection workflows, this system enhances operational efficiency for cybersecurity analysts and reduces the risk of successful phishing attacks in organizational environments.


Highlights

  • Comprehensive Phishing Detection: Combines YARA signature matching, email authentication checks, URL extraction, and malware scanning for robust phishing identification.
  • YARA Rules Integration: Leverages customizable YARA rulesets for precise detection of phishing-related keywords and patterns within emails.
  • SPF, DKIM, and DMARC Verification: Validates sender authenticity by analyzing standard email authentication protocols to detect spoofing attempts.
  • URL Extraction and Analysis: Automatically extracts URLs embedded in email bodies, aiding in the detection of malicious or fraudulent links.
  • Attachment Malware Scanning: Employs ClamAV to scan email attachments for viruses and malware, increasing detection coverage.
  • Batch Processing Support: Enables scanning of individual emails or entire directories for efficient bulk analysis.
  • Command-Line Interface: Simple, scriptable CLI allows easy integration into existing security workflows or automation pipelines.
  • Linux Compatibility: Designed for Linux environments, with dependencies managed in isolated Python virtual environments for stability.

Tech Stack

  • Python 3 — Primary programming language for email parsing, scanning logic, and automation.
  • YARA — Pattern matching engine used to define and apply phishing detection rules on email content.
  • ClamAV — Open-source antivirus engine employed to scan attachments for malware and malicious payloads.
  • email (Python standard library) — Parses .eml email files to extract headers, body text, and attachments.
  • re (Regular Expressions) — Utilized for URL extraction from email bodies through pattern matching.
  • SPF, DKIM, DMARC — Email authentication protocols verified programmatically to detect sender spoofing.
  • Virtual Environments (venv) — Used to isolate Python dependencies and maintain clean, reproducible environments.
  • Linux (Kali Linux preferred) — Target platform for running the tool, leveraging native support for security packages.

Installation Guide

Follow these steps to set up the phishing email detection tool on a Kali Linux or Debian-based system:

1. Clone the Repository

git clone https://github.com/Vikashupadhyay01/Phishing-Email-Detection.git
cd Phishing-Email-Detection

2. Create and Activate a Python Virtual Environment

Using a virtual environment keeps dependencies isolated and avoids conflicts with system packages:

python3 -m venv phishing-env
source phishing-env/bin/activate

3. Install Python Dependencies

Inside the activated environment, install the required Python libraries:

pip install -r requirements.txt

Make sure requirements.txt lists yara-python, virustotal-python (only if the VirusTotal integration is used), and any other libraries the scripts import.

4. Install System Dependencies

Install required system packages using apt:

sudo apt update
sudo apt install yara clamav clamav-daemon

5. Update ClamAV Virus Definitions

Before scanning attachments, update the ClamAV database:

sudo freshclam

6. Verify Installation

Check that yara and clamav are installed correctly:

yara --version
clamscan --version

7. Ready to Use

You can now scan .eml files with:

python3 phish_detect.py sample_email.eml

Example requirements.txt listing the Python dependencies:

yara-python
virustotal-python

(Add any other Python libraries the scripts import.)

Usage Examples

Scan a Single Email File

python3 phish_detect.py path/to/email_file.eml

This command will analyze the specified email file for phishing indicators, including:

  • YARA rule matches
  • URLs extracted from the email body
  • Malware scanning results on any attachments

Scan All Emails in a Directory

python3 phish_detect.py path/to/eml_directory/

The tool recursively scans all .eml files inside the given folder, reporting findings for each.
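
The single-file and directory modes can share one entry point. The following is a minimal sketch of how the input path might be expanded into a list of files; `collect_eml_files` is a hypothetical helper named here for illustration, and the actual logic in phish_detect.py may differ.

```python
from pathlib import Path


def collect_eml_files(target: str) -> list[Path]:
    """Return the email files to scan for a file or directory argument."""
    path = Path(target)
    if path.is_file():
        # A single file is scanned as given, whatever its extension.
        return [path]
    if path.is_dir():
        # rglob also walks subdirectories, matching the recursive behavior.
        return sorted(p for p in path.rglob("*.eml") if p.is_file())
    raise FileNotFoundError(f"No such file or directory: {target}")
```

Sorting the result keeps the report order stable between runs, which makes bulk scans easier to compare.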


Sample Output:

Scanning sample_email.eml ...
No phishing YARA rules matched.
Extracted URLs:
  - http://malicious-link.com/login
Attachments scanned: 1 (No malware detected)

Configuration and Usage

Configuration

  • YARA Rules: Phishing detection relies on YARA rules defined in phishing_email_keywords.yar. You can customize or extend this file to add new phishing patterns, keywords, or suspicious email header indicators. To modify or add rules, follow the YARA documentation.

  • VirusTotal API Key (Optional): If integrated, the tool can query VirusTotal to analyze extracted URLs or attachments. Set your VirusTotal API key as an environment variable before running the tool:

    export VT_API_KEY="your_virustotal_api_key_here"
  • ClamAV Configuration: ClamAV runs as a daemon or via clamscan. Ensure the virus database is regularly updated with:

    sudo freshclam

Usage

  1. Activate the Python Virtual Environment:

    source phishing-env/bin/activate
  2. Run the Tool on a Single Email File:

    python3 phish_detect.py path/to/email.eml
  3. Run the Tool on a Directory of Emails:

    python3 phish_detect.py path/to/email_folder/
  4. Interpreting Output:

    • The tool reports matched YARA rules, extracted URLs, and any malware detections on attachments.
    • Email authentication results (SPF, DKIM, DMARC) are displayed to help identify spoofed senders.
    • URLs extracted can be manually or automatically checked against threat intelligence feeds or VirusTotal.

Working Principle

The phishing email detection tool operates through a multi-layered analysis pipeline designed to uncover various indicators of phishing attacks within raw email files (.eml). The key stages include:

1. Email Parsing

  • The tool uses Python's built-in email library to parse the .eml file.
  • It extracts critical components such as headers (From, To, Subject, Received), the email body (text and HTML), and attachments.
  • This decomposition allows targeted analysis of each part.
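
The parsing stage can be sketched with the standard library alone. This is an illustrative outline, not the project's actual code: `parse_eml` and the shape of the returned dictionary are assumptions made for the example.

```python
from email import policy
from email.parser import BytesParser


def parse_eml(raw: bytes) -> dict:
    """Split a raw .eml message into headers, body parts, and attachments."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    parsed = {
        "headers": {name: str(msg.get(name, "")) for name in ("From", "To", "Subject")},
        "received": [str(value) for value in msg.get_all("Received", [])],
        "text": "",
        "html": "",
        "attachments": [],  # list of (filename, payload bytes)
    }
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        if part.is_attachment():
            payload = part.get_payload(decode=True) or b""
            parsed["attachments"].append((part.get_filename() or "unnamed", payload))
        elif part.get_content_type() == "text/plain":
            parsed["text"] += part.get_content()
        elif part.get_content_type() == "text/html":
            parsed["html"] += part.get_content()
    return parsed
```

Using `policy.default` returns `EmailMessage` objects, which provide `is_attachment()` and `get_content()` and decode headers and bodies to text automatically.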

2. Signature-Based Detection with YARA

  • The email body and headers are scanned against a predefined set of YARA rules that encapsulate known phishing keywords, suspicious phrases, and behavioral patterns.
  • YARA's flexible pattern matching enables detection of sophisticated phishing tactics beyond simple keyword searches.
  • If any rules match, the tool flags the email as potentially phishing.
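
As a rough illustration of this stage, the sketch below compiles a rule with the yara-python bindings and returns the names of matching rules. The rule `Example_Credential_Lure` and the helper `match_rules` are hypothetical and are not taken from phishing_email_keywords.yar; the guarded import is only there so the snippet loads on machines without yara-python.

```python
try:
    import yara  # provided by the yara-python package
except ImportError:  # keeps the sketch importable where yara-python is absent
    yara = None

# Hypothetical rule for illustration only.
EXAMPLE_RULE = r'''
rule Example_Credential_Lure
{
    strings:
        $verify = "verify your account" nocase
        $urgent = "urgent action required" nocase
    condition:
        any of them
}
'''


def match_rules(text: str, rules_source: str = EXAMPLE_RULE) -> list[str]:
    """Return the names of YARA rules that match the given email text."""
    if yara is None:
        raise RuntimeError("yara-python is not installed")
    rules = yara.compile(source=rules_source)
    matches = rules.match(data=text.encode("utf-8", "replace"))
    return [match.rule for match in matches]
```

To load the project's rule file instead of an inline string, `yara.compile(filepath="phishing_email_keywords.yar")` can be used in place of the `source=` form.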

3. Email Authentication Validation

  • The tool programmatically checks SPF, DKIM, and DMARC authentication results embedded in the email headers or via DNS lookups.
  • These protocols help verify if the sender’s domain and IP address are authorized, and whether the email has been tampered with.
  • Failures or misalignments in these checks often indicate spoofed or fraudulent emails.
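
One way to read the verdicts already embedded in the headers is to parse the Authentication-Results header added by the receiving mail server. The sketch below assumes that header is present; `auth_results` is a hypothetical helper, and it does not perform the DNS lookups mentioned above.

```python
import re

# Matches "spf=pass", "dkim=fail", "dmarc=none", and similar result tokens.
_AUTH_RESULT = re.compile(r"\b(spf|dkim|dmarc)\s*=\s*([a-z]+)", re.IGNORECASE)


def auth_results(header_value: str) -> dict[str, str]:
    """Extract SPF, DKIM, and DMARC verdicts from an Authentication-Results value."""
    results = {"spf": "missing", "dkim": "missing", "dmarc": "missing"}
    for method, verdict in _AUTH_RESULT.findall(header_value):
        # If a method appears more than once, the last verdict wins.
        results[method.lower()] = verdict.lower()
    return results
```

Only an Authentication-Results header written by a server you trust should be relied on, because a sender can insert a forged copy of this header into the message.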

4. URL Extraction and Analysis

  • The tool extracts all URLs from the email body using regular expressions.
  • Extracted URLs are then listed for manual review or optionally sent to VirusTotal or other threat intelligence platforms for reputation checks.
  • Suspicious or obfuscated URLs are flagged as potential phishing links.
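
The extraction step might look like the following sketch. The pattern is deliberately simple and is an assumption, not the expression used by the tool; `extract_urls` is a hypothetical name.

```python
import re

# Simple pattern: an http(s) scheme followed by characters that are not
# whitespace, angle brackets, or quotes.
_URL = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_urls(body: str) -> list[str]:
    """Return unique URLs in order of first appearance."""
    seen = []
    for url in _URL.findall(body):
        url = url.rstrip(".,;:!?)")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen
```

A pattern like this misses links without a scheme and obfuscated links (for example `hxxp://`), so it is best treated as a first pass before reputation checks.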

5. Attachment Malware Scanning

  • Attachments extracted from the email are scanned with ClamAV antivirus.
  • This step detects known malware signatures or suspicious payloads embedded within the attachments.
  • A clean result lowers the risk of infection but does not rule it out, since signature-based scanning only catches malware that ClamAV already knows.
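
A minimal sketch of the scanning step, assuming each attachment has first been written to disk and that the `clamscan` command-line scanner is used rather than the daemon. `scan_attachment` and `verdict_from_exit_code` are hypothetical helpers.

```python
import shutil
import subprocess

# clamscan exit codes: 0 = no virus found, 1 = virus found, 2 = an error occurred.
_VERDICTS = {0: "clean", 1: "infected", 2: "error"}


def verdict_from_exit_code(code: int) -> str:
    """Map a clamscan exit code to a short verdict string."""
    return _VERDICTS.get(code, "error")


def scan_attachment(path: str) -> str:
    """Scan one saved attachment with clamscan and return a verdict string."""
    if shutil.which("clamscan") is None:
        return "error"  # ClamAV is not installed or not on PATH
    proc = subprocess.run(
        ["clamscan", "--no-summary", path],
        capture_output=True,
        text=True,
    )
    return verdict_from_exit_code(proc.returncode)
```

Passing the arguments as a list avoids shell interpretation of attachment filenames, which are attacker-controlled.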

6. Reporting

  • The tool consolidates findings from all stages into a clear, actionable report.
  • It highlights matched YARA rules, authentication results, suspicious URLs, and malware detection status.
  • This report enables rapid decision-making by security analysts.
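
The consolidation step can be sketched as a small data class that gathers the findings and renders them as text. The `ScanReport` class is hypothetical, and its output format is modeled on the sample output shown earlier rather than taken from the tool's source.

```python
from dataclasses import dataclass, field


@dataclass
class ScanReport:
    filename: str
    yara_matches: list[str] = field(default_factory=list)
    auth: dict[str, str] = field(default_factory=dict)
    urls: list[str] = field(default_factory=list)
    attachment_verdicts: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Build a plain-text report from the collected findings."""
        lines = [f"Scanning {self.filename} ..."]
        if self.yara_matches:
            lines.append("Matched YARA rules: " + ", ".join(self.yara_matches))
        else:
            lines.append("No phishing YARA rules matched.")
        for method in ("spf", "dkim", "dmarc"):
            if method in self.auth:
                lines.append(f"{method.upper()}: {self.auth[method]}")
        if self.urls:
            lines.append("Extracted URLs:")
            lines.extend(f"  - {url}" for url in self.urls)
        infected = self.attachment_verdicts.count("infected")
        status = f"{infected} infected" if infected else "No malware detected"
        lines.append(f"Attachments scanned: {len(self.attachment_verdicts)} ({status})")
        return "\n".join(lines)
```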

Common Issues

1. ModuleNotFoundError for yara or Other Libraries

  • Cause: The Python YARA bindings or other dependencies are not installed in your active environment.

  • Solution:

    • Ensure you have activated your Python virtual environment:

      source phishing-env/bin/activate
    • Install YARA Python bindings inside the environment:

      pip install yara-python
    • If installation fails, verify system dependencies for YARA are installed:

      sudo apt install yara libyara-dev
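
When the import still fails, it helps to confirm which interpreter is running and whether it belongs to a virtual environment. A small diagnostic sketch (the function name is hypothetical):

```python
import sys


def in_virtualenv() -> bool:
    """True when the interpreter runs inside an environment created by venv."""
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Virtual environment active:", in_virtualenv())
```

If the interpreter path does not point inside phishing-env, the packages were probably installed into a different Python than the one running the script.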

2. ClamAV Fails to Scan or Reports Outdated Database

  • Cause: Virus signature database is missing or outdated.

  • Solution:

    • Update virus definitions before scanning attachments:

      sudo freshclam
    • Keep definitions current automatically by enabling the freshclam update service (this is the updater, not the clamd scanning daemon):

      sudo systemctl start clamav-freshclam
      sudo systemctl enable clamav-freshclam

3. Permission Denied Errors When Running Scripts

  • Cause: Missing execute permissions or running commands without proper privileges.

  • Solution:

    • If you run the script directly (./phish_detect.py), give it execute permission; this is not needed when invoking it through python3:

      chmod +x phish_detect.py
    • Run commands with appropriate privileges or inside the virtual environment.

4. Email Parsing Errors

  • Cause: The input file is not a valid .eml format or is corrupted.

  • Solution:

    • Verify the email file is properly saved in standard .eml format.
    • Try opening the file with an email client to confirm validity.
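
A quick programmatic check can complement the manual one. The sketch below is a heuristic, not a full validator: it only tests whether the parser finds at least one common header. `looks_like_email` is a hypothetical helper.

```python
from email import policy
from email.parser import BytesParser

_COMMON_HEADERS = ("From", "Date", "Subject", "Received")


def looks_like_email(raw: bytes) -> bool:
    """Heuristic: a usable .eml should expose at least one common header."""
    msg = BytesParser(policy=policy.default).parsebytes(raw)
    return any(msg.get(name) is not None for name in _COMMON_HEADERS)
```

The standard-library parser is tolerant and rarely raises on malformed input, so checking for expected headers is usually more informative than catching exceptions.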

5. VirusTotal API Key Not Recognized

  • Cause: Environment variable for API key not set or invalid key.

  • Solution:

    • Export the key before running the script:

      export VT_API_KEY="your_virustotal_api_key"
    • Verify your API key on VirusTotal and check quota limits.
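
To tell a missing key apart from an invalid one, the script can check the environment variable before making any request. A minimal sketch, assuming the `VT_API_KEY` variable name used above; `get_vt_api_key` is a hypothetical helper.

```python
import os


def get_vt_api_key(environ=os.environ):
    """Return the VirusTotal API key, or None when the variable is unset or blank."""
    key = environ.get("VT_API_KEY", "").strip()
    return key or None
```

Returning `None` lets the caller skip the optional VirusTotal lookups instead of failing the whole scan.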

6. No YARA Rules Matched Despite Suspected Phishing

  • Cause: YARA ruleset is outdated or incomplete.

  • Solution:

    • Regularly update or customize the phishing_email_keywords.yar file to include new phishing patterns.
    • Consider integrating threat intelligence feeds to enrich detection.

Virtual Environment Setup

To ensure a clean, controlled environment and avoid conflicts with system packages, it is recommended to run the phishing email detection tool inside a Python virtual environment.

Step 1: Install Python 3 and venv (if not already installed)

sudo apt update
sudo apt install python3 python3-venv python3-pip

Step 2: Create a Virtual Environment

Navigate to your project directory and create a virtual environment named phishing-env:

python3 -m venv phishing-env

This command creates an isolated directory with its own Python interpreter and libraries.

Step 3: Activate the Virtual Environment

Activate the environment to use its Python interpreter and installed packages:

source phishing-env/bin/activate

Your shell prompt will change to indicate the active environment.

Step 4: Install Required Python Packages

With the virtual environment activated, install dependencies via pip:

pip install -r requirements.txt

This installs all Python packages needed for your tool, such as yara-python and virustotal-python.

Step 5: Run the Tool Inside the Virtual Environment

Run the detection script with the environment's interpreter, isolated from the system Python:

python3 phish_detect.py path/to/email.eml

Step 6: Deactivate the Virtual Environment

After finishing, exit the environment by running:

deactivate

Benefits of Using Virtual Environments

  • Isolation: Avoid dependency conflicts with other Python projects or system libraries.
  • Reproducibility: Easily share requirements.txt so others can replicate your setup.
  • Safety: Leaves the system Python untouched, reducing the risk of breaking core tools.

Getting Involved

We welcome contributions from the community to improve this phishing email detection tool. Here’s how you can get involved:

1. Report Issues

If you encounter bugs, errors, or unexpected behavior, please open an issue on the GitHub repository. Include as much detail as possible: error messages, environment info, and steps to reproduce.

2. Suggest Features

Have ideas to enhance detection accuracy, add new functionality, or improve usability? Submit feature requests via GitHub issues or discuss them in the project discussions tab.

3. Contribute Code

  • Fork the Repository: Create your own copy of the project repository on GitHub.
  • Create a Feature Branch: Work on your changes in a separate branch to keep development organized.
  • Follow Code Style: Maintain consistent coding standards and add comments where necessary.
  • Test Your Changes: Ensure new code is tested and does not break existing functionality.
  • Submit a Pull Request: Provide a clear description of the changes and their benefits.

4. Update YARA Rules

Phishing tactics evolve rapidly. Contributing new or improved YARA rules is one of the most valuable ways to enhance detection. Share updated rule files via pull requests.

5. Improve Documentation

Good documentation helps others understand and use the tool effectively. Contributions to README, usage guides, and troubleshooting docs are highly appreciated.


License

This project is licensed under the MIT License.

Summary of the MIT License

  • Permission: You are free to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software.

  • Conditions: The above copyright notice and this permission notice must be included in all copies or substantial portions of the software.

  • Disclaimer: The software is provided “as is,” without warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose, and noninfringement.

Why MIT?

The MIT License is a permissive open-source license. It places few restrictions on use and contribution, which encourages wide adoption, while its warranty disclaimer limits the original authors' liability.

