The purpose of this project is to develop and compare two machine learning models to detect spam emails. Spam detection is a crucial task in email filtering systems to protect users from unwanted and potentially harmful emails. The project involves using a dataset containing various features extracted from email content.

jmarihawkins/Spam-Email-Detection-ML

Overview

The purpose of this project is to develop and compare two machine learning models to detect spam emails. Spam detection is a crucial task in email filtering systems, protecting users from unwanted and potentially harmful emails. The project uses a dataset of features extracted from email content, such as word and character frequencies. By training and evaluating two models, Logistic Regression and a Random Forest Classifier, we aim to determine which one identifies spam emails more accurately.

Functionality

1. Data Import and Preprocessing

  • Import Data:

    • Read the CSV file containing the dataset using Pandas.
    • Display the first few rows to confirm successful import.
  • Data Splitting:

    • Create labels (y) representing whether an email is spam or not.
    • Create features DataFrame (X) by excluding the label column.
    • Split the dataset into training and testing sets using train_test_split.
  • Data Scaling:

    • Use StandardScaler to scale the feature data.
    • Fit the scaler on the training data and transform both training and testing data.
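The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the project's actual notebook: the real CSV file and its column names are not shown here, so the example builds a small synthetic dataset in memory, and the label column name `spam` and the feature names are assumptions made for the demonstration.

```python
import io

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the project's CSV. The column names, including
# the "spam" label column, are hypothetical.
rng = np.random.default_rng(0)
n_rows = 200
demo = pd.DataFrame({
    "word_freq_free": rng.random(n_rows),
    "char_freq_!": rng.random(n_rows),
    "capital_run_length_average": rng.random(n_rows) * 10,
})
demo["spam"] = (demo["word_freq_free"] + demo["char_freq_!"] > 1.0).astype(int)
csv_text = demo.to_csv(index=False)

# Import: read the CSV with pandas and display the first few rows.
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())

# Split: labels (y) are the spam flag, features (X) are every other column.
y = data["spam"]
X = data.drop(columns="spam")
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scale: fit the scaler on the training data only, then transform both sets.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Fitting the scaler on the training split alone keeps information about the test set out of the preprocessing step, so the test accuracy remains an honest estimate.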

2. Model Training and Evaluation

  • Logistic Regression:

    • Create a Logistic Regression model.
    • Fit the model on the scaled training data.
    • Make predictions on the test data.
    • Calculate and print the accuracy score.
  • Random Forest Classifier:

    • Create a Random Forest Classifier model.
    • Fit the model on the scaled training data.
    • Make predictions on the test data.
    • Calculate and print the accuracy score.
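The training and evaluation steps for both models follow the same pattern, sketched below. This is an illustrative example under stated assumptions: the data comes from scikit-learn's `make_classification` rather than the project's dataset, the helper `fit_and_score` is a hypothetical name, and the `random_state` values are arbitrary choices for reproducibility. The accuracies it prints say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data standing in for the spam dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scale the features, fitting the scaler on the training data only.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


def fit_and_score(model):
    """Fit on the scaled training data and return accuracy on the test data."""
    model.fit(X_train_scaled, y_train)
    predictions = model.predict(X_test_scaled)
    return accuracy_score(y_test, predictions)


lr_accuracy = fit_and_score(LogisticRegression(random_state=1))
rf_accuracy = fit_and_score(RandomForestClassifier(random_state=1))

print(f"Logistic Regression accuracy: {lr_accuracy:.3f}")
print(f"Random Forest accuracy: {rf_accuracy:.3f}")
```

Using the same split and the same scaled features for both models keeps the comparison fair: any difference in accuracy comes from the models rather than from the data they were given.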

Summary

In this project, we developed and compared two machine learning models for detecting spam emails: Logistic Regression and a Random Forest Classifier. The Random Forest Classifier achieved the higher accuracy score on the test set.

Relevance of Results:

  • The Random Forest Classifier's higher accuracy is consistent with its ability to model complex, non-linear relationships in the data, which makes it a robust choice for spam detection.
  • Logistic Regression, a simpler linear model, gave good results but was less effective at capturing the more intricate patterns in the dataset.

Real-World Application:

  • Beyond email spam detection, similar approaches can be applied to credit card fraud detection. In this context, machine learning models can be trained to identify fraudulent transactions based on features such as transaction amount, time, location, and user behavior patterns. By accurately detecting fraudulent activities, financial institutions can protect their customers and reduce losses.

Overall, this project demonstrates the importance of choosing the right model for specific tasks and highlights the effectiveness of ensemble methods like Random Forests in handling complex datasets.
