This project focuses on building a classification model to distinguish between abusive and non-abusive emails. With the increasing prevalence of online harassment and offensive communication, automated systems for identifying abusive content have become essential.
This project aims to develop a machine-learning model that accurately categorizes emails into abusive and non-abusive categories. This can be particularly useful for email providers, social media platforms, and other online communication platforms to filter out harmful content and ensure a safer environment for their users.
The dataset used for training and evaluation comprises a diverse collection of emails labeled as abusive or non-abusive. The dataset has been preprocessed to remove personally identifiable information and sensitive content.
The classification model is built using natural language processing (NLP) techniques and machine learning algorithms. The process involves:
- Data Preprocessing: Cleaning and tokenizing the text data, removing stopwords, and performing other necessary preprocessing steps.
- Feature Engineering: Extracting relevant features from the text data, such as TF-IDF vectors or word embeddings.
- Model Selection: Evaluating various classification algorithms such as Naive Bayes, SVM, and neural networks to determine the most effective approach.
- Training and Evaluation: Training the selected model on the labeled dataset and evaluating its performance using metrics such as accuracy, precision, recall, and F1-score.
- Deployment: Integrating the trained model into an application or service for real-time classification of incoming emails.
- Python 3.x
- scikit-learn
- NLTK
- Pandas
- NumPy
The performance of the model on the test dataset is as follows:
- Passive Aggressive Classifier--------->99.56%
- Naive Bayes--------------------------->97.10%
- TFIDF---------------------------------->99.61%
- TFIDF: Bigrams------------------------>99.71%
- TFIDF: Trigrams------------------------>99.71%
Contributions to this project are welcome. If you have any suggestions for improvements or would like to report issues, please submit a pull request or open an issue on GitHub.
This project is licensed under the MIT License - see the LICENSE file for details.