
πŸš€ Toxic Terminator: AI-Powered Toxicity Detection πŸ›‘οΈ

License: MIT · Python 3.8+ · Scikit-Learn

"Purifying Digital Spaces One Tweet at a Time" πŸ”βœ¨

πŸ“‹ Table of Contents

  1. πŸ“Œ Project Overview
  2. πŸ“Š Dataset Information
  3. 🧹 Data Preprocessing
  4. βš™οΈ Feature Extraction
  5. πŸ€– Model Training
  6. πŸ“ˆ Model Evaluation
  7. πŸ’» Installation
  8. 🚦 Usage
  9. πŸš€ Future Enhancements
  10. 🀝 Contributing
  11. πŸ“œ License
  12. πŸ™ Acknowledgements
  13. πŸš€ Deployment Instructions

πŸ“Œ Project Overview

Toxic Terminator is an ML-powered shield against online toxicity πŸ›‘οΈ. Our solution helps platforms:

βœ… Automatically flag harmful content
βœ… Improve community moderation
βœ… Protect user mental health
βœ… Maintain positive digital environments


πŸ“Š Dataset Information

πŸ”— Source

  • Kaggle Twitter Toxicity Dataset: https://www.kaggle.com/datasets/ashwiniyer176/toxic-tweets-dataset

πŸ“¦ Dataset Structure

Column     | Type   | Description            | Example
-----------|--------|------------------------|------------------------------
Unnamed: 0 | int64  | Index column (removed) | 0
Toxicity   | int64  | Binary label (0/1)     | 1 (Toxic)
tweet      | object | Tweet text content     | "@user This is offensive..."
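
As a quick illustration, the raw CSV can be loaded and the index column dropped as follows. This is a minimal sketch: the file path is illustrative, so point it at your local copy of the Kaggle download.

import pandas as pd

# Path is illustrative -- adjust to wherever you saved the Kaggle CSV
df = pd.read_csv("data/toxic_tweets.csv")

# Drop the redundant index column described in the table above
df = df.drop(columns=["Unnamed: 0"])
print(df.shape, df.columns.tolist())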

πŸ“Š Class Distribution

print(df['Toxicity'].value_counts(normalize=True))
0    57.4% 🟒 (Non-Toxic)
1    42.6% πŸ”΄ (Toxic)

🧹 Data Preprocessing

πŸ”„ Cleaning Pipeline

  1. πŸ—‘οΈ Remove index column
  2. 🧼 Handle missing values
  3. βœ‚οΈ Text normalization:
    • Remove @mentions
    • Strip URLs
    • Eliminate special characters
    • Convert to lowercase
    • Remove stopwords

βš™οΈ Preprocessing Example

Input:
@user Check this link: http://example.com!!! #toxic

Output:
check link toxic
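
One way to implement the cleaning pipeline above is the sketch below, using re and NLTK stopwords. It is illustrative only; the actual notebook may differ in details such as tokenization or its stopword list.

import re
from nltk.corpus import stopwords  # requires a one-time nltk.download("stopwords")

stop_words = set(stopwords.words("english"))

def clean_tweet(text: str) -> str:
    text = re.sub(r"@\w+", " ", text)               # remove @mentions
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # strip URLs
    text = re.sub(r"[^a-zA-Z\s]", " ", text)        # eliminate special characters
    text = text.lower()                             # convert to lowercase
    tokens = [w for w in text.split() if w not in stop_words]  # remove stopwords
    return " ".join(tokens)

print(clean_tweet("@user Check this link: http://example.com!!! #toxic"))
# -> "check link toxic"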


βš™οΈ Feature Extraction

TF-IDF Vectorization Settings

TfidfVectorizer(
    max_features=10000,       # 🎯 Top 10k terms
    ngram_range=(1, 2),       # πŸ”  Uni+Bigrams
    stop_words=stop_words     # 🚫 Filter common words
)
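
In context, the vectorizer is fit on the training split only and then reused on the test split. A self-contained sketch (the sample texts are illustrative, and "english" stands in for the notebook's own stop_words list):

from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["check link toxic", "great day everyone"]  # illustrative cleaned tweets
test_texts = ["toxic link"]

vectorizer = TfidfVectorizer(
    max_features=10000,
    ngram_range=(1, 2),
    stop_words="english",   # the notebook passes its own stop_words list here
)
X_train = vectorizer.fit_transform(train_texts)  # learn the vocabulary on training data only
X_test = vectorizer.transform(test_texts)        # reuse that vocabulary for the test set
print(X_train.shape, X_test.shape)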

Feature Matrix

Dimension     | Training Shape | Test Shape
--------------|----------------|---------------
TF-IDF Matrix | (45396, 10000) | (11349, 10000)

πŸ€– Model Training

Model Architecture

graph LR
A[Raw Text] --> B(TF-IDF Features)
B --> C{MultinomialNB}
C --> D[Toxicity Prediction]

πŸ‹οΈ Training Parameters

  • Algorithm: Multinomial Naive Bayes
  • Train Size: 45,396 samples (80%)
  • Test Size: 11,349 samples (20%)
  • Serialized As: toxicity_model.pkt
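
Putting those parameters together looks roughly like the sketch below. The matrix X and labels y are random stand-ins for the real TF-IDF features and Toxicity column, and random_state=42 is illustrative.

import os
import pickle
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Stand-ins for the real TF-IDF matrix and Toxicity labels built earlier
X = csr_matrix(np.random.rand(200, 50))
y = np.random.randint(0, 2, size=200)

# 80/20 split, matching the sample counts listed above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)

# Serialize using the repository's .pkt naming
os.makedirs("models", exist_ok=True)
with open("models/toxicity_model.pkt", "wb") as f:
    pickle.dump(model, f)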

πŸ“ˆ Model Evaluation

πŸ“Š Performance Metrics

Metric    | Score  | Visual
----------|--------|------------------------------
Accuracy  | 95.2%  | 🟒🟒🟒🟒🟒🟒🟒🟒🟒🟒
Precision | 92.7%  | πŸ”΅πŸ”΅πŸ”΅πŸ”΅πŸ”΅πŸ”΅πŸ”΅πŸ”΅πŸ”΅
Recall    | 91.3%  | 🟑🟑🟑🟑🟑🟑🟑🟑
F1 Score  | 92.0%  | 🟣🟣🟣🟣🟣🟣🟣🟣🟣
ROC AUC   | 0.9719 | πŸ“ˆ

πŸ” Confusion Matrix

          | Predicted 🟒 | Predicted πŸ”΄
----------|--------------|-------------
Actual 🟒 | 9,823        | 526
Actual πŸ”΄ | 465          | 535
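
These scores map directly onto scikit-learn's standard metric functions. A sketch, continuing from the training snippet above (so model, X_test, and y_test are assumed to already exist):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)

y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the toxic class

print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("F1 Score :", f1_score(y_test, y_pred))
print("ROC AUC  :", roc_auc_score(y_test, y_prob))
print(confusion_matrix(y_test, y_pred))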

πŸ’» Installation

Quick Start

# 1. Clone repository
git clone https://github.com/yxshee/toxic-terminator.git

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run training (the model is built in a Jupyter notebook)
jupyter nbconvert --to notebook --execute notebooks/model.ipynb

🐳 Docker Setup

FROM python:3.8-slim
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
EXPOSE 8501
# app.py is a Streamlit app (see Deployment Instructions), so serve it with streamlit
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0"]
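
With that Dockerfile in place, the image can be built and run like so (the toxic-terminator tag is illustrative):

docker build -t toxic-terminator .
docker run -p 8501:8501 toxic-terminator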

🚦 Usage

Real-Time Prediction

from toxic_detector import ToxicityClassifier

detector = ToxicityClassifier()
tweet = "@user You're completely worthless!"
result = detector.classify(tweet)

print(f"πŸ” Result: {result['label']} (Confidence: {result['probability']:.2%})")

Output:
πŸ” Result: Toxic (Confidence: 98.72%)


πŸš€ Future Enhancements

  • 🌐 Multilingual Support
  • 🧠 BERT/Transformer Integration
  • ⚑ Real-Time API
  • πŸ“± Mobile Integration
  • πŸ”„ Active Learning Pipeline

🀝 Contributing

First Time Contributing? πŸŽ‰ Here's How:

  1. 🌟 Star the Repository
  2. 🍴 Fork the Project
  3. 🌿 Create a Feature Branch
  4. πŸ’» Commit Changes
  5. πŸ”„ Push to Branch
  6. 🎯 Open Pull Request

πŸ“œ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Acknowledgements

Organization | Contribution
-------------|------------------
Kaggle       | Dataset Provision
Scikit-learn | ML Framework
Python       | Core Language

πŸš€ Deployment Instructions

To deploy the project on Streamlit:

  1. Install the required dependencies:
    pip install -r requirements.txt
  2. Ensure that the model files (models/tf_idf.pkt and models/toxicity_model.pkt) are in the project directory.
  3. Launch the app with Streamlit:
    streamlit run app.py
  4. Open the URL provided by Streamlit (usually http://localhost:8501) in your browser.
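
The repository ships its own app.py; for orientation only, a minimal Streamlit front end over the pickled artifacts might look like this illustrative sketch (not the actual file):

import pickle
import streamlit as st

@st.cache_resource  # load the artifacts once per session
def load_artifacts():
    with open("models/tf_idf.pkt", "rb") as f:
        vectorizer = pickle.load(f)
    with open("models/toxicity_model.pkt", "rb") as f:
        model = pickle.load(f)
    return vectorizer, model

vectorizer, model = load_artifacts()

st.title("Toxic Terminator πŸ›‘οΈ")
text = st.text_area("Enter a tweet to check:")
if st.button("Classify") and text:
    prob = model.predict_proba(vectorizer.transform([text]))[0, 1]
    label = "Toxic" if prob >= 0.5 else "Non-Toxic"
    st.write(f"πŸ” Result: {label} (Confidence: {prob:.2%})")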

Made with ❀️ by YXSHEE | πŸ›‘οΈ Keep Conversations Clean!
