A machine learning project that uses DistilBERT to classify news articles as fake or real with high accuracy and efficiency.
This project implements a fake news detection system using DistilBERT, a lightweight version of BERT that retains about 97% of BERT's language-understanding performance while being 40% smaller and 60% faster. The model is fine-tuned on a dataset of fake and real news articles for binary classification.
The project uses the Fake and Real News Dataset from Kaggle, which contains:
- Fake news articles: Unreliable news articles from various sources
- Real news articles: Legitimate news articles from Reuters
- Features: Article text, subject, and publication date
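As a minimal sketch, loading, labeling, and splitting the data might look like the following (the `text` column name is taken from the published Kaggle dataset; the 1 = fake / 0 = real mapping matches the prediction code later in this README):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the two CSVs (expected in the project root)
fake = pd.read_csv("Fake.csv")
real = pd.read_csv("True.csv")

# Assign labels: 1 = fake, 0 = real
fake["label"] = 1
real["label"] = 0

# Keep only the article text and label, then make the 80/20 train/validation split
df = pd.concat([fake, real], ignore_index=True)[["text", "label"]]
train_df, val_df = train_test_split(df, test_size=0.2, stratify=df["label"], random_state=42)
```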
Key features:
- Efficient Model: Uses DistilBERT for faster training and inference
- Comprehensive Evaluation: Includes accuracy, precision, recall, F1-score, and confusion matrix
- Cross-Validation: 3-fold cross-validation for robust performance assessment
- Model Persistence: Saves trained model and tokenizer for deployment
- Visualization: Confusion matrix heatmap for performance analysis
Install the required packages:

```bash
pip install transformers torch scikit-learn pandas numpy matplotlib seaborn
```
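Alternatively, since the repository ships a requirements.txt, you can install everything at once:

```bash
pip install -r requirements.txt
```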
- Clone the repository:

  ```bash
  git clone https://github.com/yourusername/BERT_model.git
  cd BERT_model
  ```

- Download the dataset:
  - Download the Fake and Real News Dataset from Kaggle
  - Place `Fake.csv` and `True.csv` in the project root directory

- Run the Jupyter notebook:

  ```bash
  jupyter notebook "BERT copy.ipynb"
  ```
Model configuration:
- Architecture: DistilBERT (distilbert-base-uncased)
- Task: Binary classification (Fake vs. Real)
- Sequence Length: 128 tokens (truncated for speed)
- Batch Size: 16 (training and evaluation)
- Epochs: 2 (sufficient for convergence on this dataset)

With this setup the fine-tuned model shows:
- High accuracy across validation sets
- Balanced precision and recall
- Consistent results across 3-fold cross-validation
- Fast inference, suitable for near-real-time applications
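A minimal fine-tuning sketch using the Hugging Face `Trainer` with the hyperparameters above; dataset wiring is elided, and `train_ds`/`val_ds` are assumed to be tokenized datasets built from the split shown earlier:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Tokenize at the 128-token limit used by the project
# (in the notebook this is applied to the splits, e.g. via datasets.Dataset.map)
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

args = TrainingArguments(
    output_dir="results",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
)
# trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
# trainer.train()
```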
```
BERT_model/
├── BERT copy.ipynb        # Main notebook with complete pipeline
├── BERT.ipynb             # Alternative notebook version
├── truthguard.ipynb       # Additional experiments
├── README.md              # Project documentation
├── requirements.txt       # Python dependencies
├── Fake.csv               # Fake news dataset (download separately)
├── True.csv               # Real news dataset (download separately)
├── fake_news_model/       # Model checkpoints during training
├── results/               # Training results and metrics
├── config.json            # Model configuration
├── tokenizer_config.json  # Tokenizer configuration
├── model_info.json        # Model metadata
└── *.bin, *.pt, *.json    # Model weights and configurations
```
Pipeline steps:
- Data Loading: Load the fake and real news datasets
- Preprocessing: Clean text, remove irrelevant columns, assign labels
- Train/Test Split: 80/20 split for training and validation
- Model Setup: Initialize DistilBERT model and tokenizer
- Tokenization: Convert text to BERT-compatible tokens
- Training: Fine-tune model with appropriate hyperparameters
- Evaluation: Comprehensive metrics calculation
- Visualization: Generate confusion matrix
- Cross-Validation: 3-fold CV for robustness testing (sketched after this list)
- Model Saving: Persist model for deployment
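A sketch of the 3-fold cross-validation step, assuming the `df` frame from the loading example above (per-fold training follows the `Trainer` sketch):

```python
from sklearn.model_selection import StratifiedKFold

# Stratified 3-fold CV keeps the fake/real ratio stable in every fold
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(df["text"], df["label"])):
    fold_train, fold_val = df.iloc[train_idx], df.iloc[val_idx]
    # Fine-tune a fresh DistilBERT on fold_train, evaluate on fold_val,
    # then average the metrics across the three folds.
```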
The notebook includes:
- Confusion matrix heatmap showing prediction accuracy (see the sketch after this list)
- Training progress visualization
- Performance metrics comparison
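For illustration, the metrics and the heatmap can be produced as follows; this is a sketch where `y_true` and `y_pred` stand in for the validation labels and model predictions (the toy arrays below are placeholders):

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder labels/predictions; in the notebook these come from the evaluation step
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

# Accuracy, precision, recall, and F1 in one report
print(classification_report(y_true, y_pred, target_names=["Real", "Fake"]))

# Confusion matrix heatmap
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt="d",
            xticklabels=["Real", "Fake"], yticklabels=["Real", "Fake"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
```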
Possible future improvements:
- Experiment with other BERT variants (RoBERTa, ALBERT)
- Implement ensemble methods
- Add data augmentation techniques
- Create a web interface for real-time prediction
- Implement automated model retraining
- Add support for multiple languages
Example inference with the saved model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the saved model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("./saved_model")
model = AutoModelForSequenceClassification.from_pretrained("./saved_model")
model.eval()

# Predict on new text (same 128-token limit used during training)
text = "Your news article text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()

# 0 = Real News, 1 = Fake News
result = "Fake News" if prediction == 1 else "Real News"
print(result)
```
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments:
- Hugging Face for the Transformers library
- The creators of the Fake and Real News Dataset
- The BERT and DistilBERT research teams
References:
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
- Hugging Face Transformers Documentation
⭐ If you found this project helpful, please give it a star!