Fake News Detection Pipeline

Real-time fake news detection system using machine learning and streaming data from Bluesky social network.


Python • Flask • Apache Kafka • Apache Spark • OpenSearch • scikit-learn • Docker


🎯 Overview

This project implements an end-to-end pipeline for detecting potentially fake news content in social media posts. It combines:

  • Machine Learning: TF-IDF + Logistic Regression classifier trained on a labeled news dataset
  • Real-time Streaming: Consumes live posts from the Bluesky Jetstream firehose
  • Distributed Processing: Apache Kafka for messaging, Spark for stream processing
  • Search & Analytics: OpenSearch for indexing and analyzing scored posts
  • REST API: Flask-based prediction service for inference

📊 Architecture

Bluesky Jetstream → Kafka → Spark Streaming → OpenSearch
                      ↓
                  ML Model (REST API)

Data Flow:

  1. Bluesky Producer subscribes to the Jetstream firehose and publishes posts to Kafka (a sketch follows this list)
  2. Spark Consumer reads from Kafka, applies heuristic scoring, and indexes to OpenSearch
  3. ML API serves trained model for on-demand predictions
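
A minimal sketch of step 1, assuming the websockets and kafka-python packages and the documented Jetstream commit-event format; the actual src/streaming/bluesky_producer.py may be structured differently:

import asyncio
import json
import os

import websockets                   # assumed WebSocket client
from kafka import KafkaProducer     # assumed Kafka client (kafka-python)

JETSTREAM_URL = os.getenv("JETSTREAM_URL", "wss://jetstream1.us-east.bsky.network/subscribe")
KAFKA_BOOTSTRAP = os.getenv("KAFKA_BOOTSTRAP", "localhost:19092")
TOPIC = os.getenv("TOPIC", "posts_scored")

async def stream_posts():
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BOOTSTRAP,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # Ask Jetstream for post-creation events only
    url = f"{JETSTREAM_URL}?wantedCollections=app.bsky.feed.post"
    async with websockets.connect(url) as ws:
        async for raw in ws:
            event = json.loads(raw)
            record = event.get("commit", {}).get("record", {})
            text = record.get("text")
            if text:
                producer.send(TOPIC, {"text": text, "created_at": record.get("createdAt")})

if __name__ == "__main__":
    asyncio.run(stream_posts())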

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Apache Kafka
  • Apache Spark 3.5+
  • OpenSearch
  • Docker (optional)

Installation

# Clone repository
git clone https://github.com/yourusername/FakeNewsDetection.git
cd FakeNewsDetection

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp config/.env.example .env
# Edit .env with your configuration

Training the Model

# Ensure fake_or_real_news.csv is in the root directory
python src/ml/train_model.py

This will create artifacts/model.joblib and artifacts/vectorizer.joblib.
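
In essence, the training step fits a TF-IDF + Logistic Regression pipeline and saves both artifacts with joblib. A sketch, assuming the title, text, and label columns described under Model Details and a FAKE/REAL label encoding; see src/ml/train_model.py for the exact pipeline:

import os

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("fake_or_real_news.csv")
texts = df["title"].fillna("") + " " + df["text"].fillna("")
labels = (df["label"] == "FAKE").astype(int)   # assumed encoding: 1 = Fake, 0 = Real

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# TF-IDF with unigrams + bigrams, capped at 50,000 features (see Model Details)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
X_train_vec = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)
print("test accuracy:", model.score(vectorizer.transform(X_test), y_test))

os.makedirs("artifacts", exist_ok=True)
joblib.dump(model, "artifacts/model.joblib")
joblib.dump(vectorizer, "artifacts/vectorizer.joblib")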

Running the Pipeline

1. Start Bluesky Producer (Kafka)

export KAFKA_BOOTSTRAP=localhost:19092
export TOPIC=posts_scored
python src/streaming/bluesky_producer.py

2. Start Spark → OpenSearch Consumer

spark-submit \
  --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.11.0 \
  src/streaming/spark_opensearch_sink.py
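
A sketch of what the sink does, assuming Spark Structured Streaming with the elasticsearch-spark connector pointed at OpenSearch in compatibility mode; heuristic scoring is omitted for brevity and the actual src/streaming/spark_opensearch_sink.py may differ:

import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("fake-news-sink").getOrCreate()

schema = StructType([
    StructField("text", StringType()),
    StructField("created_at", StringType()),
])

# Read raw JSON messages from the Kafka topic
posts = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", os.getenv("KAFKA_BOOTSTRAP", "localhost:19092"))
    .option("subscribe", os.getenv("KAFKA_TOPIC", "posts_scored"))
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("post"))
    .select("post.*")
)

# Write each micro-batch to the target index via the elasticsearch-spark connector
query = (
    posts.writeStream.format("es")
    .option("checkpointLocation", "/tmp/fake-news-checkpoint")
    .option("es.nodes", os.getenv("OPENSEARCH_HOST", "10.142.0.3"))
    .option("es.port", "9200")
    .option("es.nodes.wan.only", "true")
    .start(os.getenv("OPENSEARCH_INDEX", "posts_scored"))
)
query.awaitTermination()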

3. Start Prediction API

python src/api/prediction_service.py

The API will be available at http://localhost:5000
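
The service is essentially a thin Flask wrapper around the saved artifacts. A minimal sketch, with endpoint names taken from the curl examples below; the actual src/api/prediction_service.py may add validation and logging:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("artifacts/model.joblib")
vectorizer = joblib.load("artifacts/vectorizer.joblib")

@app.get("/health")
def health():
    return jsonify({"status": "ok"})

@app.post("/predict")
def predict():
    payload = request.get_json(force=True)
    # Combine title and text the same way the training script does
    text = f"{payload.get('title', '')} {payload.get('text', '')}"
    features = vectorizer.transform([text])
    prob_fake = float(model.predict_proba(features)[0, 1])  # probability of class 1 = Fake
    return jsonify({"prob_fake": round(prob_fake, 2), "pred_label": int(prob_fake >= 0.5)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)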

Using the API

# Health check
curl http://localhost:5000/health

# Predict fake news probability
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Breaking: Miracle cure discovered!",
    "text": "Click here for 100% guaranteed results!!!"
  }'

# Response:
# {
#   "prob_fake": 0.87,
#   "pred_label": 1
# }
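
The same prediction call from Python, using the requests package:

import requests

resp = requests.post(
    "http://localhost:5000/predict",
    json={
        "title": "Breaking: Miracle cure discovered!",
        "text": "Click here for 100% guaranteed results!!!",
    },
)
print(resp.json())  # e.g. {"prob_fake": 0.87, "pred_label": 1}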

📁 Project Structure

FakeNewsDetection/
├── src/
│   ├── api/
│   │   └── prediction_service.py    # Flask REST API
│   ├── ml/
│   │   └── train_model.py           # Model training script
│   └── streaming/
│       ├── bluesky_producer.py      # Kafka producer (Bluesky → Kafka)
│       └── spark_opensearch_sink.py # Spark consumer (Kafka → OpenSearch)
├── config/
│   └── .env.example                 # Environment configuration template
├── artifacts/                       # Trained model artifacts (generated)
├── requirements.txt                 # Python dependencies
├── Dockerfile                       # Container image for API service
└── README.md

🔧 Configuration

Key environment variables (see config/.env.example):

Variable          | Description               | Default
KAFKA_BOOTSTRAP   | Kafka broker address      | localhost:19092
KAFKA_TOPIC       | Topic for scored posts    | posts_scored
JETSTREAM_URL     | Bluesky firehose endpoint | wss://jetstream1.us-east.bsky.network/subscribe
OPENSEARCH_HOST   | OpenSearch host           | 10.142.0.3
OPENSEARCH_INDEX  | Target index name         | posts_scored
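
These are plain environment variables, so each component can read them with fallbacks to the defaults above. A sketch assuming python-dotenv is used to load the .env file:

import os

from dotenv import load_dotenv   # python-dotenv (assumed)

load_dotenv()  # reads .env from the working directory, if present

KAFKA_BOOTSTRAP = os.getenv("KAFKA_BOOTSTRAP", "localhost:19092")
KAFKA_TOPIC = os.getenv("KAFKA_TOPIC", "posts_scored")
JETSTREAM_URL = os.getenv("JETSTREAM_URL", "wss://jetstream1.us-east.bsky.network/subscribe")
OPENSEARCH_HOST = os.getenv("OPENSEARCH_HOST", "10.142.0.3")
OPENSEARCH_INDEX = os.getenv("OPENSEARCH_INDEX", "posts_scored")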

🧪 Model Details

  • Algorithm: Logistic Regression with TF-IDF features
  • Features: Unigrams + bigrams (max 50,000 features)
  • Labels: Binary classification (0 = Real, 1 = Fake)
  • Dataset: fake_or_real_news.csv with title, text, and label columns

Heuristic Scoring (used in the streaming pipeline; sketched after the list below):

  • Multiple exclamation marks
  • Sensational keywords (e.g., "secreto" [secret], "milagro" [miracle], "100% garantizado" [100% guaranteed])
  • URL presence
  • Very short text length
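
A sketch of how those signals could combine into a score; the weights and threshold here are illustrative only, and the actual rules live in src/streaming/spark_opensearch_sink.py:

import re

SENSATIONAL_KEYWORDS = {"secreto", "milagro", "100% garantizado"}  # from the list above

def heuristic_score(text: str) -> float:
    """Return a rough 0..1 'suspiciousness' score for a post."""
    text_lower = text.lower()
    score = 0.0
    if "!!" in text:                                    # multiple exclamation marks
        score += 0.3
    if any(kw in text_lower for kw in SENSATIONAL_KEYWORDS):
        score += 0.3                                    # sensational keywords
    if re.search(r"https?://", text_lower):             # URL presence
        score += 0.2
    if len(text.strip()) < 40:                          # very short text
        score += 0.2
    return min(score, 1.0)

print(heuristic_score("Milagro!!! 100% garantizado, click http://ejemplo.com"))  # ≈ 0.8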

🐳 Docker Deployment

# Build image
docker build -t fake-news-api .

# Run container
docker run -p 5000:5000 \
  -v $(pwd)/artifacts:/app/artifacts \
  fake-news-api

👥 Authors

  • Valentina Morales Villada
  • Nicolas Betancur Ochoa
  • Sebastian Salazar Osorio

📚 Documentation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

This project was developed as part of coursework at EAFIT University.


Built with Python • Kafka • Spark • OpenSearch • Flask • scikit-learn
