Fake News Detection Pipeline

Real-time fake news detection system using machine learning and streaming data from Bluesky social network.


Python • Flask • Apache Kafka • Apache Spark • OpenSearch • scikit-learn • Docker


🎯 Overview

This project implements an end-to-end pipeline for detecting potentially fake news content in social media posts. It combines:

  • Machine Learning: TF-IDF + Logistic Regression classifier trained on a labeled news dataset
  • Real-time Streaming: Consumes live posts from the Bluesky Jetstream firehose
  • Distributed Processing: Apache Kafka for messaging, Spark for stream processing
  • Search & Analytics: OpenSearch for indexing and analyzing scored posts
  • REST API: Flask-based prediction service for inference

📊 Architecture

Bluesky Jetstream → Kafka → Spark Streaming → OpenSearch
                      ↓
                  ML Model (REST API)

Data Flow:

  1. Bluesky Producer subscribes to the Jetstream firehose and publishes posts to Kafka (a sketch follows this list)
  2. Spark Consumer reads from Kafka, applies heuristic scoring, and indexes to OpenSearch
  3. ML API serves trained model for on-demand predictions
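
A minimal sketch of step 1, assuming the websockets and kafka-python packages and the documented Jetstream commit-event format; the actual src/streaming/bluesky_producer.py may be structured differently:

import asyncio
import json
import os

import websockets                   # assumed WebSocket client
from kafka import KafkaProducer     # assumed Kafka client (kafka-python)

JETSTREAM_URL = os.getenv("JETSTREAM_URL", "wss://jetstream1.us-east.bsky.network/subscribe")
KAFKA_BOOTSTRAP = os.getenv("KAFKA_BOOTSTRAP", "localhost:19092")
TOPIC = os.getenv("TOPIC", "posts_scored")

async def stream_posts():
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BOOTSTRAP,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    # Ask Jetstream for post-creation events only
    url = f"{JETSTREAM_URL}?wantedCollections=app.bsky.feed.post"
    async with websockets.connect(url) as ws:
        async for raw in ws:
            event = json.loads(raw)
            record = event.get("commit", {}).get("record", {})
            text = record.get("text")
            if text:
                producer.send(TOPIC, {"text": text, "created_at": record.get("createdAt")})

if __name__ == "__main__":
    asyncio.run(stream_posts())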

🚀 Quick Start

Prerequisites

  • Python 3.11+
  • Apache Kafka
  • Apache Spark 3.5+
  • OpenSearch
  • Docker (optional)

Installation

# Clone repository
git clone https://github.com/yourusername/FakeNewsDetection.git
cd FakeNewsDetection

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Configure environment
cp config/.env.example .env
# Edit .env with your configuration

Training the Model

# Ensure fake_or_real_news.csv is in the root directory
python src/ml/train_model.py

This will create artifacts/model.joblib and artifacts/vectorizer.joblib.
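
In essence, the training step fits a TF-IDF + Logistic Regression pipeline and saves both artifacts with joblib. A sketch, assuming the title, text, and label columns described under Model Details and a FAKE/REAL label encoding; see src/ml/train_model.py for the exact pipeline:

import os

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("fake_or_real_news.csv")
texts = df["title"].fillna("") + " " + df["text"].fillna("")
labels = (df["label"] == "FAKE").astype(int)   # assumed encoding: 1 = Fake, 0 = Real

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=42)

# TF-IDF with unigrams + bigrams, capped at 50,000 features (see Model Details)
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)
X_train_vec = vectorizer.fit_transform(X_train)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)
print("test accuracy:", model.score(vectorizer.transform(X_test), y_test))

os.makedirs("artifacts", exist_ok=True)
joblib.dump(model, "artifacts/model.joblib")
joblib.dump(vectorizer, "artifacts/vectorizer.joblib")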

Running the Pipeline

1. Start Bluesky Producer (Kafka)

export KAFKA_BOOTSTRAP=localhost:19092
export TOPIC=posts_scored
python src/streaming/bluesky_producer.py

2. Start Spark → OpenSearch Consumer

spark-submit \
  --packages org.elasticsearch:elasticsearch-spark-30_2.12:8.11.0 \
  src/streaming/spark_opensearch_sink.py
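
A sketch of what the sink does, assuming Spark Structured Streaming with the elasticsearch-spark connector pointed at OpenSearch in compatibility mode; heuristic scoring is omitted for brevity and the actual src/streaming/spark_opensearch_sink.py may differ:

import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("fake-news-sink").getOrCreate()

schema = StructType([
    StructField("text", StringType()),
    StructField("created_at", StringType()),
])

# Read raw JSON messages from the Kafka topic
posts = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", os.getenv("KAFKA_BOOTSTRAP", "localhost:19092"))
    .option("subscribe", os.getenv("KAFKA_TOPIC", "posts_scored"))
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("post"))
    .select("post.*")
)

# Write each micro-batch to the target index via the elasticsearch-spark connector
query = (
    posts.writeStream.format("es")
    .option("checkpointLocation", "/tmp/fake-news-checkpoint")
    .option("es.nodes", os.getenv("OPENSEARCH_HOST", "10.142.0.3"))
    .option("es.port", "9200")
    .option("es.nodes.wan.only", "true")
    .start(os.getenv("OPENSEARCH_INDEX", "posts_scored"))
)
query.awaitTermination()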

3. Start Prediction API

python src/api/prediction_service.py

The API will be available at http://localhost:5000
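
The service is essentially a thin Flask wrapper around the saved artifacts. A minimal sketch, with endpoint names taken from the curl examples below; the actual src/api/prediction_service.py may add validation and logging:

import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("artifacts/model.joblib")
vectorizer = joblib.load("artifacts/vectorizer.joblib")

@app.get("/health")
def health():
    return jsonify({"status": "ok"})

@app.post("/predict")
def predict():
    payload = request.get_json(force=True)
    # Combine title and text the same way the training script does
    text = f"{payload.get('title', '')} {payload.get('text', '')}"
    features = vectorizer.transform([text])
    prob_fake = float(model.predict_proba(features)[0, 1])  # probability of class 1 = Fake
    return jsonify({"prob_fake": round(prob_fake, 2), "pred_label": int(prob_fake >= 0.5)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)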

Using the API

# Health check
curl http://localhost:5000/health

# Predict fake news probability
curl -X POST http://localhost:5000/predict \
  -H "Content-Type: application/json" \
  -d '{
    "title": "Breaking: Miracle cure discovered!",
    "text": "Click here for 100% guaranteed results!!!"
  }'

# Response:
# {
#   "prob_fake": 0.87,
#   "pred_label": 1
# }
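
The same prediction call from Python, using the requests package:

import requests

resp = requests.post(
    "http://localhost:5000/predict",
    json={
        "title": "Breaking: Miracle cure discovered!",
        "text": "Click here for 100% guaranteed results!!!",
    },
)
print(resp.json())  # e.g. {"prob_fake": 0.87, "pred_label": 1}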

📁 Project Structure

FakeNewsDetection/
├── src/
│   ├── api/
│   │   └── prediction_service.py    # Flask REST API
│   ├── ml/
│   │   └── train_model.py           # Model training script
│   └── streaming/
│       ├── bluesky_producer.py      # Kafka producer (Bluesky → Kafka)
│       └── spark_opensearch_sink.py # Spark consumer (Kafka → OpenSearch)
├── config/
│   └── .env.example                 # Environment configuration template
├── artifacts/                       # Trained model artifacts (generated)
├── requirements.txt                 # Python dependencies
├── Dockerfile                       # Container image for API service
└── README.md

🔧 Configuration

Key environment variables (see config/.env.example):

Variable          | Description               | Default
KAFKA_BOOTSTRAP   | Kafka broker address      | localhost:19092
KAFKA_TOPIC       | Topic for scored posts    | posts_scored
JETSTREAM_URL     | Bluesky firehose endpoint | wss://jetstream1.us-east.bsky.network/subscribe
OPENSEARCH_HOST   | OpenSearch host           | 10.142.0.3
OPENSEARCH_INDEX  | Target index name         | posts_scored
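
These are plain environment variables, so each component can read them with fallbacks to the defaults above. A sketch assuming python-dotenv is used to load the .env file:

import os

from dotenv import load_dotenv   # python-dotenv (assumed)

load_dotenv()  # reads .env from the working directory, if present

KAFKA_BOOTSTRAP = os.getenv("KAFKA_BOOTSTRAP", "localhost:19092")
KAFKA_TOPIC = os.getenv("KAFKA_TOPIC", "posts_scored")
JETSTREAM_URL = os.getenv("JETSTREAM_URL", "wss://jetstream1.us-east.bsky.network/subscribe")
OPENSEARCH_HOST = os.getenv("OPENSEARCH_HOST", "10.142.0.3")
OPENSEARCH_INDEX = os.getenv("OPENSEARCH_INDEX", "posts_scored")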

🧪 Model Details

  • Algorithm: Logistic Regression with TF-IDF features
  • Features: Unigrams + bigrams (max 50,000 features)
  • Labels: Binary classification (0 = Real, 1 = Fake)
  • Dataset: fake_or_real_news.csv with title, text, and label columns

Heuristic Scoring (used in the streaming pipeline; sketched after the list below):

  • Multiple exclamation marks
  • Sensational keywords (e.g., "secreto" [secret], "milagro" [miracle], "100% garantizado" [100% guaranteed])
  • URL presence
  • Very short text length
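
A sketch of how those signals could combine into a score; the weights and threshold here are illustrative only, and the actual rules live in src/streaming/spark_opensearch_sink.py:

import re

SENSATIONAL_KEYWORDS = {"secreto", "milagro", "100% garantizado"}  # from the list above

def heuristic_score(text: str) -> float:
    """Return a rough 0..1 'suspiciousness' score for a post."""
    text_lower = text.lower()
    score = 0.0
    if "!!" in text:                                    # multiple exclamation marks
        score += 0.3
    if any(kw in text_lower for kw in SENSATIONAL_KEYWORDS):
        score += 0.3                                    # sensational keywords
    if re.search(r"https?://", text_lower):             # URL presence
        score += 0.2
    if len(text.strip()) < 40:                          # very short text
        score += 0.2
    return min(score, 1.0)

print(heuristic_score("Milagro!!! 100% garantizado, click http://ejemplo.com"))  # ≈ 0.8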

🐳 Docker Deployment

# Build image
docker build -t fake-news-api .

# Run container
docker run -p 5000:5000 \
  -v $(pwd)/artifacts:/app/artifacts \
  fake-news-api

👥 Authors

  • Valentina Morales Villada
  • Nicolas Betancur Ochoa
  • Sebastian Salazar Osorio

📚 Documentation

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

This project was developed as part of coursework at EAFIT University.


Built with Python • Kafka • Spark • OpenSearch • Flask • scikit-learn
