Real-time fake news detection system using machine learning and streaming data from Bluesky social network.
This project implements an end-to-end pipeline for detecting potentially fake news content in social media posts. It combines:
- Machine Learning: TF-IDF + Logistic Regression classifier trained on labeled news dataset
- Real-time Streaming: Consumes live posts from Bluesky Jetstream firehose
- Distributed Processing: Apache Kafka for messaging, Spark for stream processing
- Search & Analytics: OpenSearch for indexing and analyzing scored posts
- REST API: Flask-based prediction service for inference
Bluesky Jetstream → Kafka → Spark Streaming → OpenSearch
↓
ML Model (REST API)
Data Flow:
- Bluesky Producer subscribes to Jetstream firehose and publishes posts to Kafka
- Spark Consumer reads from Kafka, applies heuristic scoring, and indexes to OpenSearch
- ML API serves trained model for on-demand predictions
- Python 3.11+
- Apache Kafka
- Apache Spark 3.5+
- OpenSearch
- Docker (optional)
# Clone repository
git clone https://github.com/yourusername/FakeNewsDetection.git
cd FakeNewsDetection
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Configure environment
cp config/.env.example .env
# Edit .env with your configuration# Ensure fake_or_real_news.csv is in root directory
python src/ml/train_model.pyThis will create artifacts/model.joblib and artifacts/vectorizer.joblib.
1. Start Bluesky Producer (Kafka)
export KAFKA_BOOTSTRAP=localhost:19092
export TOPIC=posts_scored
python src/streaming/bluesky_producer.py2. Start Spark → OpenSearch Consumer
spark-submit \
--packages org.elasticsearch:elasticsearch-spark-30_2.12:8.11.0 \
src/streaming/spark_opensearch_sink.py3. Start Prediction API
python src/api/prediction_service.pyThe API will be available at http://localhost:5000
# Health check
curl http://localhost:5000/health
# Predict fake news probability
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{
"title": "Breaking: Miracle cure discovered!",
"text": "Click here for 100% guaranteed results!!!"
}'
# Response:
# {
# "prob_fake": 0.87,
# "pred_label": 1
# }FakeNewsDetection/
├── src/
│ ├── api/
│ │ └── prediction_service.py # Flask REST API
│ ├── ml/
│ │ └── train_model.py # Model training script
│ └── streaming/
│ ├── bluesky_producer.py # Kafka producer (Bluesky → Kafka)
│ └── spark_opensearch_sink.py # Spark consumer (Kafka → OpenSearch)
├── config/
│ └── .env.example # Environment configuration template
├── artifacts/ # Trained model artifacts (generated)
├── requirements.txt # Python dependencies
├── Dockerfile # Container image for API service
└── README.md
Key environment variables (see config/.env.example):
| Variable | Description | Default |
|---|---|---|
KAFKA_BOOTSTRAP |
Kafka broker address | localhost:19092 |
KAFKA_TOPIC |
Topic for scored posts | posts_scored |
JETSTREAM_URL |
Bluesky firehose endpoint | wss://jetstream1.us-east.bsky.network/subscribe |
OPENSEARCH_HOST |
OpenSearch host | 10.142.0.3 |
OPENSEARCH_INDEX |
Target index name | posts_scored |
- Algorithm: Logistic Regression with TF-IDF features
- Features: Unigrams + bigrams (max 50,000 features)
- Labels: Binary classification (0 = Real, 1 = Fake)
- Dataset:
fake_or_real_news.csvwith title, text, and label columns
Heuristic Scoring (used in streaming pipeline):
- Multiple exclamation marks
- Sensational keywords (e.g., "secreto", "milagro", "100% garantizado")
- URL presence
- Very short text length
# Build image
docker build -t fake-news-api .
# Run container
docker run -p 5000:5000 \
-v $(pwd)/artifacts:/app/artifacts \
fake-news-api- Valentina Morales Villada
- Nicolas Betancur Ochoa
- Sebastian Salazar Osorio
This project is licensed under the MIT License - see the LICENSE file for details.
This project was developed as part of coursework at EAFIT University.
Built with Python • Kafka • Spark • OpenSearch • Flask • scikit-learn