A real-time web application that scrapes tweets via Selenium, streams them into Kafka, processes them with Apache Spark, and performs sentiment prediction using a pre-trained logistic regression model.
Note: The `sentiment_BERT/` directory contains large model files (≈420 MB) and is not included in this repository. Please download it manually before running the app: https://drive.google.com/drive/folders/1RqGCpUjVUT0-F05LE1pulAeueogXflrS
```
kafka_2.12-3.5.0/               # Kafka distribution
├── bin/                        # Kafka CLI scripts
├── config/
├── libs/
└── ...
static/
└── style.css                   # CSS for the web UI
templates/
└── index.html                  # HTML template
app.py                          # Flask + Spark + Kafka integration
scraper.py                      # Selenium scraper & Kafka producer
sentiment_BERT/                 # Fine-tuned BERT model (download separately)
├── config.json
├── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── training_args.bin
└── vocab.txt
logreg_sentiment140_model.pkl   # Pre-trained sentiment model
README.md                       # This file
```
- Java 8+ (for Spark & Kafka)
- Kafka & Zookeeper
- Python 3.8+ and pip
- Google Chrome (for Selenium)
```bash
pip install flask pyspark kafka-python selenium webdriver-manager joblib
```
Start Zookeeper & Kafka

```bash
# In one terminal
bin/zookeeper-server-start.sh config/zookeeper.properties

# In another
bin/kafka-server-start.sh config/server.properties
```
Delete & Recreate Topic (optional, to purge old data)

```bash
# Delete existing topic
bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --delete --topic tweets

# Recreate with 4 partitions
bin/kafka-topics.sh \
  --bootstrap-server localhost:9092 \
  --create \
  --topic tweets \
  --partitions 4 \
  --replication-factor 1
```
(Optional) View Topic Contents

```bash
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic tweets \
  --from-beginning
```
Run the App

```bash
python app.py
```
- The Flask server will launch on http://127.0.0.1:5000.
- Use the Fetch Tweets button to start the scraper (opens headless Chrome, scrapes tweets, and pushes them to Kafka).
- Stop Fetching Tweets stops the scraper process.
- Start Prediction stops scraping, consumes all tweets from the topic, runs sentiment prediction, and displays the results.
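The prediction step above boils down to loading the pickled model and mapping its numeric labels to readable sentiments. A minimal sketch, assuming the pickle is a fitted scikit-learn pipeline that accepts raw text (the helper names and the 0/1 label convention are illustrative assumptions, not taken from `app.py`):

```python
import joblib

def load_sentiment_model(path="logreg_sentiment140_model.pkl"):
    # Assumption: the pickle holds a fitted scikit-learn estimator/pipeline.
    return joblib.load(path)

def predict_sentiment(model, tweets):
    # Assumption: the model was trained with 1 = positive, 0 = negative
    # (a common remapping of the Sentiment140 0/4 labels).
    return ["positive" if y == 1 else "negative" for y in model.predict(tweets)]
```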
 
`scraper.py`: Uses Selenium to scroll through Twitter search results (x.com) and produces tweet text messages into the Kafka topic `tweets`.

`app.py`:
- Spins up a SparkSession with the Kafka SQL connector.
- Broadcasts a pre-trained logistic regression model for inference.
- Provides Flask endpoints:
  - `/fetch_tweets` → spawns `scraper.py` as a subprocess
  - `/stop_fetch` → terminates the scraper and fetches any remaining tweets
  - `/start_prediction` → reads the entire topic from the `earliest` offset, applies the model via a Spark UDF, and stores predictions
  - `/get_tweets` → returns a JSON list of `{ tweet, prediction }`
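The `/get_tweets` contract the UI depends on can be sketched in a few lines of Flask. This is a simplified stand-in — an in-memory list replaces the Spark/Kafka-backed prediction store — not the actual `app.py` implementation:

```python
from flask import Flask, jsonify

app = Flask(__name__)

# Stand-in for the prediction store that app.py fills after /start_prediction.
results = []  # e.g. [{"tweet": "...", "prediction": "positive"}]

@app.route("/get_tweets")
def get_tweets():
    # The front-end polls this every 2 seconds and renders the rows.
    return jsonify(results)
```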
 
`index.html` + `style.css`: A simple UI with controls and a live table that polls `/get_tweets` every 2 seconds, animates new rows, and color-codes predictions.
- Fetch Tweets → begins scraping & streaming to Kafka; the table updates live.
- Stop Fetching Tweets → stops the scraper, but the table continues polling Kafka for completeness.
- Start Prediction → optionally stops polling, runs batch prediction over all messages, and updates table cells with sentiment.