A streaming ETL pipeline for Real-Time Tweet Collection, Analysis and Reporting

Description

A basic proof-of-concept ETL pipeline that collects real-time tweets from the Twitter API that mention, or were written by, one of six selected world leaders. It assigns a sentiment score to each tweet and posts the best-scoring and worst-scoring tweet of each hour to a Slack channel via a Slack bot.

Tech Stack

Python 3.8 (see the requirements.txt files for libraries)
MongoDB
PostgreSQL
Docker
Twitter API
Slack (Slackbot)

The Pipeline

The pipeline consists of 5 components:

Tweet Collector

A Python script that uses the Tweepy library to listen to Twitter API streams in real time. When a streamed tweet mentions one of the chosen world leaders, it is captured, pre-processed, and stored in MongoDB. Pre-processing keeps only the original tweet text; for retweets, the text of the original tweet is used.
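The pre-processing step can be sketched as below. This is a minimal illustration, not the project's actual code: it assumes tweets arrive as Twitter API v1.1-style status dictionaries (with `retweeted_status` and `extended_tweet` fields), and the names in `LEADERS` and the helper functions are hypothetical placeholders, since the six leaders are not listed here.

```python
# Hypothetical placeholder names; the README does not list the six leaders.
LEADERS = ["leader_one", "leader_two"]


def original_text(status: dict) -> str:
    """Return the untruncated text of the original tweet.

    Assumes a Twitter API v1.1-style status dictionary.
    """
    # A retweet carries the original tweet under "retweeted_status".
    source = status.get("retweeted_status", status)
    # Long tweets keep their untruncated text under "extended_tweet".
    extended = source.get("extended_tweet")
    if extended and "full_text" in extended:
        return extended["full_text"]
    return source.get("full_text") or source.get("text", "")


def mentions_leader(text: str, leaders=LEADERS) -> bool:
    """Case-insensitive check for any tracked name in the text."""
    lowered = text.lower()
    return any(name.lower() in lowered for name in leaders)


def to_document(status: dict):
    """Build the document to store in MongoDB, or None if not relevant."""
    text = original_text(status)
    if not mentions_leader(text):
        return None
    return {"tweet_id": status.get("id_str"), "text": text}
```

In a collector, the returned dictionary would be what gets inserted into the MongoDB collection; tweets for which `to_document` returns `None` are skipped.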

MongoDB

MongoDB is our data lake, where the Tweet Collector dumps the collected tweets. The Tweet Collector writes to it and the ETL Job reads from it.

PostgreSQL

PostgreSQL is our data warehouse, where we store the sentiment-analysed, timestamped tweets. The ETL Job writes to it and the Slackbot reads from it.

ETL Job

A Python script that reads the tweets from MongoDB, performs sentiment analysis on them, and writes them to the PostgreSQL database with a timestamp.
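The transform step of the ETL Job might look roughly like this. The README does not say which sentiment library is used, so `toy_sentiment` is a deliberately simple stand-in (a tiny word-list scorer), and the field names `sentiment` and `created_at` are assumptions for illustration only.

```python
from datetime import datetime, timezone

# Toy word lists: a stand-in for a real sentiment library (assumption).
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def toy_sentiment(text: str) -> float:
    """Score text in [-1, 1]: (positive hits - negative hits) / word count."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if not words:
        return 0.0
    hits = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return hits / len(words)


def transform(doc: dict, scorer=toy_sentiment, now=None) -> dict:
    """Turn a MongoDB tweet document into a timestamped warehouse row."""
    now = now or datetime.now(timezone.utc)
    return {
        "text": doc["text"],
        "sentiment": scorer(doc["text"]),
        "created_at": now,
    }
```

Passing the scorer as a parameter keeps the transform independent of the sentiment library, so a real analyser can be swapped in without touching the rest of the job.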

Slackbot

A Python script that, every hour, reads the highest-scoring and lowest-scoring tweets stored in the PostgreSQL database during that hour and posts them to the selected Slack channel.
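A minimal sketch of the hourly selection and posting logic follows. It assumes rows have already been read from the warehouse as dictionaries with `text`, `sentiment`, and `created_at` keys (hypothetical names), and it posts through a Slack incoming webhook, which is only one possible way to deliver the message; the project's bot may use a different mechanism.

```python
import json
from datetime import datetime, timedelta, timezone
from urllib import request


def pick_extremes(rows, now, window=timedelta(hours=1)):
    """Return (best, worst) rows from the last window, or None if empty."""
    recent = [r for r in rows if now - window <= r["created_at"] <= now]
    if not recent:
        return None
    best = max(recent, key=lambda r: r["sentiment"])
    worst = min(recent, key=lambda r: r["sentiment"])
    return best, worst


def build_message(best, worst) -> str:
    """Format the two tweets as a plain-text Slack message."""
    return (
        f"Best tweet of the hour ({best['sentiment']:+.2f}): {best['text']}\n"
        f"Worst tweet of the hour ({worst['sentiment']:+.2f}): {worst['text']}"
    )


def post_to_slack(webhook_url: str, message: str) -> int:
    """POST the message to a Slack incoming webhook (assumed delivery path)."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = request.Request(
        webhook_url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req) as resp:
        return resp.status
```

A scheduler (or a simple loop that sleeps for an hour) would call `pick_extremes`, skip posting when it returns `None`, and otherwise pass the built message to `post_to_slack`.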

