A data engineering pipeline to extract information from Reddit.
- Pull data from Reddit using PRAW (returned as JSON-backed objects)
- Clean the data and keep only the relevant fields
- Load the cleaned data into a MySQL relational database (a sketch of all three steps follows below)
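A minimal sketch of the original three-step pipeline. The credentials, subreddit, table name, and connection settings are placeholders, not part of the actual project:

```python
import praw
import mysql.connector

# Step 1: extract — pull submissions from Reddit via PRAW
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="reddit-etl-sketch",
)

rows = []
for submission in reddit.subreddit("dataengineering").hot(limit=100):
    # Step 2: transform — keep only the fields we care about
    rows.append((submission.id, submission.title, submission.score,
                 submission.num_comments, submission.created_utc))

# Step 3: load — insert the cleaned rows into a MySQL table
conn = mysql.connector.connect(host="localhost", user="etl",
                               password="etl", database="reddit")
cursor = conn.cursor()
cursor.executemany(
    "INSERT INTO posts (id, title, score, num_comments, created_utc) "
    "VALUES (%s, %s, %s, %s, %s) "
    "ON DUPLICATE KEY UPDATE score = VALUES(score), num_comments = VALUES(num_comments)",
    rows,
)
conn.commit()
cursor.close()
conn.close()
```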
When it comes to big data, this pipeline needs to be redesigned to cope with much larger storage and processing requirements. I therefore propose an improved two-step ETL pipeline: the first step dumps the raw data into a data lake, and the second step processes the lake data using distributed tools such as Apache Beam running on Google Dataflow. The database also needs to be upgraded to a distributed data warehouse such as Amazon Redshift or Google BigQuery.
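A hedged sketch of the second step, assuming the raw JSON records already sit in a data lake bucket; the bucket, project, dataset, and table names are hypothetical:

```python
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def to_row(line):
    """Parse one raw JSON record and keep only the relevant fields."""
    post = json.loads(line)
    return {
        "id": post["id"],
        "title": post["title"],
        "score": post["score"],
        "num_comments": post["num_comments"],
        "created_utc": post["created_utc"],
    }

# Pass --runner=DataflowRunner (plus project/region flags) to run on Google Dataflow
options = PipelineOptions()

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadFromLake" >> beam.io.ReadFromText("gs://my-reddit-lake/raw/*.json")
        | "CleanRecords" >> beam.Map(to_row)
        | "LoadWarehouse" >> beam.io.WriteToBigQuery(
            "my-project:reddit.posts",
            schema="id:STRING, title:STRING, score:INTEGER, "
                   "num_comments:INTEGER, created_utc:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```

The same pipeline code runs locally with the default runner for testing and scales out on Dataflow when pointed at the full data lake.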