dakotacbrown/Reddit-Data-Pipeline

Data Engineering Project - r/place

This is the repository for Dakota Brown's Data Engineering Personal Project, based on Reddit's r/place data.

Place 2017 Place 2022

Prerequisites

  1. Docker (docker-compose is needed as well).
  2. AWS account to set up cloud services.
  3. Install AWS CLI on an EC2 instance.
  4. Configure AWS CLI on an EC2 instance.
  5. Visualization: the data visualization can be found here.
  6. Slide Deck: a slide deck explaining the project can be found here.

Design

ETL Design

Entity Relationship Diagram

Data

Reddit r/place 2017 and 2022 data

Setup and run

If this is your first time using AWS, make sure to check for the IAM roles EMR_EC2_DefaultRole and EMR_DefaultRole.

aws iam list-roles | grep 'EMR_DefaultRole\|EMR_EC2_DefaultRole'
# "RoleName": "EMR_DefaultRole",
# "RoleName": "EMR_EC2_DefaultRole",

If the roles are not present, create them using the following command:

aws emr create-default-roles
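
The check and the create step above can be combined into one helper. This is a minimal sketch, assuming the AWS CLI is already configured; `ensure_emr_roles` is a hypothetical function name that is not part of this repository, and it uses `aws iam get-role` instead of the `grep` shown above so that the exit code can be tested directly.

```shell
#!/usr/bin/env bash
# Sketch: create the EMR default roles only when one of them is missing.
# ensure_emr_roles is a hypothetical helper, not part of the repository.
ensure_emr_roles() {
  local role
  for role in EMR_DefaultRole EMR_EC2_DefaultRole; do
    # `aws iam get-role` exits non-zero when the role does not exist
    # (or when the caller is not allowed to read it).
    if ! aws iam get-role --role-name "$role" >/dev/null 2>&1; then
      echo "Missing $role, creating the default EMR roles"
      aws emr create-default-roles
      return
    fi
  done
  echo "Both EMR default roles are present"
}

# Usage:
#   ensure_emr_roles
```

Because `aws emr create-default-roles` creates both roles, the helper stops at the first missing role instead of checking the second one.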

Create an S3 bucket and load the scripts (located in code) into a folder named scripts. Create raw and transformed folders as well.
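
The bucket setup can also be done from the AWS CLI. The sketch below assumes it is run from the repository root, so that `code/` is the local scripts directory; `setup_bucket` and the bucket name in the usage line are hypothetical placeholders.

```shell
#!/usr/bin/env bash
# Sketch: create the bucket, upload the scripts, and add the two empty folders.
# setup_bucket is a hypothetical helper; pass your own bucket name.
setup_bucket() {
  local bucket="$1"
  # Create the bucket in the region configured for the AWS CLI.
  aws s3 mb "s3://${bucket}"
  # Upload everything under code/ to the scripts/ prefix.
  aws s3 cp code/ "s3://${bucket}/scripts/" --recursive
  # S3 has no real folders; an empty key ending in "/" is displayed
  # as a folder in the S3 console.
  aws s3api put-object --bucket "$bucket" --key raw/
  aws s3api put-object --bucket "$bucket" --key transformed/
}

# Usage (bucket name is a placeholder):
#   setup_bucket my-rplace-bucket
```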

To start Airflow on your EC2 instance:

docker-compose -f docker-compose-LocalExecutor.yml up -d

(You can exchange LocalExecutor for CeleryExecutor as well)

Remove -d to see everything start up and view any errors if needed.
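
If you prefer to keep `-d`, the container state and logs can still be inspected afterwards. This is a sketch only: `COMPOSE_CMD`, `AIRFLOW_COMPOSE_FILE`, `airflow_status`, and `airflow_logs` are hypothetical names introduced here, not something the repository defines.

```shell
#!/usr/bin/env bash
# Assumption: these variable names are made up for this sketch.
COMPOSE_CMD="${COMPOSE_CMD:-docker-compose}"
AIRFLOW_COMPOSE_FILE="${AIRFLOW_COMPOSE_FILE:-docker-compose-LocalExecutor.yml}"

# List the containers of the stack and their current state.
airflow_status() {
  # $COMPOSE_CMD is left unquoted on purpose so that a two-word
  # command such as "docker compose" is split into its parts.
  $COMPOSE_CMD -f "$AIRFLOW_COMPOSE_FILE" ps
}

# Follow the latest log lines of all services (Ctrl+C stops following).
airflow_logs() {
  $COMPOSE_CMD -f "$AIRFLOW_COMPOSE_FILE" logs -f --tail=100
}

# Usage:
#   airflow_status
#   airflow_logs
```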

Go to http://localhost:8080/admin/ and turn on the reddit_dag DAG. You can check its status at http://localhost:8080/admin/airflow/graph?dag_id=reddit_dag.

DAG

In EC2, make sure you can access the port Airflow is bound to. The photo below helped me. Note, however, that you also have to allow public traffic to EMR, or the creation of an EMR instance from EC2 will be blocked.

Airflow fix.
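
One way to open the Airflow port from the command line is to add an inbound rule to the EC2 instance's security group. This is a sketch under assumptions, not necessarily the fix shown in the photo: `open_airflow_port` is a hypothetical helper, the security group ID and CIDR in the usage line are placeholders, and port 8080 is taken from the URLs above.

```shell
#!/usr/bin/env bash
# Sketch: allow inbound TCP traffic to the Airflow webserver port.
# open_airflow_port is a hypothetical helper, not part of the repository.
open_airflow_port() {
  local group_id="$1"   # security group attached to the EC2 instance
  local cidr="$2"       # address range allowed to connect
  local port="${3:-8080}"
  aws ec2 authorize-security-group-ingress \
    --group-id "$group_id" \
    --protocol tcp \
    --port "$port" \
    --cidr "$cidr"
}

# Usage (both values are placeholders):
#   open_airflow_port sg-0123456789abcdef0 203.0.113.10/32
```

Restricting the rule to a single address with a `/32` CIDR, as in the usage line, keeps the Airflow UI from being reachable by everyone.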

Terminate local instance

docker-compose -f docker-compose-LocalExecutor.yml down
