This is the repository for Dakota Brown's Data Engineering Personal Project, based on Reddit's r/place data.
- Docker (docker-compose is required as well).
- AWS account to set up cloud services.
- Install AWS CLI on an EC2 instance.
- Configure AWS CLI on an EC2 instance.
- Visualization: the data visualization can be found here.
- Slide Deck: the slide deck explaining the project can be found here.
Reddit r/place 2017 and 2022 data
If this is your first time using AWS, make sure to check for the IAM roles EMR_EC2_DefaultRole and EMR_DefaultRole.
aws iam list-roles | grep 'EMR_DefaultRole\|EMR_EC2_DefaultRole'
# "RoleName": "EMR_DefaultRole",
# "RoleName": "EMR_EC2_DefaultRole",
If the roles are not present, create them using the following command:
aws emr create-default-roles
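The check and the creation step above can be combined into one small helper. This is only a sketch: the function names `role_exists` and `ensure_emr_default_roles` are hypothetical and not part of this repository, and it assumes the AWS CLI is already configured with credentials that can read IAM roles.

```shell
# Sketch: create the default EMR roles only when one of them is missing.
# role_exists and ensure_emr_default_roles are hypothetical helper names.

# Returns 0 if the named IAM role exists, non-zero otherwise.
role_exists() {
  aws iam get-role --role-name "$1" >/dev/null 2>&1
}

ensure_emr_default_roles() {
  for role in EMR_DefaultRole EMR_EC2_DefaultRole; do
    if ! role_exists "$role"; then
      echo "Missing $role, creating default EMR roles"
      # create-default-roles creates whichever default roles are absent.
      aws emr create-default-roles
      return $?
    fi
  done
  echo "Both default EMR roles already exist"
}
```

After sourcing the snippet, run `ensure_emr_default_roles` once before launching any EMR cluster.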
Create an S3 bucket and load the scripts (located in code) into a folder named scripts. Create raw and transformed folders as well.
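The bucket setup can also be done from the AWS CLI instead of the console. The sketch below assumes it is run from the repository root (so that `code` is the local scripts directory); the function name `setup_bucket` and the bucket name in the usage note are hypothetical.

```shell
# Sketch: create the bucket, upload the scripts, and add the two empty folders.
# setup_bucket is a hypothetical helper; pass your own bucket name as $1.
setup_bucket() {
  bucket="$1"
  # Create the bucket (this fails if the name is already taken).
  aws s3 mb "s3://$bucket" || return 1
  # Upload everything under ./code to the scripts/ prefix.
  aws s3 cp code "s3://$bucket/scripts/" --recursive || return 1
  # S3 has no real folders; a zero-byte key ending in "/" appears as one
  # in the console.
  for prefix in raw transformed; do
    aws s3api put-object --bucket "$bucket" --key "$prefix/" >/dev/null || return 1
  done
}
```

For example, `setup_bucket my-rplace-bucket` (a placeholder name) would leave the bucket with scripts/, raw/, and transformed/ prefixes.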
To start Airflow on your EC2 instance:
docker-compose -f docker-compose-LocalExecutor.yml up -d
(You can swap LocalExecutor for CeleryExecutor as well.)
Remove -d to see everything start up and view any errors.
Go to http://localhost:8080/admin/ and turn on the reddit_dag DAG. You can check the status at http://localhost:8080/admin/airflow/graph?dag_id=reddit_dag.
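The webserver can take a little while to come up after the containers start, so it may help to poll the admin URL before opening it. This is a minimal sketch: `wait_for_airflow` is a hypothetical helper, and the default of 30 attempts with a 5-second delay is an arbitrary assumption.

```shell
# Sketch: poll the Airflow UI until it answers or the attempts run out.
# Arguments (all optional): URL, number of attempts, delay in seconds.
wait_for_airflow() {
  url="${1:-http://localhost:8080/admin/}"
  attempts="${2:-30}"
  delay="${3:-5}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -s silences progress output, -f makes HTTP errors (>= 400) count as failure.
    if curl -sf -o /dev/null "$url"; then
      echo "Airflow is up at $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "Airflow did not respond at $url" >&2
  return 1
}
```

Running `wait_for_airflow` with no arguments checks the local admin page; a non-zero exit status means the UI never responded.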
In EC2, make sure you can access the port Airflow is bound to. The photo below helped me; however, you would have to allow public traffic to EMR, or it would block the creation of an EMR instance from EC2.
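One way to open the Airflow port is an inbound rule on the instance's security group. The sketch below is an assumption about what is needed and may not match the settings shown in the photo; `allow_airflow_port` is a hypothetical helper, and the group ID and CIDR in the usage note are placeholders.

```shell
# Sketch: allow inbound TCP traffic to the Airflow port from one CIDR range.
# Arguments: security group ID, source CIDR, port (defaults to 8080).
allow_airflow_port() {
  group_id="$1"
  cidr="$2"
  port="${3:-8080}"
  aws ec2 authorize-security-group-ingress \
    --group-id "$group_id" \
    --protocol tcp \
    --port "$port" \
    --cidr "$cidr"
}
```

For example, `allow_airflow_port sg-0123456789abcdef0 203.0.113.10/32` would open port 8080 to a single address. Restricting the rule to your own IP is generally safer than opening the port to 0.0.0.0/0.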
To stop Airflow and remove its containers:
docker-compose -f docker-compose-LocalExecutor.yml down