Sparkify, a music streaming startup, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs on user activity in the app, and a directory of JSON metadata on the songs in the app.
As their data engineer, I am tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to. The database and ETL pipeline can be tested by running queries provided by Sparkify's analytics team and comparing the results with their expected output.
- S3 (object storage): Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
- Redshift (cloud data warehouse with columnar storage): Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse that stores data in columnar format and distributes query execution across a cluster of nodes.
The image below shows the architecture used in this project.
The image below shows the data model used in this project: the staging tables and the analytics tables, which follow a star schema.
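
To make the star schema concrete, here is a minimal sketch of what the fact table and one dimension table definition might look like, written as Python strings in the style of the `SQL Queries` script. The table and column names below are assumptions for illustration; the authoritative definitions live in that script.

```python
# Hypothetical star-schema DDL held as Python strings, the way a
# sql_queries-style module usually stores them. Names are assumptions.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,  -- surrogate key for the fact table
    start_time  TIMESTAMP NOT NULL,             -- references the time dimension
    user_id     INT NOT NULL,                   -- references the users dimension
    song_id     VARCHAR,                        -- references the songs dimension
    artist_id   VARCHAR,                        -- references the artists dimension
    level       VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""
```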
Run the scripts below in the described order to run this project.

- Create the Redshift cluster by running the `Create Cluster` script (sketched below).
- Wait for the cluster to be up and healthy. Log into your AWS console to verify.
- Run the `Open TCP Port` script to open a port that allows access to Redshift (sketched below). This script also has a utility that describes cluster details such as the endpoint.
- Create all tables in Redshift, both the analytics tables and the staging tables. The `Create Table` script (sketched below) contains all the necessary functions for creating and dropping tables.
- Load data into the tables by running the `ETL` script (sketched below).
- Go to the Redshift console and open the `Query editor`. Under `Schema`, select `public` and you should see all the tables we just created.
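
As a rough idea of what the `Create Cluster` script might do, here is a minimal boto3 sketch. The region, cluster identifier, node type, and config section/key names are assumptions; the actual script reads its values from `dwh.cfg`.

```python
import configparser
import boto3

# Read credentials and cluster settings; section/key names are assumptions.
config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",                         # assumed region
    aws_access_key_id=config["AWS"]["KEY"],
    aws_secret_access_key=config["AWS"]["SECRET"],
)

# Launch a small multi-node cluster; sizes and names are placeholders.
redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterIdentifier="sparkify-cluster",
    DBName=config["CLUSTER"]["DB_NAME"],
    MasterUsername=config["CLUSTER"]["DB_USER"],
    MasterUserPassword=config["CLUSTER"]["DB_PASSWORD"],
    IamRoles=[config["IAM_ROLE"]["ARN"]],            # role that lets Redshift read S3
)
```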
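The `Open TCP Port` script presumably describes the cluster to get its endpoint and then authorizes ingress on its security group. A sketch under those assumptions (the cluster identifier, region, and use of the cluster's first VPC security group are assumptions):

```python
import boto3

REGION = "us-west-2"                 # assumed region
CLUSTER_ID = "sparkify-cluster"      # assumed identifier

redshift = boto3.client("redshift", region_name=REGION)
ec2 = boto3.resource("ec2", region_name=REGION)

# Describe the cluster to get its endpoint and security group.
props = redshift.describe_clusters(ClusterIdentifier=CLUSTER_ID)["Clusters"][0]
print("Endpoint:", props["Endpoint"]["Address"])

# Open the Redshift port (5439 by default) for incoming TCP traffic.
sg = ec2.SecurityGroup(props["VpcSecurityGroups"][0]["VpcSecurityGroupId"])
sg.authorize_ingress(
    CidrIp="0.0.0.0/0",              # wide open for the exercise; restrict in practice
    IpProtocol="tcp",
    FromPort=5439,
    ToPort=5439,
)
```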
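The `Create Table` script most likely connects to the cluster with psycopg2 and executes the drop/create statements defined in the `SQL Queries` script. A minimal sketch, assuming the statements are exposed as two lists and the connection values live under a `[CLUSTER]` section in `dwh.cfg`:

```python
import configparser
import psycopg2

# Assumed: the SQL Queries script exposes lists of DROP and CREATE statements.
from sql_queries import drop_table_queries, create_table_queries

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# Drop any leftover tables, then create the staging and analytics tables.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```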
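Finally, the `ETL` script stages the raw JSON from S3 with Redshift `COPY` commands and then fills the star schema with `INSERT ... SELECT` statements. The sketch below shows the pattern for one staging table and one dimension table; the S3 paths, IAM role, and column names are assumptions read from the same hypothetical `dwh.cfg` layout.

```python
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# Stage the raw event logs straight from S3 into Redshift.
staging_events_copy = """
    COPY staging_events
    FROM '{}'
    IAM_ROLE '{}'
    JSON '{}'
    REGION 'us-west-2';
""".format(
    config["S3"]["LOG_DATA"],        # assumed key: S3 path of the log JSON files
    config["IAM_ROLE"]["ARN"],       # role that allows Redshift to read the bucket
    config["S3"]["LOG_JSONPATH"],    # assumed key: JSONPaths file for the logs
)
cur.execute(staging_events_copy)
conn.commit()

# Transform staged rows into one of the analytics (dimension) tables.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE userId IS NOT NULL;
"""
cur.execute(user_table_insert)
conn.commit()

conn.close()
```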
- The `SQL Queries` script contains all the statements for creating, dropping, and loading the tables with data.
- Remember to delete the Redshift cluster by running the `Delete Cluster` script (sketched below). Failure to do so will lead to significant charges from AWS.
- A sample configuration file is included. Fill in its details and rename it from `sample_dwh.cfg` to `dwh.cfg` (the layout the sketches above assume is shown below).
- Data sources for this project are provided. See the `Data` section.
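
Since leaving the cluster running is what incurs the charges, the `Delete Cluster` script presumably amounts to a single boto3 call; a sketch (identifier and region are assumptions):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")   # assumed region

# Tear the cluster down without keeping a final snapshot.
redshift.delete_cluster(
    ClusterIdentifier="sparkify-cluster",    # assumed identifier
    SkipFinalClusterSnapshot=True,
)
```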
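For the configuration step, the sketches above all assume `dwh.cfg` can be read with `configparser` and is organized roughly as follows; the real section and key names are whatever `sample_dwh.cfg` defines.

```python
import configparser

# Assumed layout used by the sketches above (not the project's actual file):
#   [AWS]       KEY, SECRET
#   [CLUSTER]   HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT
#   [IAM_ROLE]  ARN
#   [S3]        LOG_DATA, LOG_JSONPATH, SONG_DATA

config = configparser.ConfigParser()
config.read("dwh.cfg")
print({section: dict(config[section]) for section in config.sections()})
```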

