A music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3, in a directory of JSON logs on user activity in the app, as well as a directory with JSON metadata on the songs in the app.
As their data engineer, you are tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to. You'll be able to test the database and ETL pipeline by running queries given to you by the analytics team at Sparkify and comparing your results with their expected results.
In this project, I applied my knowledge of data warehouses and AWS to build an ETL pipeline for a database hosted on Redshift. I loaded the data from S3 into staging tables on Redshift and executed SQL statements that create the analytics tables from these staging tables.
- Schema for the fact and dimension tables (a star schema) is shown below:
- songplays - records in event data associated with song plays, i.e. records with page `NextSong`
  - songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent
- users - users in the app
  - user_id, first_name, last_name, gender, level
- songs - songs in the music database
  - song_id, title, artist_id, year, duration
- artists - artists in the music database
  - artist_id, name, location, latitude, longitude
- time - timestamps of records in songplays broken down into specific units
  - start_time, hour, day, week, month, year, weekday
- Wrote SQL CREATE statements for each of these tables in sql_queries.py (an example is sketched after this list).
- Completed the logic in create_tables.py to connect to the database and create these tables
- Wrote SQL DROP statements at the beginning of create_tables.py to drop the tables if they already exist. This way, I could run create_tables.py whenever I wanted to reset the database and test the ETL pipeline.
- Launched a Redshift cluster and created an IAM role that has read access to S3.
- Added the Redshift database and IAM role info to dwh.cfg.
- Tested by running create_tables.py and checking the table schemas in the Redshift database (the Query Editor in the AWS Redshift console works well for this).
- Implemented the logic in etl.py to load data from S3 to staging tables on Redshift.
- Implemented the logic in etl.py to load data from the staging tables to the analytics tables on Redshift (both steps are sketched below this list).
- Tested by running create_tables.py followed by etl.py, then running the analytic queries on the Redshift database and comparing the results with the expected results.
- Deleted the Redshift cluster when finished.
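For reference, here is a minimal sketch of what one of the CREATE statements in sql_queries.py could look like for the songplays fact table. The column types and the DISTKEY/SORTKEY choices are illustrative assumptions, not necessarily the ones used in the project:

```python
# Hypothetical excerpt in the style of sql_queries.py; the real file defines
# similar strings for all staging, fact and dimension tables.
songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0, 1) PRIMARY KEY,
    start_time  TIMESTAMP NOT NULL SORTKEY,
    user_id     INT       NOT NULL,
    level       VARCHAR,
    song_id     VARCHAR   DISTKEY,
    artist_id   VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""
```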
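Likewise, a hedged sketch of the two ETL steps: a COPY that stages the JSON log files from S3, and an INSERT ... SELECT that fills the songplays table from the staged data. The S3 paths, JSONPaths file, region, and staging column names are placeholders, not the project's actual values:

```python
# Hypothetical sketches in the style of sql_queries.py. The {placeholders}
# would be filled in (e.g. via str.format) with values read from dwh.cfg.
staging_events_copy = """
COPY staging_events
FROM '{log_data_path}'
IAM_ROLE '{iam_role_arn}'
FORMAT AS JSON '{log_jsonpath}'
REGION 'us-west-2';
"""

songplay_table_insert = """
INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                       session_id, location, user_agent)
SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second' AS start_time,
       e.userId, e.level, s.song_id, s.artist_id,
       e.sessionId, e.location, e.userAgent
FROM staging_events e
LEFT JOIN staging_songs s
       ON e.song = s.title AND e.artist = s.artist_name
WHERE e.page = 'NextSong';
"""
```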
The data sources are provided by two S3 buckets. To run the project you need an AWS Redshift cluster up and running, and Python.
- Redshift: a `dc2.large` cluster with 4 nodes was used, costing USD 0.25/h per cluster (on-demand option).
- We used an IAM role as the authorization mechanism; the only policy attached to this role is `AmazonS3ReadOnlyAccess` (a boto3 sketch of this setup follows this list).
- Make sure Redshift has public access and the appropriate VPC security group access.
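For reference, a minimal boto3 sketch of this setup. The role and cluster names, region, and credentials are illustrative assumptions; in the project they would come from dwh.cfg rather than being hard-coded:

```python
import json
import boto3

# Assumed region and names; in practice read these from dwh.cfg.
iam = boto3.client("iam", region_name="us-west-2")
redshift = boto3.client("redshift", region_name="us-west-2")

# IAM role that Redshift can assume, with read-only access to S3.
iam.create_role(
    RoleName="dwhRole",
    AssumeRolePolicyDocument=json.dumps({
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }),
)
iam.attach_role_policy(
    RoleName="dwhRole",
    PolicyArn="arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess",
)
role_arn = iam.get_role(RoleName="dwhRole")["Role"]["Arn"]

# 4-node dc2.large cluster, matching the configuration described above.
redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterIdentifier="dwhCluster",
    DBName="dwh",
    MasterUsername="dwhuser",
    MasterUserPassword="<password>",
    IamRoles=[role_arn],
)
```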
In a terminal, change to the project root folder and run the following commands.
First, create the Redshift tables:
python create_tables.py
Next, insert data into Redshift using the ETL script:
python etl.py
`/` means the item is in the main folder:
- /Data - Folder to store data and ERD images
- README.md - This file
- create_tables.py - Drops old tables (if they exist) and creates/recreates the new tables (a minimal sketch of this flow follows the list)
- etl.py - Inserts the JSON data from S3 into the tables in Redshift
- sql_queries.py - SQL statements, written as Python strings, for DROP, CREATE, COPY and INSERT
- dwh.cfg - Info about Redshift, IAM and S3 (hidden)
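As an illustration of how these pieces fit together, here is a hedged sketch in the spirit of create_tables.py: it reads dwh.cfg, connects with psycopg2, and runs the DROP and CREATE statements from sql_queries.py. The config section and key names and the query-list names are assumptions, not necessarily those used in the project files:

```python
import configparser
import psycopg2

# Assumed to be plain lists of SQL strings defined in sql_queries.py;
# the project file may use different names.
from sql_queries import create_table_queries, drop_table_queries

# Assumed dwh.cfg layout: a [CLUSTER] section with host, db_name, db_user,
# db_password and db_port keys. The real (hidden) file may differ.
config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config.get("CLUSTER", "host"),
    dbname=config.get("CLUSTER", "db_name"),
    user=config.get("CLUSTER", "db_user"),
    password=config.get("CLUSTER", "db_password"),
    port=config.get("CLUSTER", "db_port"),
)
cur = conn.cursor()

# Drop any existing tables, then recreate them from scratch.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```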