Sparkify, a music streaming startup, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs on user activity in the app, and a directory of JSON metadata on the songs in the app.
As their data engineer, I am tasked with building an ETL pipeline that extracts the data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs their users are listening to. The database and ETL pipeline can be tested by running queries provided by Sparkify's analytics team and comparing the results with their expected output.
- S3 (object storage): Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance.
- Redshift (cloud data warehouse with columnar storage): Amazon Redshift is a fully managed, petabyte-scale cloud data warehouse that stores data in columnar format and distributes query execution across a cluster of nodes.
The image below shows the architecture used in this project.
The image below shows the data model used in this project: the staging tables and the analytics tables, which follow a star schema.
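
To make the star schema concrete, here is a minimal sketch of what the fact table and one dimension table definition might look like, written as Python strings in the style of the `SQL Queries` script. The table and column names below are assumptions for illustration; the authoritative definitions live in that script.

```python
# Hypothetical star-schema DDL held as Python strings, the way a
# sql_queries-style module usually stores them. Names are assumptions.

songplay_table_create = """
CREATE TABLE IF NOT EXISTS songplays (
    songplay_id INT IDENTITY(0,1) PRIMARY KEY,  -- surrogate key for the fact table
    start_time  TIMESTAMP NOT NULL,             -- references the time dimension
    user_id     INT NOT NULL,                   -- references the users dimension
    song_id     VARCHAR,                        -- references the songs dimension
    artist_id   VARCHAR,                        -- references the artists dimension
    level       VARCHAR,
    session_id  INT,
    location    VARCHAR,
    user_agent  VARCHAR
);
"""

user_table_create = """
CREATE TABLE IF NOT EXISTS users (
    user_id    INT PRIMARY KEY,
    first_name VARCHAR,
    last_name  VARCHAR,
    gender     VARCHAR,
    level      VARCHAR
);
"""
```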
Run the scripts below in the described order to run this project.

- Create the Redshift cluster by running the `Create Cluster` script (sketched below).
- Wait for the cluster to be up and healthy. Log into your AWS console to verify.
- Run the `Open TCP Port` script to open a port that allows access to Redshift (sketched below). This script also has a utility that describes cluster details such as the endpoint.
- Create all tables in Redshift, both the analytics tables and the staging tables. The `Create Table` script (sketched below) contains all the necessary functions for creating and dropping tables.
- Load data into the tables by running the `ETL` script (sketched below).
- Go to the Redshift console and open the `Query editor`. Under `Schema`, select `public` and you should see all the tables we just created.
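
As a rough idea of what the `Create Cluster` script might do, here is a minimal boto3 sketch. The region, cluster identifier, node type, and config section/key names are assumptions; the actual script reads its values from `dwh.cfg`.

```python
import configparser
import boto3

# Read credentials and cluster settings; section/key names are assumptions.
config = configparser.ConfigParser()
config.read("dwh.cfg")

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",                         # assumed region
    aws_access_key_id=config["AWS"]["KEY"],
    aws_secret_access_key=config["AWS"]["SECRET"],
)

# Launch a small multi-node cluster; sizes and names are placeholders.
redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    ClusterIdentifier="sparkify-cluster",
    DBName=config["CLUSTER"]["DB_NAME"],
    MasterUsername=config["CLUSTER"]["DB_USER"],
    MasterUserPassword=config["CLUSTER"]["DB_PASSWORD"],
    IamRoles=[config["IAM_ROLE"]["ARN"]],            # role that lets Redshift read S3
)
```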
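The `Open TCP Port` script presumably describes the cluster to get its endpoint and then authorizes ingress on its security group. A sketch under those assumptions (the cluster identifier, region, and use of the cluster's first VPC security group are assumptions):

```python
import boto3

REGION = "us-west-2"                 # assumed region
CLUSTER_ID = "sparkify-cluster"      # assumed identifier

redshift = boto3.client("redshift", region_name=REGION)
ec2 = boto3.resource("ec2", region_name=REGION)

# Describe the cluster to get its endpoint and security group.
props = redshift.describe_clusters(ClusterIdentifier=CLUSTER_ID)["Clusters"][0]
print("Endpoint:", props["Endpoint"]["Address"])

# Open the Redshift port (5439 by default) for incoming TCP traffic.
sg = ec2.SecurityGroup(props["VpcSecurityGroups"][0]["VpcSecurityGroupId"])
sg.authorize_ingress(
    CidrIp="0.0.0.0/0",              # wide open for the exercise; restrict in practice
    IpProtocol="tcp",
    FromPort=5439,
    ToPort=5439,
)
```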
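The `Create Table` script most likely connects to the cluster with psycopg2 and executes the drop/create statements defined in the `SQL Queries` script. A minimal sketch, assuming the statements are exposed as two lists and the connection values live under a `[CLUSTER]` section in `dwh.cfg`:

```python
import configparser
import psycopg2

# Assumed: the SQL Queries script exposes lists of DROP and CREATE statements.
from sql_queries import drop_table_queries, create_table_queries

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# Drop any leftover tables, then create the staging and analytics tables.
for query in drop_table_queries + create_table_queries:
    cur.execute(query)
    conn.commit()

conn.close()
```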
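Finally, the `ETL` script stages the raw JSON from S3 with Redshift `COPY` commands and then fills the star schema with `INSERT ... SELECT` statements. The sketch below shows the pattern for one staging table and one dimension table; the S3 paths, IAM role, and column names are assumptions read from the same hypothetical `dwh.cfg` layout.

```python
import configparser
import psycopg2

config = configparser.ConfigParser()
config.read("dwh.cfg")

conn = psycopg2.connect(
    host=config["CLUSTER"]["HOST"],
    dbname=config["CLUSTER"]["DB_NAME"],
    user=config["CLUSTER"]["DB_USER"],
    password=config["CLUSTER"]["DB_PASSWORD"],
    port=config["CLUSTER"]["DB_PORT"],
)
cur = conn.cursor()

# Stage the raw event logs straight from S3 into Redshift.
staging_events_copy = """
    COPY staging_events
    FROM '{}'
    IAM_ROLE '{}'
    JSON '{}'
    REGION 'us-west-2';
""".format(
    config["S3"]["LOG_DATA"],        # assumed key: S3 path of the log JSON files
    config["IAM_ROLE"]["ARN"],       # role that allows Redshift to read the bucket
    config["S3"]["LOG_JSONPATH"],    # assumed key: JSONPaths file for the logs
)
cur.execute(staging_events_copy)
conn.commit()

# Transform staged rows into one of the analytics (dimension) tables.
user_table_insert = """
    INSERT INTO users (user_id, first_name, last_name, gender, level)
    SELECT DISTINCT userId, firstName, lastName, gender, level
    FROM staging_events
    WHERE userId IS NOT NULL;
"""
cur.execute(user_table_insert)
conn.commit()

conn.close()
```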
- The `SQL Queries` script contains all the statements for creating, dropping, and loading the tables with data.
- Remember to delete the Redshift cluster by running the `Delete Cluster` script (sketched below). Failure to do so will lead to significant charges from AWS.
- A sample configuration file is included. Fill in its details and rename it from `sample_dwh.cfg` to `dwh.cfg` (the layout the sketches above assume is shown below).
- Data sources for this project are provided. See the `Data` section.
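
Since leaving the cluster running is what incurs the charges, the `Delete Cluster` script presumably amounts to a single boto3 call; a sketch (identifier and region are assumptions):

```python
import boto3

redshift = boto3.client("redshift", region_name="us-west-2")   # assumed region

# Tear the cluster down without keeping a final snapshot.
redshift.delete_cluster(
    ClusterIdentifier="sparkify-cluster",    # assumed identifier
    SkipFinalClusterSnapshot=True,
)
```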
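For the configuration step, the sketches above all assume `dwh.cfg` can be read with `configparser` and is organized roughly as follows; the real section and key names are whatever `sample_dwh.cfg` defines.

```python
import configparser

# Assumed layout used by the sketches above (not the project's actual file):
#   [AWS]       KEY, SECRET
#   [CLUSTER]   HOST, DB_NAME, DB_USER, DB_PASSWORD, DB_PORT
#   [IAM_ROLE]  ARN
#   [S3]        LOG_DATA, LOG_JSONPATH, SONG_DATA

config = configparser.ConfigParser()
config.read("dwh.cfg")
print({section: dict(config[section]) for section in config.sections()})
```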

