
Data Warehouse On AWS V1.0

Background

A startup called Sparkify wants to analyze the data they've been collecting on songs and user activity on their new music streaming app. The analytics team is particularly interested in understanding what songs users are listening to. They decide to implement a data warehouse on AWS cloud.

Schema

The data resides in S3. The ETL pipeline extracts the data from S3, stages it in Redshift, and then transforms and loads it into a star schema optimized for queries on song play analysis. The schema includes the following tables.

Staging Tables

  1. StagEvents - loads data directly from the log data folder using the COPY clause
    • artist, auth, firstName, gender, itemInSession, lastName, length, level,
    • location, method, page, registration, sessionId, song, status, ts, userAgent, userId
  2. StagSongs - loads data directly from the song data folder using the COPY clause
    • num_songs, artist_id, artist_latitude, artist_longitude, artist_location,
    • artist_name, song_id, title, duration, year
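As a sketch of how the two staging loads might be written (the authoritative statements live in sql_queries.py; the IAM role ARN and region below are illustrative placeholders, not values from this repository):

```python
# Illustrative COPY statements for the two staging tables.
# DWH_ROLE_ARN is a placeholder; in the project it comes from dwh.cfg.
DWH_ROLE_ARN = "arn:aws:iam::123456789012:role/myRedshiftRole"

staging_events_copy = f"""
    COPY StagEvents
    FROM 's3://udacity-dend/log_data'
    IAM_ROLE '{DWH_ROLE_ARN}'
    REGION 'us-west-2'
    JSON 's3://udacity-dend/log_json_path.json';
"""

staging_songs_copy = f"""
    COPY StagSongs
    FROM 's3://udacity-dend/song_data'
    IAM_ROLE '{DWH_ROLE_ARN}'
    REGION 'us-west-2'
    JSON 'auto';
"""
```

The events load needs the JSONPaths file because the log fields do not map automatically onto the table columns, while the song files can use `JSON 'auto'`.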

Fact Table

  1. songplays - records in log data associated with song plays, i.e. records with page NextSong
    • songplay_id, start_time, user_id, level, song_id,
    • artist_id, session_id, location, user_agent

Dimension Tables

  1. users - users in the app
    • user_id, first_name, last_name, gender, level
  2. songs - songs in music database
    • song_id, title, artist_id, year, duration
  3. artists - artists in music database
    • artist_id, name, location, latitude, longitude
  4. time - timestamps of records in songplays broken down into specific units
    • start_time, hour, day, week, month, year, weekday
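A minimal sketch of how two of these tables might be declared, in the SQL-in-Python style of sql_queries.py (the column types here are assumptions; the authoritative DDL is in the repository):

```python
# Illustrative DDL for the fact table and one dimension table.
songplay_table_create = """
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id INT IDENTITY(0,1) PRIMARY KEY,
        start_time  TIMESTAMP NOT NULL,
        user_id     INT NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  INT,
        location    VARCHAR,
        user_agent  VARCHAR
    );
"""

user_table_create = """
    CREATE TABLE IF NOT EXISTS users (
        user_id    INT PRIMARY KEY,
        first_name VARCHAR,
        last_name  VARCHAR,
        gender     VARCHAR,
        level      VARCHAR
    );
"""
```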

Datasets

Song Dataset

The song dataset is located in the S3 bucket s3://udacity-dend/song_data. It is a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset.

song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json

And below is an example of what a single song file, TRAABJL12903CDCF1A.json, looks like:

{"num_songs": 1, "artist_id": "ARJIE2Y1187B994AB7", "artist_latitude": null, "artist_longitude": null, "artist_location": "", "artist_name": "Line Renaud", "song_id": "SOUPIRU12A6D4FA1E1", "title": "Der Kleine Dompfaff", "duration": 152.92036, "year": 0}
Log Dataset

The log dataset is located in the S3 bucket s3://udacity-dend/log_data. It consists of log files in JSON format generated by an event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.

The log files in the dataset you'll be working with are partitioned by year and month. For example, here are filepaths to two files in this dataset.

log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json

A sample of the data in a log file, 2018-11-12-events.json, was shown here as an image in the original README; each line of a log file is a single JSON event record with the StagEvents fields listed above.

Because each log file contains a list of JSON records, a JSONPaths file representing the JSON schema is provided in the S3 bucket s3://udacity-dend/log_json_path.json so that the COPY clause can map fields to columns correctly.
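A JSONPaths file lists one path expression per staging-table column, in column order. As a sketch of its shape (the actual file at s3://udacity-dend/log_json_path.json covers all of the StagEvents fields; only the first few are shown here as an illustration):

```json
{
    "jsonpaths": [
        "$['artist']",
        "$['auth']",
        "$['firstName']",
        "$['gender']"
    ]
}
```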

ETL Process

The database is implemented with PostgreSQL-compatible SQL on Redshift, and the data processing is done in Python. Run create_tables.py to create all the tables, then load them as follows.

  1. StagEvents and StagSongs - StagSongs is loaded directly from the song_data bucket with the COPY clause. StagEvents is loaded directly from the log_data bucket; since each log file contains a list of JSON records, a JSONPaths file is provided in a separate bucket.
  2. DimSongs and DimArtists - use the INSERT INTO target_table SELECT FROM source_table method to load data from the StagSongs table.
  3. DimTime - use the EXTRACT method to extract hour, day and other units from the ts column in the StagEvents table.
  4. DimUsers - use the INSERT INTO target_table SELECT FROM source_table method to load data from the StagEvents table.
  5. FactSongPlays - the log data only contains song name, artist name and song length. To get song_id and artist_id, the StagEvents table is left joined with the StagSongs table on song name, artist name and song length, and the results are inserted into the FactSongPlays table.
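The fact-table load and the time-dimension extraction described above can be sketched as follows (a hedged illustration, not the repository's exact queries; the epoch-to-timestamp conversion is a common Redshift idiom for millisecond `ts` values):

```python
# Illustrative INSERT ... SELECT for the fact table: left join staging
# events to staging songs on song title, artist name and duration,
# keeping only 'NextSong' page events.
songplay_table_insert = """
    INSERT INTO songplays (start_time, user_id, level, song_id,
                           artist_id, session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId,
           e.level,
           s.song_id,
           s.artist_id,
           e.sessionId,
           e.location,
           e.userAgent
    FROM StagEvents e
    LEFT JOIN StagSongs s
           ON e.song = s.title
          AND e.artist = s.artist_name
          AND e.length = s.duration
    WHERE e.page = 'NextSong';
"""

# The time dimension extracts calendar units from the converted timestamp.
time_table_insert = """
    INSERT INTO time (start_time, hour, day, week, month, year, weekday)
    SELECT DISTINCT start_time,
           EXTRACT(hour    FROM start_time),
           EXTRACT(day     FROM start_time),
           EXTRACT(week    FROM start_time),
           EXTRACT(month   FROM start_time),
           EXTRACT(year    FROM start_time),
           EXTRACT(weekday FROM start_time)
    FROM songplays;
"""
```

The LEFT JOIN keeps every NextSong event even when no matching song record exists, in which case song_id and artist_id are NULL.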

Files in repository

  1. dwh.cfg - contains all the parameters necessary for the cluster. Make sure you fill in your own Redshift cluster for HOST and IAM role for ARN.
  2. sql_queries.py - contains all the SQL queries to create, insert into and drop all the tables; it is imported by create_tables.py and etl.py.
  3. create_tables.py - contains the code to create all the tables; run it with python create_tables.py in the project folder.
  4. etl.py - contains all the code for the ETL process; run it with python etl.py in the project folder.
  5. test.ipynb - can be run in Jupyter Notebook to query all the generated tables.

Further plan

Implement a new design to incrementally load data from S3 to Redshift.
