Data modeling for a Spotify-like music database using PostgreSQL within a Python wrapper.
This project builds an ETL pipeline that makes song data available to the analytics team at the startup Sparkify so they can understand what songs users are listening to. Currently, they have no easy way to query their data, which resides in a directory of JSON logs of user activity on the app and a directory of JSON metadata on the songs in their app. The project creates a Postgres database with tables designed to optimize queries on song play analysis.
- Song Dataset - Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
Below is an example of a single song file:
```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```
- Log Dataset - This dataset consists of simulated activity logs from a music streaming app, generated from specified configurations.
The log files in this dataset are partitioned by year and month. For example, here are filepaths to two files in this dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
To read one of these JSON files into a pandas DataFrame, use a call such as:

```python
df = pd.read_json('data/log_data/2018/11/2018-11-01-events.json', lines=True)
```
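Once a log file is loaded, only the song-play events matter for the `songplays` fact table. A minimal sketch of that filter, using a tiny in-memory sample instead of a real log file (the sample rows and values here are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative stand-in for a loaded log file; real files have many more columns.
df = pd.DataFrame([
    {"page": "NextSong", "ts": 1541105830796},
    {"page": "Home", "ts": 1541106106796},
])

# Keep only song-play events, i.e. records with page == 'NextSong'.
df = df[df["page"] == "NextSong"]
```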
This project uses a star schema: one fact table, songplays, contains data on song play events and references four dimension tables.
- Why use a star schema for this project?
- Complexity of data - the data is not complex or large
- Joins - necessary for the requested queries
- Structured data - data is in predictable JSON file format
- songplays - records in log data associated with song plays, i.e. records with page `NextSong`
- songplay_id SERIAL PRIMARY KEY: ID of each user's song play
- start_time TIME: timestamp of start of user's activity
- user_id VARCHAR NOT NULL: user ID
- level VARCHAR: user level, FREE or PAID
- song_id VARCHAR: song ID
- artist_id VARCHAR: artist ID
- session_id VARCHAR: session ID
- location VARCHAR: user location
- user_agent VARCHAR: agent used by user to access Sparkify
- users - the users in the app
- user_id PRIMARY KEY NOT NULL: user ID
- first_name VARCHAR: user first name
- last_name VARCHAR: user last name
- gender VARCHAR: user gender
- level VARCHAR: user level, FREE or PAID
- songs - songs in music database
- song_id VARCHAR PRIMARY KEY NOT NULL: song ID
- title VARCHAR NOT NULL: song title
- artist_id VARCHAR: artist ID
- year INT: year of song release
- duration FLOAT: duration of the song in seconds
- artists - artists in music database
- artist_id VARCHAR PRIMARY KEY NOT NULL: artist ID
- name VARCHAR: artist name
- location: artist location
- latitude: location latitude
- longitude: location longitude
- time - timestamps of records in songplays broken down into specific units
- start_time TIME PRIMARY KEY NOT NULL: start time
- hour NUMERIC: hour
- day NUMERIC: day
- week NUMERIC: week
- month VARCHAR: month
- year INT: year
- weekday INT: day of the week
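The column definitions above translate into `CREATE TABLE` statements kept as Python strings in `sql_queries.py`. As a sketch, the fact table might be defined like this (the column names and types below follow the schema description above; the exact statement lives in `sql_queries.py` and may differ):

```python
# Sketch of the songplays DDL, mirroring the columns listed above.
# Not guaranteed to match the project's sql_queries.py exactly.
songplay_table_create = ("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIME,
        user_id     VARCHAR NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  VARCHAR,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")
```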
- `data` folder - contains the song and log datasets
- `sql_queries.py` - contains all SQL queries, and is imported into `create_tables.py`, `etl.ipynb`, and `etl.py`
- `create_tables.py` - drops and creates your tables. Run this file to reset your tables before each time you run your ETL scripts.
- `test.ipynb` - displays the first few rows of each table to let you check your database
- `etl.ipynb` - reads and processes a single file from `song_data` and `log_data` and loads the data into your tables. This notebook contains detailed instructions on the ETL process for each of the tables.
- `etl.py` - reads and processes files from `song_data` and `log_data` and loads them into your tables. You can fill this out based on your work in the ETL notebook.
- `README.md` - provides discussion on the project.
- Wrote `DROP` and `CREATE` table statements in `sql_queries.py`.
- Ran `!python create_tables.py` in the console to create the database and tables.
- Ran `test.ipynb` to confirm the creation of the tables with the correct columns. Note: make sure to click "Restart kernel" to close the connection to the database after running this notebook.
- Followed the instructions in `etl.ipynb` to develop the ETL process for each table before completing the `etl.py` file to load the whole datasets.
- Transferred these steps into `etl.py` and ran `!python etl.py` in the console.
- Verified the data was inserted into the tables with `test.ipynb`.
1. Connect to the Sparkify database.
2. Process song data: use the `get_files` function to get a list of all song JSON files in `data/song_data`.
   - 2a. Extract data for the `songs` table and `artists` table from each song file.
3. Process log data: use the `get_files` function to get a list of all log JSON files in `data/log_data`.
   - 3a. Extract data for the `time` table, `users` table, and `songplays` table from each log file. Note: the `time` data is in milliseconds. Since the log file does not specify an ID for either the song or the artist, get the song ID and artist ID by querying the `songs` and `artists` tables to find matches based on song title, artist name, and song duration.
4. Insert the extracted data using `etl.py`:
   - 4a. `process_song_file(cur, filepath)`: extracts and inserts the song files.
   - 4b. `process_log_file(cur, filepath)`: extracts and inserts the log files.
   - 4c. `process_data(cur, conn, filepath, func)`: walks `filepath` and applies 4a or 4b to each file found.
   - 4d. `main()`: calls the `process_data` function with `process_song_file()` and `process_log_file()`.
5. Close the connection to the Sparkify database.
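The directory walk described in step 4c can be sketched as follows. This is a sketch assuming the usual `os.walk`/`glob` approach; the actual implementation lives in `etl.py`:

```python
import glob
import os

def process_data(cur, conn, filepath, func):
    """Collect every .json file under filepath and apply func to each one."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        # Match only JSON data files in the current directory.
        for f in glob.glob(os.path.join(root, '*.json')):
            all_files.append(os.path.abspath(f))
    print(f'{len(all_files)} files found in {filepath}')

    # Process each file, committing after every one so partial
    # progress survives a mid-run failure.
    for i, datafile in enumerate(all_files, 1):
        func(cur, datafile)
        conn.commit()
        print(f'{i}/{len(all_files)} files processed.')
```

`func` is either `process_song_file` or `process_log_file`, so the same walker serves both datasets.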
- Conflicting values for `user_id` and `artist_id` when inserting data.
  Solution: add `ON CONFLICT (artist_id) DO NOTHING` to the insert statement. For example:

```python
artist_table_insert = ("""
    INSERT INTO artists (artist_id, name, location, latitude, longitude)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (artist_id) DO NOTHING;
""")
```

- Converting the `ts` timestamp column to datetime. Solution found here: https://datascience.stackexchange.com/questions/14645/convert-a-pandas-column-of-int-to-timestamp-datatype
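For the second issue, the conversion from millisecond `ts` values to datetimes follows the linked answer. A minimal sketch, assuming pandas (the sample value is an illustrative epoch-milliseconds timestamp, not taken from the dataset):

```python
import pandas as pd

# ts values in the log files are Unix epoch milliseconds.
ts_ms = pd.Series([1541105830796])

# unit='ms' tells pandas the integers are milliseconds, not nanoseconds.
t = pd.to_datetime(ts_ms, unit='ms')
```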