Data modeling for a Spotify-like music database using PostgreSQL within a Python wrapper.
This project builds an ETL pipeline that makes song data available to the analytics team at the startup Sparkify so they can understand what songs users are listening to. Currently, they have no easy way to query their data, which resides in a directory of JSON logs of user activity on the app and a directory of JSON metadata on the songs in their app. The project creates a Postgres database with tables designed to optimize queries on song play analysis.
- Song Dataset - Each file is in JSON format and contains metadata about a song and the artist of that song. The files are partitioned by the first three letters of each song's track ID. For example, here are filepaths to two files in this dataset:
song_data/A/B/C/TRABCEI128F424C983.json
song_data/A/A/B/TRAABJL12903CDCF1A.json
Below is an example of a single song file:
```json
{
    "num_songs": 1,
    "artist_id": "ARJIE2Y1187B994AB7",
    "artist_latitude": null,
    "artist_longitude": null,
    "artist_location": "",
    "artist_name": "Line Renaud",
    "song_id": "SOUPIRU12A6D4FA1E1",
    "title": "Der Kleine Dompfaff",
    "duration": 152.92036,
    "year": 0
}
```
- Log Dataset - This dataset consists of simulated activity logs from a music streaming app, generated from specified configurations.
The log files in this dataset are partitioned by year and month. For example, here are filepaths to two files in this dataset:
log_data/2018/11/2018-11-12-events.json
log_data/2018/11/2018-11-13-events.json
To read one of these JSON files into a pandas DataFrame, use a call such as:

```python
df = pd.read_json('data/log_data/2018/11/2018-11-01-events.json', lines=True)
```
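Once a log file is loaded, only the song-play events matter for the `songplays` fact table. A minimal sketch of that filter, using a tiny in-memory sample instead of a real log file (the sample rows and values here are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative stand-in for a loaded log file; real files have many more columns.
df = pd.DataFrame([
    {"page": "NextSong", "ts": 1541105830796},
    {"page": "Home", "ts": 1541106106796},
])

# Keep only song-play events, i.e. records with page == 'NextSong'.
df = df[df["page"] == "NextSong"]
```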
This project uses a star schema: one fact table, songplays, contains data on song play events and references four dimension tables.
- Why use a star schema for this project?
- Complexity of data - the data is not complex or large
- Joins - necessary for the requested queries
- Structured data - data is in predictable JSON file format
- songplays - records in log data associated with song plays, i.e. records with page `NextSong`
- songplay_id SERIAL PRIMARY KEY: ID of each user's song play
- start_time TIME: timestamp of start of user's activity
- user_id VARCHAR NOT NULL: user ID
- level VARCHAR: user level, FREE or PAID
- song_id VARCHAR: song ID
- artist_id VARCHAR: artist ID
- session_id VARCHAR: session ID
- location VARCHAR: user location
- user_agent VARCHAR: agent used by user to access Sparkify
- users - the users in the app
- user_id PRIMARY KEY NOT NULL: user ID
- first_name VARCHAR: user first name
- last_name VARCHAR: user last name
- gender VARCHAR: user gender
- level VARCHAR: user level, FREE or PAID
- songs - songs in music database
- song_id VARCHAR PRIMARY KEY NOT NULL: song ID
- title VARCHAR NOT NULL: song title
- artist_id VARCHAR: artist ID
- year INT: year of song release
- duration FLOAT: duration of the song in seconds
- artists - artists in music database
- artist_id VARCHAR PRIMARY KEY NOT NULL: artist ID
- name VARCHAR: artist name
- location: artist location
- latitude: location latitude
- longitude: location longitude
- time - timestamps of records in songplays broken down into specific units
- start_time TIME PRIMARY KEY NOT NULL: start time
- hour NUMERIC: hour
- day NUMERIC: day
- week NUMERIC: week
- month VARCHAR: month
- year INT: year
- weekday INT: day of the week
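The column definitions above translate into `CREATE TABLE` statements kept as Python strings in `sql_queries.py`. As a sketch, the fact table might be defined like this (the column names and types below follow the schema description above; the exact statement lives in `sql_queries.py` and may differ):

```python
# Sketch of the songplays DDL, mirroring the columns listed above.
# Not guaranteed to match the project's sql_queries.py exactly.
songplay_table_create = ("""
    CREATE TABLE IF NOT EXISTS songplays (
        songplay_id SERIAL PRIMARY KEY,
        start_time  TIME,
        user_id     VARCHAR NOT NULL,
        level       VARCHAR,
        song_id     VARCHAR,
        artist_id   VARCHAR,
        session_id  VARCHAR,
        location    VARCHAR,
        user_agent  VARCHAR
    );
""")
```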
- `data` folder - contains the song and log datasets
- `sql_queries.py` - contains all SQL queries, and is imported into `create_tables.py`, `etl.ipynb`, and `etl.py`
- `create_tables.py` - drops and creates your tables. Run this file to reset your tables before each time you run your ETL scripts.
- `test.ipynb` - displays the first few rows of each table to let you check your database
- `etl.ipynb` - reads and processes a single file from `song_data` and `log_data` and loads the data into your tables. This notebook contains detailed instructions on the ETL process for each of the tables.
- `etl.py` - reads and processes files from `song_data` and `log_data` and loads them into your tables. You can fill this out based on your work in the ETL notebook.
- `README.md` - provides discussion on the project.
- Wrote `DROP` and `CREATE` table statements in `sql_queries.py`.
- Ran `!python create_tables.py` in the console to create the database and tables.
- Ran `test.ipynb` to confirm the creation of the tables with the correct columns. Note: make sure to click "Restart kernel" to close the connection to the database after running this notebook.
- Followed the instructions in `etl.ipynb` to develop the ETL process for each table before completing the `etl.py` file to load the whole datasets.
- Transferred these steps into `etl.py` and ran `!python etl.py` in the console.
- Verified the data was inserted into the tables with `test.ipynb`.
1. Connect to the Sparkify database.
2. Process song data: use the `get_files` function to get a list of all song JSON files in `data/song_data`.
   - 2a. Extract data for the `songs` table and `artists` table from each song file.
3. Process log data: use the `get_files` function to get a list of all log JSON files in `data/log_data`.
   - 3a. Extract data for the `time` table, `users` table, and `songplays` table from each log file. Note: the `time` data is in milliseconds. Since the log file does not specify an ID for either the song or the artist, get the song ID and artist ID by querying the `songs` and `artists` tables to find matches based on song title, artist name, and song duration.
4. Insert the extracted data using `etl.py`:
   - 4a. `process_song_file(cur, filepath)`: extracts and inserts the song files.
   - 4b. `process_log_file(cur, filepath)`: extracts and inserts the log files.
   - 4c. `process_data(cur, conn, filepath, func)`: walks `filepath` and applies 4a or 4b to each file found.
   - 4d. `main()`: calls the `process_data` function with `process_song_file()` and `process_log_file()`.
5. Close the connection to the Sparkify database.
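The directory walk described in step 4c can be sketched as follows. This is a sketch assuming the usual `os.walk`/`glob` approach; the actual implementation lives in `etl.py`:

```python
import glob
import os

def process_data(cur, conn, filepath, func):
    """Collect every .json file under filepath and apply func to each one."""
    all_files = []
    for root, dirs, files in os.walk(filepath):
        # Match only JSON data files in the current directory.
        for f in glob.glob(os.path.join(root, '*.json')):
            all_files.append(os.path.abspath(f))
    print(f'{len(all_files)} files found in {filepath}')

    # Process each file, committing after every one so partial
    # progress survives a mid-run failure.
    for i, datafile in enumerate(all_files, 1):
        func(cur, datafile)
        conn.commit()
        print(f'{i}/{len(all_files)} files processed.')
```

`func` is either `process_song_file` or `process_log_file`, so the same walker serves both datasets.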
- Conflicting values for `user_id` and `artist_id` when inserting data.
  Solution: add `ON CONFLICT (artist_id) DO NOTHING` to the insert statement. For example:

```python
artist_table_insert = ("""
    INSERT INTO artists (artist_id, name, location, latitude, longitude)
    VALUES (%s, %s, %s, %s, %s)
    ON CONFLICT (artist_id) DO NOTHING;
""")
```

- Converting the `ts` timestamp column to datetime. Solution found here: https://datascience.stackexchange.com/questions/14645/convert-a-pandas-column-of-int-to-timestamp-datatype
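For the second issue, the conversion from millisecond `ts` values to datetimes follows the linked answer. A minimal sketch, assuming pandas (the sample value is an illustrative epoch-milliseconds timestamp, not taken from the dataset):

```python
import pandas as pd

# ts values in the log files are Unix epoch milliseconds.
ts_ms = pd.Series([1541105830796])

# unit='ms' tells pandas the integers are milliseconds, not nanoseconds.
t = pd.to_datetime(ts_ms, unit='ms')
```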