Showing 4 changed files with 360 additions and 83 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,63 @@ | ||
# DE-data-modelling-postgres | ||
# Data Modelling with Postgres | ||
|
||
|
||
## Overview | ||
|
||
This project builds an ETL pipeline that creates and populates a database for the music streaming app Sparkify by fetching and processing its song and log data. | ||
The Postgres database has tables designed to optimize queries for song play analysis. | ||
It uses a star schema: one fact table, `songplays`, and four supporting dimension tables. | ||
|
||
## Structure | ||
|
||
The project contains the following elements: | ||
* `create_tables.py` connects to Postgres and creates the database and tables | ||
* `sql_queries.py` contains the SQL queries for creating tables and inserting records | ||
* `etl.ipynb` and `test.ipynb` test the ETL pipeline | ||
* `etl.py` defines the ETL pipeline | ||
* `data/` contains song and log JSON files | ||
|
||
## Schema | ||
|
||
### Fact Table | ||
- **songplays** - records in log data associated with song plays, i.e. records with page `NextSong` | ||
> songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent | ||
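The `NextSong` filter described above can be sketched as follows. This is only an illustration, assuming each log file holds one JSON object per line with a `page` field; the helper name `filter_song_plays`, the other field names, and the sample records are hypothetical and not taken from `etl.py`.

```python
import json


def filter_song_plays(lines):
    """Return the parsed log records whose page is 'NextSong'."""
    # Skip blank lines, parse each remaining line as one JSON object.
    records = (json.loads(line) for line in lines if line.strip())
    return [rec for rec in records if rec.get("page") == "NextSong"]


# Hypothetical sample log lines; field names other than "page" are assumptions.
sample_lines = [
    '{"page": "Home", "user_id": 69}',
    '{"page": "NextSong", "user_id": 69, "song": "Example Song"}',
    '{"page": "NextSong", "user_id": 8, "song": "Another Song"}',
]

song_plays = filter_song_plays(sample_lines)
print(len(song_plays))
```

Only the filtered records would then be candidates for rows in `songplays`.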
### Dimension Tables | ||
- **users** - users in the app | ||
> user_id, first_name, last_name, gender, level | ||
- **songs** - songs in music database | ||
> song_id, title, artist_id, year, duration | ||
- **artists** - artists in music database | ||
> artist_id, name, location, latitude, longitude | ||
- **time** - timestamps of records in songplays broken down into specific units | ||
> start_time, hour, day, week, month, year, weekday | ||
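One way to derive the `time` columns from a single timestamp is sketched below. It assumes the log timestamps are epoch milliseconds interpreted in UTC, which the README does not state; the function name `expand_timestamp` is hypothetical, and the choice of ISO week numbers and Monday = 0 weekdays is also an assumption.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def expand_timestamp(ts_ms):
    """Split an epoch-milliseconds timestamp into the columns of the time table."""
    # Integer millisecond arithmetic avoids float rounding in the conversion.
    start_time = EPOCH + timedelta(milliseconds=ts_ms)
    return {
        "start_time": start_time,
        "hour": start_time.hour,
        "day": start_time.day,
        "week": start_time.isocalendar()[1],  # ISO 8601 week number
        "month": start_time.month,
        "year": start_time.year,
        "weekday": start_time.weekday(),  # Monday = 0, Sunday = 6
    }


# Made-up example value, not taken from the project's data.
row = expand_timestamp(1541903636796)
print(row)
```

The dictionary keys match the column list above, so each result maps directly onto one row of `time`.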
## Instructions | ||
|
||
Open a terminal and run the following commands: | ||
|
||
``` | ||
python3 create_tables.py | ||
python3 etl.py | ||
``` | ||
|
||
|
||
## Query Example | ||
|
||
Test with the following queries: | ||
|
||
``` | ||
# Connect to database | ||
%load_ext sql | ||
%sql postgresql://student:student@127.0.0.1/sparkifydb | ||
# First 5 songs | ||
%sql SELECT * FROM songs LIMIT 5; | ||
# Count the number of artists | ||
%sql SELECT COUNT(*) FROM artists; | ||
# Count the number of song plays by user with id = 69 | ||
%sql SELECT COUNT(*) FROM songplays WHERE user_id = 69; | ||
``` |
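The same checks can be run from a plain Python script instead of a notebook. The sketch below assumes the `psycopg2` driver is installed and that the database is reachable at the connection string used above; `QUERIES`, `run_checks`, and `connect_sparkify` are hypothetical names, not part of the project.

```python
# The three example queries from the README, keyed by a descriptive name.
QUERIES = {
    "first_songs": "SELECT * FROM songs LIMIT 5;",
    "artist_count": "SELECT COUNT(*) FROM artists;",
    "plays_by_user_69": "SELECT COUNT(*) FROM songplays WHERE user_id = 69;",
}


def run_checks(conn):
    """Run each query on a DB-API connection and return the fetched rows by name."""
    results = {}
    cur = conn.cursor()
    try:
        for name, sql in QUERIES.items():
            cur.execute(sql)
            results[name] = cur.fetchall()
    finally:
        cur.close()
    return results


def connect_sparkify():
    """Open a connection to sparkifydb (assumes psycopg2 and a local Postgres)."""
    import psycopg2  # imported lazily so run_checks works with any DB-API driver

    return psycopg2.connect("postgresql://student:student@127.0.0.1/sparkifydb")
```

With the database running, `run_checks(connect_sparkify())` returns a dictionary of result rows. An empty `first_songs` list or a zero count would suggest that `etl.py` has not been run yet.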