In this project, an extract-transform-load (ETL) pipeline for a data lake is built using Spark. Again, this project is quite similar to the previous projects; the main difference in this case is the use of a Spark cluster. In total, five tables should be created and populated using data from two different S3 sources:
- songplays, a fact table listing who played which song by which artist at what time, extracted from log data
- users, a dimension table containing user information like name and level, extracted from log data
- songs, a dimension table containing song information like title and length, extracted from song data
- artists, a dimension table containing artist information like name and location, extracted from song data
- time, a dimension table that lists the units of time (hours, minutes, etc.) for each timestamp in the fact table, extracted from log data
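As a rough sketch of the extraction step (the bucket layout below follows the usual Udacity setup; the exact S3 paths are an assumption and may differ):

    # Minimal sketch of the read step. The S3 paths assume the usual
    # Udacity bucket layout and may differ from the actual setup.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sparkify-data-lake-etl").getOrCreate()

    # Song data: one JSON object per file, nested several directories deep
    song_df = spark.read.json("s3a://udacity-dend/song_data/*/*/*/*.json")

    # Log data: newline-delimited JSON event logs, partitioned by year/month
    log_df = spark.read.json("s3a://udacity-dend/log_data/*/*/*.json")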
The relevant file for submission is etl.py. Given the presence of a suitable Spark installation, it can be executed from the CLI via:

    python etl.py
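On a cluster such as Amazon EMR, the same script would typically be launched through spark-submit instead, e.g.:

    spark-submit etl.py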
Unfortunately, the provided environment kept throwing 403 errors when trying to read from S3. Therefore, the script was developed and tested on Amazon EMR and could not be tested in the Udacity environment.
There are two relevant functions in etl.py, process_song_data() and process_log_data(). These functions perform operations that were done with SQL queries in previous exercises:
- Extract song and artist information from song data
- Extract user, time, and songplay information from log data. The latter requires a join with the song data in order to make the fields song_id and artist_id available in the songplay data (a rough sketch of this join follows below).
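A minimal sketch of that join, reusing the song_df and log_df frames from the read sketch above. The matching conditions and output columns here are assumptions based on the project's usual schema; etl.py may differ in detail:

    from pyspark.sql import functions as F

    # Keep only song-play events, then join against the song data so that
    # song_id and artist_id become available in the songplays fact table.
    # Matching on song title, artist name, and duration is an assumption;
    # etl.py may use a different join condition.
    songplays = (
        log_df.filter(F.col("page") == "NextSong")
        .join(
            song_df,
            (log_df.song == song_df.title)
            & (log_df.artist == song_df.artist_name)
            & (log_df.length == song_df.duration),
            how="left",
        )
        .select(
            F.monotonically_increasing_id().alias("songplay_id"),
            (F.col("ts") / 1000).cast("timestamp").alias("start_time"),
            F.col("userId").alias("user_id"),
            "level",
            "song_id",
            "artist_id",
            F.col("sessionId").alias("session_id"),
            "location",
            F.col("userAgent").alias("user_agent"),
        )
    )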
The following solutions were used as guidance: