
spark_data_lake


Introduction

This project builds an ETL pipeline that extracts the song and log datasets from S3, processes them with Spark, and loads the results back into a new S3 bucket as a set of fact and dimension tables. This allows the analytics team to keep finding insights into what songs their users are listening to.
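As a rough, self-contained illustration of what the fact/dimension split looks like in Spark, the toy example below joins a hypothetical songplays fact table against a songs dimension table; all table names, column names, and data here are made up for illustration and are not taken from this repository.

```python
# Toy illustration only: hypothetical table/column names and tiny in-memory data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("fact-dim-demo").getOrCreate()

# Fact table: one row per songplay event, referencing dimensions by id.
songplays = spark.createDataFrame(
    [("2018-11-01 21:01:46", "u1", "s1"), ("2018-11-01 21:05:52", "u2", "s1")],
    ["start_time", "user_id", "song_id"],
)
# Dimension table: descriptive attributes for each song.
songs = spark.createDataFrame(
    [("s1", "Some Song", 2018, 201.5)],
    ["song_id", "title", "year", "duration"],
)

songplays.createOrReplaceTempView("songplays")
songs.createOrReplaceTempView("songs")

# Analysts can then answer "what are users listening to?" with a simple join.
spark.sql("""
    SELECT p.start_time, p.user_id, s.title
    FROM songplays p JOIN songs s ON p.song_id = s.song_id
""").show()
```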

Datasets Used

  1. Song data: contains information about the songs.
  2. Log data: contains information about the users and the songs they listen to.

Data Model

(Data model diagram)

How to run the ETL Process:

  1. Update the AWS credentials in dl.cfg (a sketch of the expected layout is shown below).
  2. Run python etl.py.
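A minimal sketch of what dl.cfg is expected to contain; the section name and key names below are assumptions based on common AWS credential configs, not confirmed by the repository:

```ini
[AWS]
AWS_ACCESS_KEY_ID=<your access key id>
AWS_SECRET_ACCESS_KEY=<your secret access key>
```

With the credentials in place, the pipeline is launched from the project root:

```bash
python etl.py
```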

Files in the repository:

  1. etl.py --> Python script that runs the data pipeline (a structural sketch follows this list).
  2. dl.cfg --> Configuration file holding the AWS credentials (update it before running).
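For orientation, here is a minimal structural sketch of how etl.py might be organized; the function names, config keys, Hadoop package version, and S3 bucket paths are assumptions for illustration, not the repository's actual code:

```python
# Structural sketch only: names, paths, and config keys below are assumptions.
import configparser
import os

from pyspark.sql import SparkSession

config = configparser.ConfigParser()
config.read("dl.cfg")

# Expose the credentials from dl.cfg to the hadoop-aws S3 connector.
os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["AWS_ACCESS_KEY_ID"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["AWS_SECRET_ACCESS_KEY"]


def create_spark_session():
    """Create a Spark session with the hadoop-aws package so s3a:// paths work."""
    return (
        SparkSession.builder
        .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
        .getOrCreate()
    )


def process_song_data(spark, input_data, output_data):
    """Read the song dataset from S3 and write a songs dimension table as parquet."""
    song_df = spark.read.json(input_data + "song_data/*/*/*/*.json")
    songs_table = (
        song_df.select("song_id", "title", "artist_id", "year", "duration")
        .dropDuplicates()
    )
    songs_table.write.mode("overwrite") \
        .partitionBy("year", "artist_id") \
        .parquet(output_data + "songs/")


def main():
    spark = create_spark_session()
    # Hypothetical bucket names; replace with the real input/output buckets.
    process_song_data(spark, "s3a://input-bucket/", "s3a://output-bucket/")


if __name__ == "__main__":
    main()
```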

About

Created a data lake using Spark, Python, SQL, and Spark SQL.
