This module uses the Extract, Transform, and Load (ETL) process to build data pipelines. An ETL pipeline transforms data as it travels from one location to another, and analysis can only begin once such a pipeline has produced data in a usable form. Without ETL, the ability to perform data analysis is extremely limited.
Amazing Prime, a streaming service, is hosting a hackathon in which participants predict which low-budget films will go viral. Building those predictive models requires an ETL pipeline to prepare the data. The goal of this project is to create an ETL pipeline that can be refactored to process large datasets into a SQL database.
- Deliverable 1: Write an ETL Function to Read Three Data Files
- Deliverable 2: Extract and Transform the Wikipedia Data
- Deliverable 3: Extract and Transform the Kaggle Data
- Deliverable 4: Create the Movie Database
1- Extract - ETL_function_test.ipynb
2.1- Transform - ETL_clean_wiki_movies.ipynb
2.2- Transform - ETL_clean_kaggle_data.ipynb
3- Load - ETL_create_database.ipynb
4- Languages: Python, SQL
5- Tools: Jupyter Notebook, JSON, Pandas, NumPy, sqlalchemy, psycopg2
Extract Phase - Data from two different file formats, CSV and JSON, is read into the Python environment using Pandas; a rough sketch follows the file notes below.
CSV Files from Kaggle. (Note: Due to the size of some files, these CSV files have not been uploaded to the GitHub repository)
Wikipedia Movies JSON File. (Note: Due to its size, the file can only be viewed in raw format)
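The extraction step can be sketched roughly as follows. The file paths, parameter names, and the `low_memory` option are illustrative assumptions, not the project's exact code; the actual notebook reads whichever Kaggle CSVs and Wikipedia JSON file are supplied.

```python
import json
import pandas as pd

def extract(wiki_file, kaggle_metadata_file, ratings_file):
    """Read the Wikipedia JSON and the two Kaggle CSVs into Python objects."""
    # The Wikipedia data is a JSON file containing one dictionary per movie.
    with open(wiki_file, mode="r") as f:
        wiki_movies_raw = json.load(f)

    # The Kaggle files are plain CSVs, so pandas reads them directly.
    kaggle_metadata = pd.read_csv(kaggle_metadata_file, low_memory=False)
    ratings = pd.read_csv(ratings_file)

    return wiki_movies_raw, kaggle_metadata, ratings

# Hypothetical usage with local copies of the source files:
# wiki_raw, kaggle_metadata, ratings = extract(
#     "wikipedia-movies.json", "movies_metadata.csv", "ratings.csv")
```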
Transform Phase - The raw data was transformed into a dataset that could be loaded into the database. This involved looking for and handling missing or null values, dealing with corrupt data, and formatting the cleaned data, which was then organized into manageable groups and merged. A simplified sketch of this kind of cleaning is shown below.
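The sketch below illustrates the general pattern of null handling, type coercion, and merging described above. The column names, the 90% null threshold, and the `imdb_id` join key are assumptions for illustration; the real notebooks apply many more movie-specific transformations.

```python
import pandas as pd

def transform(wiki_movies_raw, kaggle_metadata):
    """Clean both datasets and merge them into one DataFrame."""
    wiki_movies_df = pd.DataFrame(wiki_movies_raw)

    # Drop columns that are mostly null (illustrative 90% threshold).
    keep_cols = [col for col in wiki_movies_df.columns
                 if wiki_movies_df[col].isnull().sum() < len(wiki_movies_df) * 0.9]
    wiki_movies_df = wiki_movies_df[keep_cols]

    # Example of handling corrupt data: coerce bad budget values to NaN
    # (assumes the Kaggle metadata exposes a 'budget' column).
    kaggle_metadata["budget"] = pd.to_numeric(kaggle_metadata["budget"],
                                              errors="coerce")

    # Merge the cleaned datasets on a shared identifier (assumed 'imdb_id').
    movies_df = pd.merge(wiki_movies_df, kaggle_metadata,
                         on="imdb_id", suffixes=("_wiki", "_kaggle"))
    return movies_df
```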
Load Phase - A connection to the database was established in the Python environment, and the final dataset was loaded into two SQL tables, as sketched below.
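A minimal sketch of the Load step using SQLAlchemy with psycopg2 is shown below. The connection string, database name, table names, and the chunked import of the large ratings file are assumptions for illustration, not the project's exact configuration.

```python
import pandas as pd
from sqlalchemy import create_engine

# Assumed local PostgreSQL connection; psycopg2 serves as the driver.
db_string = "postgresql://postgres:password@localhost:5432/movie_data"
engine = create_engine(db_string)

def load(movies_df, ratings_file):
    """Write the merged movie data and the ratings data to two SQL tables."""
    # The merged dataset fits in memory, so it is written in one call.
    movies_df.to_sql(name="movies", con=engine, if_exists="replace")

    # The ratings CSV is very large, so it is imported in chunks.
    for chunk in pd.read_csv(ratings_file, chunksize=1_000_000):
        chunk.to_sql(name="ratings", con=engine, if_exists="append", index=False)
```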