In this project, I build a simple data pipeline following the ETL (extract - transform - load) model on the Youtube Trending Video dataset, performing data processing, transformation, and calculation with the Apache Spark big data technology to serve a video search and recommendation system.
- Data Source: This project uses two main data sources: Youtube Trending Video data and the Youtube API. The Youtube Trending Video data is downloaded from Kaggle.com in `.csv` format and then loaded into `MySQL`, which acts as a data source.
- Using the `Video ID` and `Category ID` fields from the Youtube Trending Video data, we collect additional information from the `Youtube API`, such as `Video Link` and `Video Category`.
- Extract Data: Extract the above data sources with a `Polars` `DataFrame`, producing the `raw` layer, then load the data into the `MinIO` data lake (see the extract sketch after this list).
- Transform Data: From `MinIO`, we use `Apache Spark`, specifically `PySpark`, converting from a `Polars` `DataFrame` to a `PySpark` `DataFrame` for processing and calculation, which produces the `silver` and `gold` layers (see the transform sketch after this list). Data stored in `MinIO` is in `.parquet` format, which gives better processing performance.
- Load Data: Load the `gold` layer into the `PostgreSQL` data warehouse and perform additional transformations with `dbt`, creating an `index` that makes video search faster.
- Serving: The data is used for visualization with `Metabase` and for a video recommendation application built with `Streamlit`.
- Packaging and orchestration: Use `Docker` to containerize and package the project, and `Dagster` to orchestrate the `assets` across the different tasks.
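The extract step referenced above could look roughly like the following. This is a minimal sketch, not the project's actual Dagster asset code: the MySQL URI, MinIO credentials, bucket name, and table name are all illustrative assumptions.

```python
# Sketch of the extract step: read a table from MySQL with Polars and write it
# as Parquet into the "raw" layer on MinIO. All names/credentials are assumed.
import polars as pl
import s3fs

MYSQL_URI = "mysql://admin:admin@localhost:3306/youtube"  # assumed credentials
MINIO_ENDPOINT = "http://localhost:9000"                  # assumed MinIO endpoint


def extract_trending_videos() -> pl.DataFrame:
    # Polars can read directly from a database URI (requires a connector
    # such as connectorx installed alongside Polars)
    return pl.read_database_uri(
        query="SELECT * FROM youtube_trending", uri=MYSQL_URI
    )


def load_to_minio(df: pl.DataFrame, path: str) -> None:
    # MinIO speaks the S3 protocol, so s3fs works against it
    fs = s3fs.S3FileSystem(
        key="minio",
        secret="minio123",
        client_kwargs={"endpoint_url": MINIO_ENDPOINT},
    )
    with fs.open(path, "wb") as f:
        df.write_parquet(f)


if __name__ == "__main__":
    raw = extract_trending_videos()
    load_to_minio(raw, "lakehouse/raw/youtube_trending.parquet")
```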
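The transform and load steps could be sketched like this. Again, this is only an illustration: the MinIO endpoint, PostgreSQL credentials, bucket, and column names are assumptions, and the real silver/gold logic in the repository is more involved.

```python
# Sketch of transform + load: read raw Parquet from MinIO with PySpark,
# derive a small "gold" aggregate, and write it to PostgreSQL over JDBC.
# Requires the hadoop-aws and postgresql JDBC packages on the Spark classpath.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("youtube-transform")
    # s3a settings point Spark at MinIO instead of AWS S3 (assumed credentials)
    .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# raw -> silver: read the Parquet files and drop obvious duplicates
silver = (
    spark.read.parquet("s3a://lakehouse/raw/youtube_trending.parquet")
    .dropDuplicates(["video_id", "trending_date"])
)

# silver -> gold: a simple per-category aggregate of views and likes
gold = silver.groupBy("categoryId").agg(
    F.sum("view_count").alias("total_views"),
    F.avg("likes").alias("avg_likes"),
)

# gold -> PostgreSQL data warehouse via JDBC (dbt then builds indexes on top)
gold.write.format("jdbc").options(
    url="jdbc:postgresql://localhost:5432/warehouse",
    dbtable="gold.video_category_stats",
    user="admin",
    password="admin123",
    driver="org.postgresql.Driver",
).mode("overwrite").save()
```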
Technologies used: MySQL, Youtube API, Polars, MinIO, Apache Spark, PostgreSQL, dbt, Metabase, Streamlit, Dagster, Docker, Apache Superset, Unittest, Pytest.
Here's what you can do with this project:
- You can completely change the logic or create new `assets` in the `data pipeline` as you wish, and perform `aggregate` calculations on the `assets` in the `pipeline` according to your purposes.
- You can also create new `data charts` or change the existing `charts` as you like, with extremely diverse `chart types` in `Metabase` and `Apache Superset`.
- You can also create new `dashboards` or change my existing ones as you like.

For the `Video Recommendation` app:
- Search videos quickly with any `keyword`.
- Search in many different languages, not just `English`: for example `Japanese`, `German`, and `Russian`, as well as trending data from regions such as Canada and India.
- Recommend videos based on video `category` and `tags` (see the sketch below).
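As an illustration of the category/tag based recommendation, a minimal sketch might look like the following. It assumes the gold layer exposes `video_id`, `categoryId`, and a pipe-separated `tags` column; the actual Streamlit app may implement this differently.

```python
# Sketch of a category/tag based recommendation over a pandas DataFrame.
# Column names and the "|"-separated tags format are assumptions.
import pandas as pd


def recommend(videos: pd.DataFrame, video_id: str, top_n: int = 5) -> pd.DataFrame:
    """Rank other videos by shared category and overlapping tags."""
    seed = videos.loc[videos["video_id"] == video_id].iloc[0]
    seed_tags = set(str(seed["tags"]).split("|"))

    candidates = videos[videos["video_id"] != video_id].copy()
    candidates["score"] = (
        # same category counts double, each shared tag adds one point
        (candidates["categoryId"] == seed["categoryId"]).astype(int) * 2
        + candidates["tags"].apply(lambda t: len(seed_tags & set(str(t).split("|"))))
    )
    return candidates.sort_values("score", ascending=False).head(top_n)
```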
During this project, I learned important skills, came to understand complex ideas, and learned how to install and set up popular, useful tools, all of which brought me closer to becoming a Data Engineer.
- Logical thinking: I learned how to think like a data person, find the cause of a data `problem`, and then come up with the most `reasonable solution` to achieve high data `accuracy`.
- Architecture: I understand the `ideas` and `architecture` behind `Apache Spark`, one of today's most popular big data processing tools.
- Installation: I learned how to install popular data processing, visualization, and storage tools such as `Metabase`, `Streamlit`, and `MinIO` with `Docker`.
- Setup: I know how to set up a `Spark Standalone Cluster` using `Docker`, with three `Worker Nodes` running on my local machine (sketched below).
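For reference, an application points at that standalone cluster roughly like this. The `spark-master` hostname and resource settings are assumptions based on a typical Docker setup; 7077 is Spark's default master port.

```python
# Connecting an application to a Spark Standalone Cluster running in Docker.
# "spark-master" is an assumed service name on the Docker network.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master:7077")     # standalone master instead of local[*]
    .appName("youtube-etl")
    .config("spark.executor.memory", "1g")   # sized for three small worker containers
    .config("spark.executor.cores", "1")
    .getOrCreate()
)

print(spark.sparkContext.master)  # -> spark://spark-master:7077
```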
Each part of this project has helped me better understand how to build a data engineering and data management project, learn new knowledge, and improve my skills for future work.
- Add more `data sources` to increase data richness.
- Consider other `data warehouses` besides `PostgreSQL`, such as `Amazon Redshift` or `Snowflake`.
- Perform more `cleaning` and `optimization` of the data.
- Perform more advanced `statistics`, `analysis`, and `calculations` with `Apache Spark`.
- Check out other popular `data orchestration` tools like `Apache Airflow`.
- Separate `dbt` into its own service (a separate `container`) in `docker` when the project expands.
- Set up the `Spark Cluster` on `cloud platforms` instead of on `local machines`.
- Consider `cloud computing` services if the project grows larger.
- Learn about `dbt packages` like `dbt-labs/dbt_utils` to make the `transformation` process faster and more optimal.
To run the project in your local environment, follow these steps:
- Run the following command to clone the `repository` to your `local machine`:
git clone https://github.com/longNguyen010203/Youtube-ETL-Pipeline.git
- Run the following commands to build the images from the `Dockerfile`, pull images from `Docker Hub`, and launch the services:
make build
make up
- Run the following commands to access the `SQL editor` in the `terminal` and check that `local_infile` is turned on:
make to_mysql_root
SET GLOBAL local_infile=TRUE;
SHOW VARIABLES LIKE "local_infile";
exit
- Run the following commands to create tables with schemas in `MySQL`, load the data from the `CSV` files into `MySQL`, and create tables with schemas in `PostgreSQL`:
make mysql_create
make mysql_load
make psql_create
- Open http://localhost:3001 to view the `Dagster UI` and click the `Materialize all` button to run the pipeline.
- Open http://localhost:9001 to view the `MinIO UI` and check that the data has been loaded.
- Open http://localhost:8080 to view the `Spark UI` and verify that the three `workers` are running.
- Open http://localhost:3030 to see the charts and `dashboards` on `Metabase`.
- Open http://localhost:8501 to try out the `video recommendation` app on `Streamlit`.