MySQL => PostgreSQL PySpark ETL/EL. Pandas EL pipeline

nekoduykod/spark-pipelines

I. PySpark ETL/EL

MySQL=>PostgreSQL PySpark pipeline

The scripts support two modes: ETL, which extracts data from MySQL, runs aggregations in Spark, and loads the results into PostgreSQL; and EL, which loads the raw data as-is and leaves transformations to the destination database (aka ELT).
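The ETL mode described above can be sketched as follows. This is a minimal illustration, not the repo's actual script: the table and column names (`orders`, `customer_id`, `amount`) and the database names are placeholders.

```python
def jdbc_url(driver: str, host: str, port: int, db: str) -> str:
    """Build the JDBC URL that Spark's reader/writer expects."""
    return f"jdbc:{driver}://{host}:{port}/{db}"

def run_etl(spark, user: str, password: str) -> None:
    # Imported here so the module stays importable without Spark installed.
    from pyspark.sql import functions as F

    # Extract: read the source table through the MySQL connector JAR.
    orders = (spark.read.format("jdbc")
              .option("url", jdbc_url("mysql", "localhost", 3306, "source_db"))
              .option("dbtable", "orders")
              .option("user", user)
              .option("password", password)
              .load())

    # Transform: aggregate in Spark before loading (the "T" in ETL).
    totals = (orders.groupBy("customer_id")
              .agg(F.sum("amount").alias("total_amount")))

    # Load: write the aggregated result into PostgreSQL.
    (totals.write.format("jdbc")
     .option("url", jdbc_url("postgresql", "localhost", 5432, "target_db"))
     .option("dbtable", "order_totals")
     .option("user", user)
     .option("password", password)
     .mode("overwrite")
     .save())
```

For the EL variant, skip the `groupBy` and write `orders` through unchanged; transformations then happen in PostgreSQL.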

Prerequisites

  • Python 3.10 and PySpark 3.4.1 were used
  • MySQL and PostgreSQL installed, plus a dataset from Kaggle
  • Download the MySQL and PostgreSQL JDBC connector JAR files. Windows users: choose Java 8 and download winutils.exe. Linux/macOS users: ensure your Hadoop and Spark versions are compatible.
  • Add a .env file (JAVA_HOME, HADOOP_HOME)
  • Disable the firewall and antivirus software, or create appropriate ingress rules
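The environment prerequisites above can be wired up before building the SparkSession. A sketch, assuming a Windows install: the paths and JAR file names below are examples and must be adjusted to your setup.

```python
import os

# Point Spark at the JVM and Hadoop utilities from the prerequisites.
# Example paths: on Windows, %HADOOP_HOME%\bin must contain winutils.exe.
os.environ.setdefault("JAVA_HOME", r"C:\Program Files\Java\jdk1.8.0_202")
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")

def connector_jars(*jar_paths: str) -> str:
    """Join connector JARs into the comma-separated list spark.jars expects."""
    return ",".join(jar_paths)

# Example: pass both JDBC drivers to the session builder, e.g.
#   SparkSession.builder.config(
#       "spark.jars",
#       connector_jars("mysql-connector-j-8.0.33.jar",
#                      "postgresql-42.6.0.jar"))
```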

Results

  1. MySQL dataset:

  2. MySQL => PostgreSQL transfer using PySpark:

  3. Imported data with transformations applied:

II. Pandas ETL/EL

A small, simple task: a "MS SQL Server => PostgreSQL" pipeline

Prerequisites

  • Python 3.x (pandas, SQLAlchemy)
  • PostgreSQL and SQL Server databases configured
  • Install Airflow with Docker, or in a separate repo, to avoid conflicts with the script modules
  1. Download and configure the databases
  2. Load a data sample into the source database
  3. Extract the data with Python
  4. Transform it (optionally)
  5. Load it into the other database
  Optionally: schedule with Airflow
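Steps 3-5 above can be sketched with pandas and SQLAlchemy. The connection strings, table name, and `transform` hook below are placeholders, not the repo's actual code.

```python
import pandas as pd
from sqlalchemy import create_engine

def replicate(src_engine, dst_engine, table: str, transform=None) -> int:
    """Copy one table from the source to the destination database."""
    df = pd.read_sql_table(table, src_engine)        # 3. Extract
    if transform is not None:                        # 4. Transform (optional)
        df = transform(df)
    df.to_sql(table, dst_engine,                     # 5. Load
              if_exists="replace", index=False)
    return len(df)

# Example wiring (placeholder connection strings):
# src = create_engine("mssql+pyodbc://user:pw@host/source_db"
#                     "?driver=ODBC+Driver+17+for+SQL+Server")
# dst = create_engine("postgresql+psycopg2://user:pw@localhost/target_db")
# replicate(src, dst, "customers")
```

To schedule with Airflow, call `replicate` from a PythonOperator task.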

Result

Each byte replicated
