The scripts provide flexibility: ETL - extract data from MySQL => run aggregations in Spark => load the results to PostgreSQL; EL - copy the raw data and leave transformations to the destination database (aka ELT). A minimal ETL sketch is shown below.
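The following is a rough sketch of the ETL flavour, not the exact script from this repo: the JDBC URLs, credentials, table names (`orders`, `customer_totals`), and the aggregation itself are placeholder assumptions to adjust to your setup.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("mysql_to_postgres_etl")
    # Paths to the downloaded connector JARs (assumed locations)
    .config("spark.jars", "jars/mysql-connector-j-8.0.33.jar,jars/postgresql-42.6.0.jar")
    .getOrCreate()
)

# Extract: read the source table from MySQL over JDBC
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/source_db")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "orders")  # hypothetical source table
    .option("user", "mysql_user")
    .option("password", "mysql_password")
    .load()
)

# Transform: an example aggregation (totals per customer)
totals = orders.groupBy("customer_id").agg(
    F.sum("amount").alias("total_amount"),
    F.count("*").alias("order_count"),
)

# Load: write the aggregated result to PostgreSQL
(
    totals.write.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/target_db")
    .option("driver", "org.postgresql.Driver")
    .option("dbtable", "customer_totals")  # hypothetical target table
    .option("user", "pg_user")
    .option("password", "pg_password")
    .mode("overwrite")
    .save()
)
```

The EL variant is the same flow without the `groupBy` step: the raw rows are written to PostgreSQL and aggregated there with SQL.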
- Python 3.10 and PySpark 3.4.1 were used
- MySQL & PostgreSQL installed, plus a dataset from Kaggle
- Download the MySQL & PostgreSQL connector JAR files. Windows users: choose Java 8 and download winutils.exe from here; Linux/macOS users: ensure the Hadoop and Spark versions are compatible
- Add a .env file with JAVA_HOME and HADOOP_HOME (see the sketch after this list)
- Disable the firewall and antivirus software, or create appropriate ingress rules
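A minimal sketch of the environment wiring, assuming python-dotenv is installed (`pip install python-dotenv`); the .env contents shown in the comments are placeholder paths.

```python
# Example .env (placeholder values, adjust to your installation):
#   JAVA_HOME=C:\Program Files\Java\jdk1.8.0_202
#   HADOOP_HOME=C:\hadoop    # bin\winutils.exe lives under this folder on Windows
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

# Spark picks these up when it launches the JVM
print("JAVA_HOME =", os.environ.get("JAVA_HOME"))
print("HADOOP_HOME =", os.environ.get("HADOOP_HOME"))
```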
- The MySQL dataset:
- MySQL => PostgreSQL transfer using PySpark:
- Imported data after transformation:
A smaller, simpler task: an "MS SQL Server => PostgreSQL" pipeline
- Python 3.x (pandas, SQLAlchemy)
- PostgreSQL and SQL Server databases configured
- Install Airflow with Docker, or in a separate repo, to avoid conflicts with the script modules
- Download and configure databases
- Load a data sample to a database
- Extract data using Python (see the pandas/SQLAlchemy sketch after this list)
- Transform (optionally)
- Load to another DB
- Optionally: schedule with Airflow (a minimal DAG sketch is included below)
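A minimal sketch of the extract/transform/load steps with pandas and SQLAlchemy, assuming hypothetical connection strings and a hypothetical `sales` table; the pyodbc and psycopg2 drivers must be installed.

```python
import pandas as pd
from sqlalchemy import create_engine

# Extract: read from SQL Server (pyodbc driver assumed)
mssql = create_engine(
    "mssql+pyodbc://user:password@localhost/source_db"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
df = pd.read_sql("SELECT * FROM sales", mssql)  # hypothetical source table

# Transform (optional): e.g. normalise column names
df.columns = [c.lower() for c in df.columns]

# Load: write into PostgreSQL (psycopg2 driver assumed)
pg = create_engine("postgresql+psycopg2://user:password@localhost:5432/target_db")
df.to_sql("sales", pg, if_exists="replace", index=False)
```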
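And a minimal Airflow sketch for the optional scheduling step, assuming the pipeline above is wrapped in a `run_pipeline()` function inside a hypothetical `etl_script` module on Airflow's Python path.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

from etl_script import run_pipeline  # hypothetical module wrapping the steps above

# Run the extract/transform/load pipeline once a day
with DAG(
    dag_id="mssql_to_postgres",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(
        task_id="run_pipeline",
        python_callable=run_pipeline,
    )
```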