Welcome to the Foundation Workspace repository! This project aims to provide a comprehensive workspace environment for data engineering tasks involving Airflow, Spark, Hive, and Azure Data Lake Gen2. By leveraging Docker, users can easily set up a consistent environment with all necessary dependencies for their ETL (Extract, Transform, Load) workflows.
| Tool | Version | Description |
|---|---|---|
| Docker | 24.0.7 | See Mac installation instructions. |
| Java 17 SDK | openjdk-17-jre-headless | Installed in the Dockerfile via `RUN apt-get install -y openjdk-17-jre-headless`. |
| Airflow | apache/airflow:2.8.4-python3.10 | Base image. See release history here. |
| Spark | 3.5.1 | bitnami/spark:latest. See release history here. |
| Hive | apache/hive:4.0.0-alpha-2 | See release history here. |
| Azure Data Lake Gen2 | hadoop-azure-3.3.1.jar | The JAR must be supplied at spark-submit time (see the sketch after this table). |
| Python | 3.10 | Installed via the apache/airflow:2.8.4-python3.10 image. |
| PySpark | 3.5.1 | Must match the Spark version. |
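The hadoop-azure JAR has to be on Spark's classpath before any Azure Data Lake Gen2 path can be read. As a minimal sketch (the JAR path `/opt/airflow/src/jars/hadoop-azure-3.3.1.jar` is an assumption for illustration, not a path confirmed by this repository), it can be attached either on the spark-submit command line or when the SparkSession is built from PySpark:

```python
from pyspark.sql import SparkSession

# Equivalent spark-submit form (the JAR path below is an assumption):
#   spark-submit --jars /opt/airflow/src/jars/hadoop-azure-3.3.1.jar my_job.py

spark = (
    SparkSession.builder
    .appName("adls-gen2-demo")
    # Attach the Azure connector JAR; adjust the path to wherever the JAR lives in your image.
    .config("spark.jars", "/opt/airflow/src/jars/hadoop-azure-3.3.1.jar")
    .getOrCreate()
)
```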
- Dockerized Environment: Docker containers pre-configured with Airflow, Spark, Hive, and Azure Data Lake Gen2 dependencies, ensuring a seamless setup across different platforms.
- Complete ETL Examples: Two comprehensive ETL examples are included in the repository (see the sketches after this list):
  - Azure Data Lake Gen2: Connect to Azure Data Lake Gen2 and perform ETL operations with PySpark, demonstrating the integration.
  - Local Metastore: Run ETL tasks against a local metastore with PySpark and Hive, showcasing flexibility across different data storage setups.
- Diverse DAGs: Various Directed Acyclic Graphs (DAGs) using Python and Bash operators demonstrate different workflow configurations and task executions (see the DAG sketch after this list).
- Configuration Files: Essential configuration files such as the Dockerfile, Java layout, and Docker Compose files are included, simplifying setup and customization of the workspace environment.
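To make the Azure Data Lake Gen2 example concrete, here is a minimal sketch of the kind of PySpark job it describes. The storage account name, container, account key, column names, and file paths are placeholders for illustration, not values taken from this repository:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder values -- substitute your own storage account, container, and key.
ACCOUNT = "mystorageaccount"
CONTAINER = "raw"
ACCOUNT_KEY = "<storage-account-key>"

spark = (
    SparkSession.builder
    .appName("adls-gen2-etl-sketch")
    # Assumed JAR location; adjust to wherever hadoop-azure-3.3.1.jar lives in your image.
    .config("spark.jars", "/opt/airflow/src/jars/hadoop-azure-3.3.1.jar")
    # Account-key auth for the ABFS driver, propagated to the Hadoop configuration.
    .config(f"spark.hadoop.fs.azure.account.key.{ACCOUNT}.dfs.core.windows.net", ACCOUNT_KEY)
    .getOrCreate()
)

# Extract: read raw CSV data from the Gen2 container over the abfss:// scheme.
src = f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net/input/orders.csv"
df = spark.read.option("header", True).csv(src)

# Transform: a trivial aggregation as a stand-in for real business logic.
summary = df.groupBy("country").agg(F.count(F.lit(1)).alias("order_count"))

# Load: write the result back to the lake as Parquet.
dst = f"abfss://{CONTAINER}@{ACCOUNT}.dfs.core.windows.net/output/order_summary"
summary.write.mode("overwrite").parquet(dst)
```

In practice the account details would typically come from Airflow connections or environment variables rather than being hard-coded.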
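Similarly, the local-metastore example can be pictured as a PySpark session with Hive support enabled, persisting a managed table. The database and table names below are placeholders, not objects defined in this repository:

```python
from pyspark.sql import SparkSession

# Hive support lets Spark persist tables in the project's local metastore.
spark = (
    SparkSession.builder
    .appName("local-metastore-etl-sketch")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("CREATE DATABASE IF NOT EXISTS demo_db")  # placeholder database name

df = spark.createDataFrame(
    [("alice", 42), ("bob", 7)],
    ["name", "score"],
)

# Persist as a Hive-managed table so it is queryable through HiveServer2 as well.
df.write.mode("overwrite").saveAsTable("demo_db.scores")

spark.sql("SELECT * FROM demo_db.scores").show()
```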
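The DAGs shipped in /dags mix operator types; the following is a minimal illustrative sketch of a DAG combining a PythonOperator and a BashOperator. The DAG id, task ids, and callable are hypothetical and not taken from this repository, and /opt/airflow/dags assumes the standard Airflow Docker mount:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def print_greeting():
    # Stand-in for a real transformation step.
    print("Hello from a PythonOperator task")


with DAG(
    dag_id="example_mixed_operators",  # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule=None,                     # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    say_hello = PythonOperator(
        task_id="say_hello",
        python_callable=print_greeting,
    )

    list_dags_dir = BashOperator(
        task_id="list_dags_dir",
        bash_command="ls -l /opt/airflow/dags",  # assumed mount point of the /dags folder
    )

    say_hello >> list_dags_dir
```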
- Clone the repository to your local machine:
  `git clone https://github.com/aaliashraf/airflow-spark-hive-azure-docker-workflow.git`
- Navigate to the repository directory:
  `cd airflow-spark-hive-azure-docker-workflow`
- Build the images:
  `docker compose build`
- Generate the `.env` file containing the required Airflow UID:
  `echo AIRFLOW_UID=1000 > .env`
- Start the containers:
  `docker compose up`
After starting the containers, you can access the services through the following URLs:

- Airflow web UI: http://localhost:8080 (Username: airflow, Password: airflow)
- Spark UI: http://localhost:8181
- HiveServer2 web UI: http://localhost:10002
- /dags: Contains Airflow DAGs and workflows for ETL tasks.
- /logs: Airflow logs.
- /plugins: Airflow plugins.
- /src: Utility scripts, PySpark code, and JARs.
- /metastore: Local Hive metastore (databases and tables stored on disk).
- Dockerfile: Dockerfile for building the custom Docker image.
- docker-compose.yaml: Docker Compose file for orchestrating containers.
Have fun! 🚀🚀🚀