ETL data pipeline project to ingest data from the TIA Public API.

- `posts_pipeline` - runs hourly to ingest the 30 latest posts
- `comments_pipeline` - runs daily at 1 AM to ingest comments on yesterday's posts
- Docker and Docker Compose
- At least 2 GB of RAM allocated to the VM and 2 GB of free disk space to run the services.
- Clone or download this repository to your machine.
- Open your terminal/console and make sure the `docker` and `docker-compose` commands are executable.
- Go to the repository root directory using your terminal.
- Execute `docker-compose build` to build airflow, postgres, and adminer as Docker services. The command downloads a lot of data, so make sure you have a stable internet connection. If there is an error, try executing the command a few more times until it succeeds; if that doesn't help, raise an issue.
- After all the services are built, execute `docker-compose up -d` to start them. You can inspect the running services with `docker-compose ps` to see whether they are up or failing. If one of the services is failing, inspect it using `docker-compose logs <service_name>` (e.g. `docker-compose logs airflow`). Another issue that may happen is an existing service already using one of the ports our services need; if that happens, stop the existing service(s) first and then start the project services.
- After all the services are running, open the Airflow UI at http://localhost:8080 to make sure the stack really went up. The whole setup flow is summarized in the sketch after this list.
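For convenience, here is the setup flow above as a single shell session (a sketch; `airflow` in the logs command is just the example service name used earlier):

```sh
# Verify that Docker and Docker Compose are installed and executable
docker --version
docker-compose --version

# From the repository root: build and start the services
docker-compose build
docker-compose up -d

# Check that all services are up; inspect any failing one
docker-compose ps
docker-compose logs airflow   # e.g. the airflow service
```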
- Open the Airflow UI at http://localhost:8080.
- Login using `admin` as username and `admin` as password.
- You will be redirected to the home page, where you can find our paused DAGs. Let's start with the `posts_pipeline` DAG.
- Click on the `posts_pipeline` title to enter the DAG page, or open http://localhost:8080/tree?dag_id=posts_pipeline.
- On the posts pipeline DAG page, click the blue switch to start the DAG. You may also want to check the Tree View and Graph View on this page to see whether the tasks are running, retrying, or failed, and inspect the DAG code by clicking the Code tab.
- After all inspection is done, you can pause the DAG by switching it back to off, or leave it running on schedule.
- You can do the same for the `comments_pipeline` DAG. (If you prefer the terminal over the UI, see the CLI sketch below.)
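The same unpause/pause cycle can be driven from the Airflow CLI inside the container (a sketch; `dags unpause`, `dags trigger`, and `dags pause` are standard Airflow 2.x subcommands, which the `dags test` commands later in this README suggest is the version in use):

```sh
# Unpause (start) the posts pipeline and trigger a run immediately
docker-compose exec airflow airflow dags unpause posts_pipeline
docker-compose exec airflow airflow dags trigger posts_pipeline

# Pause it again when you are done inspecting
docker-compose exec airflow airflow dags pause posts_pipeline
```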
- Open the Adminer web UI at http://localhost:32767.
- Login using `airflow` as username and `airflow` as password. Make sure to use `PostgreSQL` as the system, `postgres` as the server, and `airflow_db` as the database.
- After logging in, you can start to monitor the ingested posts data by choosing DB: `airflow_db`, Schema: `public`, and clicking on the `posts` table.
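You can also query the same table straight from the terminal, without Adminer (a sketch; it assumes the Postgres compose service is named `postgres`, matching the server name used in the Adminer login above):

```sh
# Count the ingested posts directly in the airflow_db database
docker-compose exec postgres psql -U airflow -d airflow_db \
  -c "SELECT COUNT(*) FROM public.posts;"
```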
You can edit any available DAGs under the docker/airflow/dags folder using any code editor you prefer; the Docker volume mapping will propagate the changes to the corresponding DAGs in the container.

You may want to change a DAG's schedule by changing the cron expression in the DAG's `schedule_interval` parameter, as in the sketch below.
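A minimal sketch of what that looks like in a DAG file (the DAG id and task here are illustrative, not the project's actual code; `schedule_interval` is the point):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Illustrative DAG, not the project's actual code: the cron expression in
# schedule_interval controls when it runs. "0 * * * *" means at minute 0
# of every hour, i.e. an hourly schedule like posts_pipeline's.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="0 * * * *",  # edit this cron expression to reschedule
    catchup=False,
) as dag:
    hello = BashOperator(task_id="hello", bash_command="echo hello")
```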
After you modify a DAG file, you can open that DAG's page in the Airflow UI and see the changes applied in the Code tab.
You can test a whole DAG pipeline using the command:

```sh
docker-compose exec airflow airflow dags test <dag_id> <desired_execution_date>
```

Here is an example of testing the posts_pipeline DAG:

```sh
docker-compose exec airflow airflow dags test posts_pipeline 2022-01-01
```

And here is an example of testing the comments_pipeline DAG:

```sh
docker-compose exec airflow airflow dags test comments_pipeline 2022-01-01T01:00:00
```
When you are developing a task, you may want to test it individually using the command:

```sh
docker-compose exec airflow airflow tasks test <dag_id> <task_id> <desired_execution_date>
```

Here is an example of testing one of the posts_pipeline tasks:

```sh
docker-compose exec airflow airflow tasks test posts_pipeline is_tia_public_api_accessible 2022-01-01
```

And here is an example of testing one of the comments_pipeline tasks:

```sh
docker-compose exec airflow airflow tasks test comments_pipeline is_postgres_accessible 2022-01-01T01:00:00
```
To stop all services, you can use the docker-compose command:

```sh
docker-compose down
```
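If you also want to discard the ingested data, `docker-compose down` accepts a `-v` flag that removes the volumes declared in docker-compose.yml along with the containers (this assumes the Postgres data lives in such a volume):

```sh
# Stop the services and also remove their Docker volumes (data is lost)
docker-compose down -v
```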




