This repository provides an easy-to-follow guide to Apache Airflow, explaining how to create, run, and manage data pipelines. It includes steps for installing Airflow with Docker, which simplifies the setup. The guide also covers basic concepts such as DAGs (Directed Acyclic Graphs), which represent workflows, and operators, which define the tasks within those workflows.
- Basic concepts of Airflow
- Code Example for Using a DAG
- Install Apache Airflow using Docker
- Accessing the Environment
- Disclaimer
A DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. The DAG is "directed" because the tasks must be run in a specific order, and it's "acyclic" because it doesn't contain any cycles, meaning a task can’t depend on itself either directly or indirectly. In Airflow, DAGs define how tasks are scheduled and triggered, but the DAG itself does not perform any actions.
Key Point: A DAG defines the structure and flow of tasks but doesn't execute them directly.
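For example, a DAG can be declared in just a few lines. This is a minimal sketch; the dag_id, start date, and schedule below are illustrative and not taken from this repository:

from datetime import datetime
from airflow import DAG

# Declaring the DAG only describes the workflow and its schedule;
# nothing runs until the scheduler creates a DAG run for it.
with DAG(
    dag_id="example_minimal_dag",     # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    pass  # tasks are added inside this block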
A task is a single unit of work within a DAG. Each task represents a specific operation, such as pulling data from a database, processing data, or sending an email notification. In Airflow, a task is defined by an Operator and can be subject to scheduling, retry logic, and other runtime behaviors.
Key Point: Tasks are the individual pieces of work within a DAG.
Operators are templates that define what actions a task should perform. Airflow provides different operators for different types of tasks, such as:
BashOperator: Executes a bash command.
PythonOperator: Executes a Python function.
EmailOperator: Sends an email.
Key Point: An operator defines what action a task will perform.
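As a sketch of how tasks and operators fit together, the example below defines two tasks with a BashOperator and a PythonOperator and chains them. The task ids, bash command, and Python function are made up for illustration:

from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def _greet():
    # illustrative callable executed by the PythonOperator task
    print("hello from Airflow")

with DAG(
    dag_id="example_operators_dag",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
) as dag:
    # each operator instance becomes one task in the DAG
    print_date = BashOperator(task_id="print_date", bash_command="date")
    greet = PythonOperator(task_id="greet", python_callable=_greet)

    # ">>" declares the dependency: print_date must finish before greet starts
    print_date >> greet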
The executor is responsible for running the tasks defined by the DAG. It defines how and where tasks are executed. There are different types of executors in Airflow, such as:
SequentialExecutor: Runs tasks one by one.
LocalExecutor: Runs tasks in parallel on the local machine.
Key Point: The executor determines how tasks are distributed and executed across resources.
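In the Docker setup described below, the executor is selected through the AIRFLOW__CORE__EXECUTOR environment variable in docker-compose.yaml; the stock file ships with the CeleryExecutor. A fragment for illustration (switching to another executor may require adjusting other services in the file):

environment:
  AIRFLOW__CORE__EXECUTOR: CeleryExecutor  # e.g. LocalExecutor to run tasks in parallel on one machine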
The code examples are in the dags directory, which contains the following files (a minimal TaskFlow-style sketch follows this list):
- bash_DAG.py: code example for the BashOperator.
- python_DAG.py: code example for the PythonOperator.
- CatchUp_explain_DAG.py: how to use and set the catchup parameter.
- taskflow_API_DAG.py: how to create a DAG using the TaskFlow API.
- cron_in_DAG.py: how to schedule your DAG using a cron expression.
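To give a flavour of what those files cover, here is a minimal TaskFlow-style sketch that also uses a cron schedule and the catchup flag. The names, schedule, and task bodies are illustrative and are not the exact contents of the repository files:

from datetime import datetime
from airflow.decorators import dag, task

# "0 6 * * *" is a cron expression: run every day at 06:00.
# catchup=False stops Airflow from back-filling runs for past dates.
@dag(
    dag_id="example_taskflow_cron",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",
    catchup=False,
)
def example_pipeline():
    @task
    def extract():
        return {"value": 42}          # illustrative payload passed via XCom

    @task
    def report(data: dict):
        print(f"extracted value: {data['value']}")

    report(extract())                 # TaskFlow infers the task dependency

example_pipeline()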
- Check that your Docker has more than 4 GB of RAM.
docker run --rm "debian:bookworm-slim" bash -c "numfmt --to iec $(echo $(($(getconf _PHYS_PAGES) * $(getconf PAGE_SIZE))))"
- Download the docker-compose.yaml file for Apache Airflow.
curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.10.2/docker-compose.yaml'
or
Invoke-WebRequest -Uri 'https://airflow.apache.org/docs/apache-airflow/2.10.2/docker-compose.yaml' -OutFile 'docker-compose.yaml'
or
copy the contents of https://airflow.apache.org/docs/apache-airflow/2.10.2/docker-compose.yaml into a local docker-compose.yaml file
- Run mkdir to create the dags, logs, plugins, and config directories.
mkdir -p ./dags ./logs ./plugins ./config
or
mkdir dags, logs, plugins, config
- Create a .env file to declare AIRFLOW_UID. On Linux or macOS:
echo -e "AIRFLOW_UID=$(id -u)" > .env
On Windows there is no numeric user id, so the default value is fine:
echo "AIRFLOW_UID=50000" > .env
or
create a .env file and paste AIRFLOW_UID=50000 into it
- Initialize the database (this runs migrations and creates the default airflow account):
docker-compose up airflow-init
- Start all services:
docker-compose up
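Once the containers are up, you can check their state (the stock compose file defines health checks, so the services should eventually show as healthy):
docker-compose ps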
- Install the Apache Airflow Python library using
pip install apache-airflow
You can also list additional Python libraries to be installed inside the Airflow containers via _PIP_ADDITIONAL_REQUIREMENTS in docker-compose.yaml:
_PIP_ADDITIONAL_REQUIREMENTS: "${_PIP_ADDITIONAL_REQUIREMENTS:-your_library}"
# Example
_PIP_ADDITIONAL_REQUIREMENTS: "${_PIP_ADDITIONAL_REQUIREMENTS:-numpy pandas matplotlib}"
# see lines 69-79 of docker-compose.yaml in this repository
Use the volumes key in docker-compose.yaml to mount your files into the Airflow containers:
volumes:
- [your_file_path]:[Destination_file_path]
# Example
volumes:
- ${AIRFLOW_PROJ_DIR:-.}/my_file:/opt/airflow/file_in_container
- ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
- ${AIRFLOW_PROJ_DIR:-.}/logs:/opt/airflow/logs
- ${AIRFLOW_PROJ_DIR:-.}/config:/opt/airflow/config
- ${AIRFLOW_PROJ_DIR:-.}/plugins:/opt/airflow/plugins
# see lines 69-79 of docker-compose.yaml in this repository
After starting Airflow, you can interact with it in three ways:
- Via a browser using the web interface.
- By using the REST API.
- By running CLI commands.
The webserver is available at http://localhost:8080. The default login credentials are:
- Username: airflow
- Password: airflow
The REST API is served at the same address, http://localhost:8080, and by default accepts the same credentials via basic authentication:
- Username: airflow
- Password: airflow
Example command to send a request to the REST API:
ENDPOINT_URL="http://localhost:8080"
curl -X GET \
  --user "airflow:airflow" \
  "${ENDPOINT_URL}/api/v1/pools"