Contributor's Quick Start

The outline for this document in GitHub is available via the button in the top-right corner (the icon with three dots and three lines).

Note to Starters

Airflow is a complex project, but setting up a working environment is quite simple if you follow the guide.

There are three ways you can run the Airflow dev env:

  1. With Docker containers and Docker Compose (on your local machine). This environment is managed with Breeze, a tool written in Python that makes environment management, you guessed it, a breeze
  2. With a local virtual environment (on your local machine)
  3. With a remote, managed environment (via remote development environment)

Before deciding which method to choose, there are a couple of factors to consider:

  • Running Airflow in a container is the most reliable way: it provides a more consistent environment and allows integration tests with a number of integrations (cassandra, mongo, mysql, etc.). However, it also requires 4GB RAM, 40GB disk space and at least 2 cores.
  • If you are working on a basic feature, installing Airflow in a local environment might be sufficient. For a comprehensive venv tutorial, see Local virtualenv
  • You usually need a paid account to access a managed, remote virtual environment.
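
To compare your machine against the container requirements listed above (4GB RAM, 40GB disk space, 2 cores), a minimal sketch along the following lines may help. The helper names `meets_minimum` and `report` are illustrative and not part of Airflow or Breeze. The sketch assumes that the disk requirement refers to free space on the current filesystem, and it reads `/proc/meminfo`, which is Linux-specific.

```shell
# Thresholds taken from the requirements listed above.
MIN_RAM_GB=4
MIN_DISK_GB=40
MIN_CORES=2

# Succeeds when the measured value ($1) is at least the required value ($2).
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Prints one line per resource, ending in "ok" or "LOW".
report() {
    local name="$1" actual="$2" required="$3"
    if meets_minimum "${actual}" "${required}"; then
        echo "${name}: ${actual} (need ${required}) ok"
    else
        echo "${name}: ${actual} (need ${required}) LOW"
    fi
}

cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 0)"

# /proc/meminfo is Linux-specific; report 0 when it is not available.
# Values are rounded down, so a nominal 4GB machine may show up as 3.
if [ -r /proc/meminfo ]; then
    ram_gb="$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)"
else
    ram_gb=0
fi

# Free space (in GB) on the filesystem that holds the current directory.
disk_gb="$(df -Pk . | awk 'NR == 2 {print int($4 / 1024 / 1024)}')"

report "cores" "${cores}" "${MIN_CORES}"
report "RAM (GB)" "${ram_gb}" "${MIN_RAM_GB}"
report "free disk (GB)" "${disk_gb}" "${MIN_DISK_GB}"
```

A "LOW" line does not necessarily mean that the container setup will fail; it only flags a resource that is below the documented figure.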

Local machine development

If you do not work in a remote development environment, you will need these prerequisites:

  1. Docker Community Edition (you can also use Colima, see instructions below)
  2. Docker Compose
  3. Hatch (you can also use pyenv, pyenv-virtualenv or virtualenvwrapper)

The setup below describes installation on Ubuntu. It might be slightly different on other systems.

Docker Community Edition

  1. Installing required packages for Docker and setting up docker repo
sudo apt-get update

sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release

sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
  2. Install Docker Engine and containerd
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io
  3. Manage Docker as a non-root user
sudo groupadd docker
sudo usermod -aG docker $USER

Note

This is done so that a non-root user can access the docker command. After adding the user to the docker group, log out and log in again so that group membership is re-evaluated. On some Linux distributions, the system automatically creates this group.

  4. Test the Docker installation
docker run hello-world
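
If `docker run hello-world` fails with a permission error right after the `usermod` step, the current session may not have picked up the new group yet. The following is a small sketch for checking this; `has_group` is an illustrative helper, not a Docker command.

```shell
# Succeeds when group name $1 appears in the space-separated list $2.
has_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# id -nG prints the names of the groups active in the current session.
if has_group docker "$(id -nG)"; then
    echo "docker group is active in this session"
else
    echo "docker group not active yet: log out and back in (or run 'newgrp docker')"
fi
```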

Colima

If you use Colima as your container runtime engine, please follow the next steps:

  1. Install buildx manually and follow its instructions
  2. Link the Colima socket to the default socket path. Note that this may break other Docker servers
sudo ln -sf $HOME/.colima/default/docker.sock /var/run/docker.sock
  3. Change the Docker context to use default
docker context use default

Docker Compose

  1. Installing latest version of the Docker Compose plugin

Install using the repository:

sudo apt-get update
sudo apt-get install docker-compose-plugin

Install manually:

COMPOSE_VERSION="$(curl -s https://api.github.com/repos/docker/compose/releases/latest | grep '"tag_name":'\
| cut -d '"' -f 4)"

COMPOSE_URL="https://github.com/docker/compose/releases/download/${COMPOSE_VERSION}/\
docker-compose-$(uname -s)-$(uname -m)"

sudo curl -L "${COMPOSE_URL}" -o /usr/local/bin/docker-compose

sudo chmod +x /usr/local/bin/docker-compose

Note

This option requires you to manage updates manually. It is recommended that you set up Docker's repository for easier maintenance.

  2. Verifying installation
docker-compose --version
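
The two installation routes above typically result in different invocations: the `docker-compose-plugin` package is a Docker CLI plugin (invoked as `docker compose`), while the manual download creates a standalone `docker-compose` binary. The sketch below reports which one is available on your machine; `detect_compose` is an illustrative helper name.

```shell
# Prints the Compose invocation available on this machine, or "none".
detect_compose() {
    if docker compose version >/dev/null 2>&1; then
        echo "docker compose"      # installed as a Docker CLI plugin
    elif command -v docker-compose >/dev/null 2>&1; then
        echo "docker-compose"      # installed as a standalone binary
    else
        echo "none"
    fi
}

detect_compose
```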

Setting up virtual-env

  1. While you can use any virtualenv manager, we recommend using Hatch as your build and integration frontend; Airflow already uses the hatchling build backend. You can read more about Hatch and its use in Airflow in Local virtualenv. See [PEP-517](https://peps.python.org/pep-0517/#terminology-and-goals) for an explanation of what frontend and backend mean
  2. After creating the virtual environment, you need to install a few more required packages for Airflow. The command below adds basic system-level dependencies on a Debian/Ubuntu-like system. You will have to adapt it to install similar packages if your operating system is macOS or another flavour of Linux
sudo apt install openssl sqlite3 default-libmysqlclient-dev libmysqlclient-dev postgresql

If you want to install all Airflow providers, more system dependencies might be needed. For example, on a Debian/Ubuntu-like system, the following command installs all the dependencies needed when you use the devel-all extra while installing Airflow.

sudo apt install apt-transport-https apt-utils build-essential ca-certificates dirmngr \
freetds-bin freetds-dev git graphviz graphviz-dev krb5-user ldap-utils libffi-dev \
libkrb5-dev libldap2-dev libpq-dev libsasl2-2 libsasl2-dev libsasl2-modules \
libssl-dev locales lsb-release openssh-client sasl2-bin \
software-properties-common sqlite3 sudo unixodbc unixodbc-dev
  3. With Hatch you can enter the virtual environment with the hatch shell command; see Local virtualenvs for more details

Forking and cloning Project

  1. Go to https://github.com/apache/airflow/ and fork the project

    Forking Apache Airflow project
  2. Go to your GitHub account's fork of Airflow and click on Code; there you will find the link to your repo

    Cloning github fork of Apache airflow
  3. Follow Cloning a repository to clone the repo locally (you can also do it in your IDE, see the Using your IDE chapter below)

Note

On Windows-based machines, the Git line endings after cloning may differ from those on Unix-based systems, which might lead to unexpected behaviour when running the Breeze tooling. Manually setting the following property mitigates this issue. Set it to true on Windows.

git config core.autocrlf true

Configuring Pre-commit

Before committing changes to GitHub or raising a pull request, the code needs to be checked against certain quality standards such as spelling, code syntax, code formatting, compatibility with Apache License requirements, etc. This set of checks is applied when you commit your code.

CI tests GitHub

To avoid burden on CI infrastructure and to save time, Pre-commit hooks can be run locally before committing changes.

Note

We have recently started to recommend uv for our local development.

Note

Remember to have global python set to Python >= 3.9 - Python 3.8 is end-of-life already and we've started to use Python 3.9+ features in Airflow and accompanying scripts.
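
A quick way to check the interpreter version is sketched below. The `version_ge` helper is illustrative; it compares the major and minor parts numerically, so that 3.10 is treated as newer than 3.9 (a plain string comparison would get this wrong). It assumes the version string contains at least a major and a minor part.

```shell
# Succeeds when dotted version $1 is at least the major.minor version $2.
version_ge() {
    local have_major have_minor want_major want_minor rest
    have_major="${1%%.*}"; rest="${1#*.}"; have_minor="${rest%%.*}"
    want_major="${2%%.*}"; rest="${2#*.}"; want_minor="${rest%%.*}"
    [ "${have_major}" -gt "${want_major}" ] ||
        { [ "${have_major}" -eq "${want_major}" ] && [ "${have_minor}" -ge "${want_minor}" ]; }
}

# Falls back to 0.0 when python3 is not installed.
current="$(python3 -c 'import platform; print(platform.python_version())' 2>/dev/null || echo "0.0")"

if version_ge "${current}" "3.9"; then
    echo "python3 ${current} is new enough"
else
    echo "python3 ${current} is older than 3.9 (or not installed)"
fi
```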

Installing pre-commit is best done with uv (recommended) or pipx.

  1. Installing required packages

on Debian / Ubuntu, install via

sudo apt install libxml2-utils

on macOS, install via

brew install libxml2
  2. Installing pre-commit:
uv tool install pre-commit --with pre-commit-uv

You can add uv support for pre-commit even if you install it with pipx, using the commands below (pre-commit will then use uv to create virtualenvs for the hooks):

pipx install pre-commit
pipx inject pre-commit pre-commit-uv # optional, configures pre-commit to use uv to install virtualenvs
  3. Go to your project directory
cd ~/Projects/airflow
  4. Running pre-commit hooks
pre-commit run --all-files
  No-tabs checker......................................................Passed
  Add license for all SQL files........................................Passed
  Add license for all other files......................................Passed
  Add license for all rst files........................................Passed
  Add license for all JS/CSS/PUML files................................Passed
  Add license for all JINJA template files.............................Passed
  Add license for all shell files......................................Passed
  Add license for all python files.....................................Passed
  Add license for all XML files........................................Passed
  Add license for all yaml files.......................................Passed
  Add license for all md files.........................................Passed
  Add license for all mermaid files....................................Passed
  Add TOC for md files.................................................Passed
  Add TOC for upgrade documentation....................................Passed
  Check hooks apply to the repository..................................Passed
  black................................................................Passed
  Check for merge conflicts............................................Passed
  Debug Statements (Python)............................................Passed
  Check builtin type constructor use...................................Passed
  Detect Private Key...................................................Passed
  Fix End of Files.....................................................Passed
  ...........................................................................
  5. Running pre-commit for selected files
pre-commit run --files airflow/utils/decorators.py tests/utils/test_task_group.py
  6. Running specific hook for selected files
pre-commit run black --files airflow/utils/decorators.py tests/utils/test_task_group.py
  black...............................................................Passed
pre-commit run ruff --files airflow/utils/decorators.py tests/utils/test_task_group.py
  Run ruff............................................................Passed
  7. Enabling pre-commit checks before each commit. Once installed, pre-commit runs automatically when you commit and stops the commit if a check fails
cd ~/Projects/airflow
pre-commit install
git commit -m "Added xyz"
  8. To disable pre-commit
cd ~/Projects/airflow
pre-commit uninstall
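
Besides `--all-files` and `--files`, pre-commit can limit a run to the files changed between two Git refs via its `--from-ref` and `--to-ref` options. The sketch below wraps this in an illustrative helper, `precommit_since`. It assumes that the base branch is reachable as `main` in your clone (adjust the ref to your setup) and, by default, only prints the command; set `DRY_RUN=0` to execute it.

```shell
# Runs (or, by default, prints) pre-commit for files changed since ref $1.
precommit_since() {
    local base="${1:-main}"
    # --from-ref/--to-ref restrict the hooks to files changed between the two refs.
    set -- pre-commit run --from-ref "${base}" --to-ref HEAD
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

precommit_since main
```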

Setting up Breeze

Many development tasks require Breeze to be configured. Breeze is a development environment that uses Docker and Docker Compose; its main purpose is to provide a consistent and repeatable environment for all contributors and for CI. Using Breeze avoids the "works for me" syndrome: others can easily reproduce what you do, and because the Airflow CI uses the same environment to run all tests, you should be able to reproduce in your local environment the failures you see in CI.

  1. Install uv or pipx. We recommend installing uv as a general-purpose Python development environment; you can install it via https://docs.astral.sh/uv/getting-started/installation/. Alternatively, install pipx (>=1.2.1) by following the instructions in Install pipx. It is important to install pipx >= 1.2.1 to work around a packaging breaking change introduced in September 2023
  2. Run uv tool install -e ./dev/breeze (or pipx install -e ./dev/breeze) in your checked-out repository. Make sure to follow any instructions printed during the installation; this is needed to make sure that the breeze command is available in your PATH

Warning

If you see the warning below while running pipx, you have hit a known issue with packaging version 23.2: ⚠️ Ignoring --editable install option. pipx disallows it for anything but a local path, to avoid having to create a new src/ directory.

The workaround is to downgrade packaging below 23.2 (for example to 23.1) and re-run the pipx install command:

pip install "packaging==23.1"
pipx install -e ./dev/breeze --force
  3. Initialize breeze autocomplete
breeze setup autocomplete
  4. Initialize the Breeze environment with the required Python version and backend. This may take a while the first time.
breeze --python 3.9 --backend postgres

Note

If you encounter an error like "docker.credentials.errors.InitializationError: docker-credential-secretservice not installed or not available in PATH", you may execute the following command to fix it:

sudo apt install golang-docker-credential-helper

Once the package is installed, execute the breeze command again to resume image building.

  5. When you enter the Breeze environment you should see a prompt similar to root@e4756f6ac886:/opt/airflow#. This means that you are inside the Breeze container and ready to run most of the development tasks. You can leave the environment with exit and re-enter it with just the breeze command
  6. Once you are inside the Breeze environment, create the Airflow tables and users from the Breeze CLI. airflow db reset must be executed at least once for Breeze to create the database and tables. If you only run tests, however, the test database is initialized automatically for you
root@b76fcb399bb6:/opt/airflow# airflow db reset
root@b76fcb399bb6:/opt/airflow# airflow users create \
        --username admin \
        --firstname FIRST_NAME \
        --lastname LAST_NAME \
        --role Admin \
        --email admin@example.org
  7. Exiting the Breeze environment. After the above commands finish successfully you are still inside the container; type exit to leave it. The database created before will remain and the servers will keep running until you stop the Breeze environment completely
root@b76fcb399bb6:/opt/airflow# exit
  8. You can stop the environment (which means deleting the databases and database servers running in the background) via the breeze down command
breeze down

Using Breeze

  1. Start the Breeze environment with breeze start-airflow. It reuses the configuration from the last run (in this case, Python and backend are picked up from the last execution of breeze --python 3.9 --backend postgres) and automatically starts the webserver, backend and scheduler. It drops you into tmux with the scheduler in the bottom-left pane and the webserver in the bottom-right pane. Use [Ctrl + B] and the arrow keys to navigate.
breeze start-airflow

    Use CI image.

 Branch name:            main
 Docker image:           ghcr.io/apache/airflow/main/ci/python3.9:latest
 Airflow source version: 2.4.0.dev0
 Python version:         3.9
 Backend:                mysql 5.7


 Port forwarding:

 Ports are forwarded to the running docker containers for webserver and database
   * 12322 -> forwarded to Airflow ssh server -> airflow:22
   * 28080 -> forwarded to Airflow webserver -> airflow:8080
   * 29091 -> forwarded to Airflow FastAPI API -> airflow:9091
   * 25555 -> forwarded to Flower dashboard -> airflow:5555
   * 25433 -> forwarded to Postgres database -> postgres:5432
   * 23306 -> forwarded to MySQL database  -> mysql:3306
   * 26379 -> forwarded to Redis broker -> redis:6379

 Here are links to those services that you can use on host:
   * ssh connection for remote debugging: ssh -p 12322 airflow@127.0.0.1 (password: airflow)
   * Webserver: http://127.0.0.1:28080
   * FastAPI API:    http://127.0.0.1:29091
   * Flower:    http://127.0.0.1:25555
   * Postgres:  jdbc:postgresql://127.0.0.1:25433/airflow?user=postgres&password=airflow
   * Mysql:     jdbc:mysql://127.0.0.1:23306/airflow?user=root
   * Redis:     redis://127.0.0.1:26379/0
  • Alternatively, you can start the same setup using the following commands

    1. Start Breeze
    breeze --python 3.9 --backend postgres
    2. Open tmux
    root@0c6e4ff0ab3d:/opt/airflow# tmux
    3. Press Ctrl + B and "
    root@0c6e4ff0ab3d:/opt/airflow# airflow scheduler
    4. Press Ctrl + B and %
    root@0c6e4ff0ab3d:/opt/airflow# airflow webserver
  1. Now you can access the Airflow web interface on your local machine at http://127.0.0.1:28080 with user name admin and password admin

    Accessing local airflow
  2. Set up a PostgreSQL database connection in your database management tool of choice (e.g. DBeaver, DataGrip) with host 127.0.0.1, port 25433, user postgres, password airflow, and default schema airflow

    Connecting to postgresql
  3. Stopping breeze

If breeze was started with breeze start-airflow, this command will stop breeze and Airflow:

root@f3619b74c59a:/opt/airflow# stop_airflow
breeze down

If breeze was started with breeze --python 3.9 --backend postgres (or similar):

root@f3619b74c59a:/opt/airflow# exit
breeze down

Note

stop_airflow is available only when breeze is started with breeze start-airflow.
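
While the environment is running, the forwarded host ports shown in the breeze start-airflow output earlier in this section can be looked up from scripts. The following sketch hard-codes that port table; `breeze_host_port` is an illustrative helper and not a Breeze command, and the ports may differ if your Breeze configuration overrides them.

```shell
# Host ports as listed in the "Port forwarding" output of breeze start-airflow.
breeze_host_port() {
    case "$1" in
        ssh)       echo 12322 ;;
        webserver) echo 28080 ;;
        api)       echo 29091 ;;
        flower)    echo 25555 ;;
        postgres)  echo 25433 ;;
        mysql)     echo 23306 ;;
        redis)     echo 26379 ;;
        *)         echo "unknown service: $1" >&2; return 1 ;;
    esac
}

# Connection strings built from the credentials given in this section.
echo "Webserver: http://127.0.0.1:$(breeze_host_port webserver)"
echo "Postgres:  postgresql://postgres:airflow@127.0.0.1:$(breeze_host_port postgres)/airflow"
```

If psql is installed on the host, the printed Postgres URI can be passed to it directly as the connection argument.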

  1. Knowing more about Breeze
breeze --help

The following are some of the important topics in the Breeze documentation:

Installing airflow in the local venv

  1. It may require some packages to be installed; watch the output of the command to see which ones are missing
sudo apt-get install sqlite3 libsqlite3-dev default-libmysqlclient-dev postgresql
./scripts/tools/initialize_virtualenv.py
  2. Add the following line to ~/.bashrc in order to call the breeze command from anywhere
export PATH=${PATH}:"/home/${USER}/Projects/airflow"
source ~/.bashrc
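
Because ~/.bashrc may be sourced more than once in a session, a plain export can add the same directory to PATH repeatedly. A possible variant that appends the directory only when it is missing is sketched below; `path_append` is an illustrative helper, and `${HOME}/Projects/airflow` is assumed to be the same location as `/home/${USER}/Projects/airflow` above.

```shell
# Appends directory $1 to PATH unless it is already listed.
path_append() {
    case ":${PATH}:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="${PATH}:$1" ;;
    esac
}

path_append "${HOME}/Projects/airflow"
export PATH
```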

Running tests with Breeze

You can usually conveniently run tests in your IDE (see IDE below) using virtualenv but with Breeze you can be sure that all the tests are run in the same environment as tests in CI.

All tests are inside the ./tests directory.

  • Running Unit tests inside Breeze environment.

    Just run pytest filepath+filename to run the tests.

root@63528318c8b1:/opt/airflow# pytest tests/utils/test_dates.py
============================================================= test session starts ==============================================================
platform linux -- Python 3.9.20, pytest-8.3.3, pluggy-1.5.0 -- /usr/local/bin/python
cachedir: .pytest_cache
rootdir: /opt/airflow
configfile: pyproject.toml
plugins: anyio-4.6.0, time-machine-2.15.0, icdiff-0.9, rerunfailures-14.0, instafail-0.5.0, custom-exit-code-0.3.0, xdist-3.6.1, mock-3.14.0, cov-5.0.0, asyncio-0.24.0, requests-mock-1.12.1, timeouts-1.2.1
asyncio: mode=strict, default_loop_scope=None
setup timeout: 0.0s, execution timeout: 0.0s, teardown timeout: 0.0s
collected 4 items

tests/utils/test_dates.py::TestDates::test_parse_execution_date PASSED                                                                           [ 25%]
tests/utils/test_dates.py::TestDates::test_round_time PASSED                                                                                     [ 50%]
tests/utils/test_dates.py::TestDates::test_infer_time_unit PASSED                                                                                [ 75%]
tests/utils/test_dates.py::TestDates::test_scale_time_units PASSED                                                                               [100%]

================================================================== 4 passed in 3.30s ===================================================================
  • Running all the tests with Breeze by specifying the required Python version, backend and backend version
breeze --backend postgres --postgres-version 15 --python 3.9 --db-reset testing tests --test-type All
  • Running a specific type of test

    breeze --backend postgres --postgres-version 15 --python 3.9 --db-reset testing tests --test-type Core
  • Running integration tests for a specific test type

    breeze --backend postgres --postgres-version 15 --python 3.9 --db-reset testing tests --test-type All --integration mongo
  • For more information on Testing visit 09_testing.rst

  • Similarly to regular development, you can also debug while testing using your IDE, for more information, you may refer to

    Local and Remote Debugging in IDE
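
The identifiers printed in the pytest output above (file, class and test name joined by `::`) are pytest node IDs, and passing one back to pytest runs only that test. The helper below, `pytest_node_id`, is an illustrative way of assembling such an ID.

```shell
# Builds a pytest node ID: a file path followed by optional class and test
# names, joined by "::".
pytest_node_id() {
    local id="$1" part
    shift
    for part in "$@"; do
        id="${id}::${part}"
    done
    echo "${id}"
}

pytest_node_id tests/utils/test_dates.py TestDates test_round_time
# prints: tests/utils/test_dates.py::TestDates::test_round_time
```

Inside the Breeze container, the result can be used as `pytest "$(pytest_node_id tests/utils/test_dates.py TestDates test_round_time)"`.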

Contribution guide

  • To know how to contribute to the project visit README.rst

Raising Pull Request

  1. Go to your GitHub account, open your fork of the project and click on Branches

    Goto fork and select branches
  2. Click the New pull request button on the branch from which you want to raise a pull request

    Accessing local airflow
  3. Add title and description as per Contributing guidelines and click on Create pull request

    Accessing local airflow

Syncing Fork and rebasing Pull request

Often it takes several days or weeks to discuss and iterate on a PR until it is ready to merge. In the meantime new commits are merged and you might run into conflicts, so you should periodically synchronize main in your fork with the apache/airflow main and rebase your PR on top of it. The following describes how to do it.
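
One possible way to do this from the command line is sketched below, using standard Git commands. The helper name `rebase_on_upstream` and the remote name `apache` are illustrative choices and not project conventions confirmed by this guide.

```shell
# Rebases the current branch on top of <remote>/<branch>.
# Assumption: a remote (here called "apache") points at the apache/airflow repository.
rebase_on_upstream() {
    local remote="${1:-apache}" branch="${2:-main}"
    git fetch "${remote}" || return 1
    git rebase "${remote}/${branch}"
}

# One-time setup, then typical use from your PR branch:
#   git remote add apache https://github.com/apache/airflow.git
#   rebase_on_upstream apache main
#   git push --force-with-lease origin <your-branch>
```

The `--force-with-lease` option is used in place of a plain force push so that the push is refused if the remote branch changed since you last fetched it. If the rebase stops on conflicts, resolve them and continue with `git rebase --continue`.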

Using your IDE

If you are familiar with Python development and use your favourite editor, Airflow can be set up similarly to your other projects. However, if you need specific instructions for your IDE you will find more detailed instructions here:

Using Remote development environments

In order to use a remote development environment, you usually need a paid account, but you do not have to set up your local machine for development.


Once you have your environment set up, you can start contributing to Airflow. You can find more about ways you can contribute in the How to contribute document.