From 89084722df617c69336b71487f87e8bc409bedb8 Mon Sep 17 00:00:00 2001 From: David McClosky Date: Sun, 18 Feb 2024 12:02:40 -0500 Subject: [PATCH 1/3] Revamp docs, add Quick Start doc We now have a Quick Start based on a more complete template (https://github.com/quant-aq/aeromancy-project-template/pull/3). It provides a guided tour of the basic components, but the current draft is still missing instructions on how to actually run experiments. Other docs have been reorganized a bit and "customizing.md" has been pulled out of "scaffolding.md". --- README.md | 48 ++----- docs/docs/customizing.md | 46 +++++++ docs/docs/index.md | 41 +++++- docs/docs/quick_start.md | 289 +++++++++++++++++++++++++++++++++++++++ docs/docs/scaffolding.md | 58 +------- docs/docs/setup.md | 41 ++++++ docs/mkdocs.yml | 10 +- 7 files changed, 436 insertions(+), 97 deletions(-) create mode 100644 docs/docs/customizing.md create mode 100644 docs/docs/quick_start.md create mode 100644 docs/docs/setup.md diff --git a/README.md b/README.md index 1be50ee..fa25915 100644 --- a/README.md +++ b/README.md @@ -18,54 +18,32 @@ by providing both new infrastructure (a more comprehensive versioning scheme including both system runtimes and external datasets) and a corresponding set of best practices to ensure experiments are maximally trackable. -In its current form, Aeromancy requires a fairly specific software stack: +In its current form, Aeromancy requires a fairly specific software stack: (hey, +we said it was opinionated) - **Experiment tracker**: [Weights and Biases](https://wandb.ai) - **Object storage** (artifacts): S3-compatible, e.g., [Ceph](https://github.com/ceph/ceph) - **Virtualization**: [Docker](https://www.docker.com/) +- **Python Package Manager**: [pdm](https://pdm.fming.dev) +- **Revision Control**: [Git](https://git-scm.com/) **Note:** As is likely obvious, Aeromancy documentation is in a very early state. As this is a pre-release support may be limited. 
For now, we include a couple pointers for how to setup your environment for Aeromancy. -## Getting started +## Documentation overview -**Coming soon**: A proper Getting Started section. +- If you're new to Aeromancy, [start here](docs/docs/quick_start.md)! +- In the Developer Reference section of the documentation, we include some + design docs which provide an [architectural overview](docs/docs/scaffolding.md) and a + [glossary](docs/docs/tasks.md) of terms. +- To see autogenerated docs for code from this repo, you'll need to start a + local doc server (`pdm doc`). -To quickly set up an Aeromancy project, we've created a -[Copier](https://copier.readthedocs.io/en/stable/) template. See instructions at -the -[quant-aq/aeromancy-project-template](https://github.com/quant-aq/aeromancy-project-template?tab=readme-ov-file#quick-start). - -## Requirements - -- Python 3.10.5 or higher -- [`pdm`](https://pdm.fming.dev): Install via `pip install --user pdm` then - install Aeromancy packages with `pdm install`. -- **Environment variables**: - - S3 backend location and credentials: - - `AEROMANCY_AWS_ACCESS_KEY_ID` - - `AEROMANCY_AWS_SECRET_ACCESS_KEY` - - `AEROMANCY_AWS_S3_ENDPOINT_URL` - - `AEROMANCY_AWS_REGION` - - `WANDB_API_KEY` (from [Weights and Biases](https://wandb.ai)) -- **SSH Authentication**: You'll want `ssh-agent` setup if you need to access - private GitHub repositories. Check out these - [instructions](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent). 
- -### Mac OS - -- Use [Homebrew](https://brew.sh/) to install the following: - - `brew install apache-arrow@13.0.0_5 bat@0.23.0 graphviz@8.1.0 - openblas@0.3.24 pre-commit@3.3.3` -- Install Docker Desktop from [docker.com](https://www.docker.com/) (not Brew - since it has a trickier upgrade story) - -## Common commands +## Common development commands - `pdm lint`: Run pre-commit linters - `pdm test`: Run test suite - `pdm doc`: Start doc server (see also the [public - version](https://quant-aq.github.io/aeromancy/) for the latest checked in - version) + version](https://quant-aq.github.io/aeromancy/) for the latest release) diff --git a/docs/docs/customizing.md b/docs/docs/customizing.md new file mode 100644 index 0000000..ff847e9 --- /dev/null +++ b/docs/docs/customizing.md @@ -0,0 +1,46 @@ + +# Customizing Aeromancy projects + +To quickly set up an Aeromancy project, we've created a +[Copier](https://copier.readthedocs.io/en/stable/) template. See instructions at +the +[quant-aq/aeromancy-project-template](https://github.com/quant-aq/aeromancy-project-template?tab=readme-ov-file#quick-start). + +In the generated Python project setup (`pyproject.toml`), you may also want to +adjust: + +- **Extra Python packages:** Add them with `pdm add `. See [PDM + docs](https://pdm.fming.dev/latest/usage/dependency/) for more information on + this. +- **`pdm` [scripts](https://pdm.fming.dev/latest/usage/scripts/)**: Some of + these are necessary for running Aeromancy (like `pdm go`), but you can add + more if there are common tasks for your project. +- **Extra `docker run` arguments**: (E.g., mounting + [volumes](https://docs.docker.com/engine/reference/commandline/run/#mount)). + These can be baked `pdm go` script with `--extra-docker-run-args='...'`. The + [template](https://github.com/quant-aq/aeromancy-project-template) includes a + standard volume mapping (`data/`) for ingesting datasets. 
+- **Extra Debian packages:** (outside of those included by Aeromancy), you may + want to bake them into the `pdm go` script with `--extra-debian-package='...'` + (specify the flag once per package name). + +## Filesystem layout + +Ultimately, the structure of an Aeromancy project should look something like +this: + +```text +/ + pyproject.toml + pdm.lock + main.py # AeroMain + src/ + / + .py + .py +``` + +The structure of the classes containing your +[`Action`][aeromancy.action.Action](s) and +[`ActionBuilder`][aeromancy.action_builder.ActionBuilder] is flexible -- they +just need to be importable in AeroMain. diff --git a/docs/docs/index.md b/docs/docs/index.md index 612c7a5..fa1c06a 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -1 +1,40 @@ ---8<-- "README.md" +# Aeromancy + +[![Tests](https://github.com/quant-aq/aeromancy/actions/workflows/ci.yml/badge.svg)](https://github.com/quant-aq/aeromancy/actions/workflows/ci.yml) +[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black) +[![pdm-managed](https://img.shields.io/badge/pdm-managed-blueviolet)](https://pdm.fming.dev) +[![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff) +[![pre-commit enabled](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit&logoColor=white)](https://pre-commit.com/) +![Apache 2.0 licensed](https://img.shields.io/github/license/quant-aq/aeromancy) + +**Aeromancy** is an opinionated philosophy and open-sourced framework that +closely tracks experimental runtime environments for more reproducible machine +learning. In existing experiment trackers, it’s easy to miss important details +about how an experiment was run, e.g., which version of a dataset was used as +input or the exact versions of library dependencies. Missing these details can +make replicability more difficult. 
Aeromancy aims to make this process smoother +by providing both new infrastructure (a more comprehensive versioning scheme +including both system runtimes and external datasets) and a corresponding set of +best practices to ensure experiments are maximally trackable. + +In its current form, Aeromancy requires a fairly specific software stack: (hey, +we said it was opinionated) + +- **Experiment tracker**: [Weights and Biases](https://wandb.ai) +- **Object storage** (artifacts): S3-compatible, e.g., + [Ceph](https://github.com/ceph/ceph) +- **Virtualization**: [Docker](https://www.docker.com/) +- **Python Package Manager**: [pdm](https://pdm.fming.dev) +- **Revision Control**: [Git](https://git-scm.com/) + +**Note:** As is likely obvious, Aeromancy documentation is in a very early +state. As this is a pre-release support may be limited. + +## Documentation overview + +- If you're new to Aeromancy, [start here](quick_start.md)! +- In the Developer Reference section of the documentation, we include some + design docs which provide an [architectural overview](scaffolding.md) and a + [glossary](tasks.md) of terms. +- Lastly, we have autogenerated documentation in [Code + Reference](reference/aeromancy/index.md). diff --git a/docs/docs/quick_start.md b/docs/docs/quick_start.md new file mode 100644 index 0000000..881f3ee --- /dev/null +++ b/docs/docs/quick_start.md @@ -0,0 +1,289 @@ +# Quick start + +This guide will walk you through some of the basic Aeromancy workflows. We'll be +using Aeromancy in "development" mode which lets us focus on key Aeromancy +concepts. + +## Creating a project + +To quickly set up an Aeromancy project, we've created a +[Copier](https://copier.readthedocs.io/en/stable/) template at +[quant-aq/aeromancy-project-template](https://github.com/quant-aq/aeromancy-project-template?tab=readme-ov-file#quick-start). +Let's start by creating a new project called `aerodemo`: + +1. 
Install [PDM](https://pdm.fming.dev) with + [Copier](https://copier.readthedocs.io/en/stable/) support: + + ```bash + pip install --user "pdm[copier]" + ``` + +2. Set up a new Aeromancy-managed project with the template. This will create + the project directory `aerodemo` for you: + + ```bash + copier copy --trust "gh:quant-aq/aeromancy-project-template" aerodemo + ``` + + The template will various questions. For the purpose of this Quick Start, + it's fine to fill in `aerodemo` or defaults for all fields. + +3. Install project dependencies: + + ```bash + cd aerodemo + git init + pdm install --dev --no-self + ``` + +## What's in an Aeromancy project? + +Aeromancy projects contain several different components. For now, we'll start +with the three most important: (see [Tasks, Trackers, and Actions](tasks.md) for +more details on these and the other main classes) + +### Actions + +[`Action`][aeromancy.action.Action]s define a specific data transformation you'd +like to track with Aeromancy (e.g., training a model or performing a step in a +data processing pipeline). If you're familiar with +[Luigi](https://luigi.readthedocs.io/en/stable/) and other pipeline builders, +this may be familiar. [`Action`][aeromancy.action.Action]s roughly correspond to +a run on [Weights and Biases](https://docs.wandb.ai/quickstart) (Aeromancy will +help you create the runs on the Weights and Biases side). + +In `src/aerodemo/actions.py`, we include three example +[`Action`][aeromancy.action.Action]s: `ExampleStoreAction`,`ExampleTrainAction`, +and `ExampleEvaluationAction`. Let's walk through these. + +#### Creating `Artifact`s with `ExampleStoreAction` + +```python +class ExampleStoreAction(Action): + """Example Aeromancy `Action` to store an existing dataset.""" +``` + +[`Action`][aeromancy.action.Action]s have class attributes help you organize +your Actions and will be exposed later in experiment trackers like [Weights and +Biases](https://docs.wandb.ai/quickstart). 
From most general to most specific, here are the three organizational levels Weights and Biases (and thus Aeromancy) provides: + +- `project_name` (defined by [`ActionBuilder`][aeromancy.action_builder.ActionBuilder]) + - `job_group` + - `job_type` + - individual [`Action`][aeromancy.action.Action]s + + Our example represents a typical ML flow with three +[`Action`][aeromancy.action.Action]s: + +1. `job_group=model, job_type=store-dataset`: Store the dataset as a tracked + artifact in Aeromancy (more on artifacts soon!) +2. `job_group=model, job_type=train-model`: Train a model from the dataset +3. `job_group=model, job_type=eval-emodel`: Evaluate a model on the dataset + +```python + job_type = "store-dataset" + job_group = "model" +``` + +`outputs()` tells Aeromancy what artifacts this Action produces. Most +[`Action`][aeromancy.action.Action]s only create a single thing (e.g., a +training action creates a model, an evaluation action could output its +predictions over the dataset) but multiple outputs are allowed. Also note that +these can be dynamically generated based on the configuration of the +[`Action`][aeromancy.action.Action]. + +```python + @override + def outputs(self) -> list[str]: + return ["example-dataset"] +``` + +`run()` defines the actual logic that should be tracked (train a model, +transform a dataset, etc.). Within `run()`, we're responsible for declaring +input and output artifacts with the provided +[`Tracker`][aeromancy.tracker.Tracker]. Much of the work in this example centers +around configuring an output artifact with `tracker.declare_output()`. Why is +this so complicated? Declaring an output artifact has several effects which +Aeromancy will bind together: + +1. It creates a tracked (versioned) artifact from a set of local files. +2.
This makes the artifact usable in downstream + [`Action`][aeromancy.action.Action] -- we'll access the files through + Aeromancy rather than directly from disk, in fact, since it will ensure that + we're using the correct version of it. +3. It will store the artifact to an S3-compatible blob store, creating a + permanent and versioned reference to the contents (well, as permanent + as the blob store). +4. It will create a corresponding Weights and Biases artifact which will + be associated with the corresponding Weights and Biases run and the + Aeromancy Artifact. + +```python + @override + def run(self, tracker: Tracker) -> None: + print("Hello world from ExampleStoreAction.") +``` + +Our dataset already exists on disk in a special directory (`data/`) which is +accessible both inside and outside the Docker container. This should generally +only be used for initial dataset ingestion -- downstream +[`Action`][aeromancy.action.Action]s should not use this path. + +```python + dataset_paths = [ + Path("data/example_train_data.txt"), + Path("data/example_test_data.txt"), + ] +``` + +We can associate arbitrary metrics with the dataset: + +```python + dataset_metadata = { + "num_train_records": len(dataset_paths[0].read_text().splitlines()), + "num_test_records": len(dataset_paths[1].read_text().splitlines()), + } +``` + +We'll use `outputs()` from above to keep artifact names in sync. + +```python + [dataset_artifact_name] = self.outputs() +``` + +Now we're ready to declare `dataset_artifact_name` as an output dependency with +`tracker.declare_output`. We'll go over each argument: + +- `name`: This is the name of the artifact we're declaring. This name is used in + many places: + + 1. It needs to match one of the names in the list of artifact names returned by + `outputs()`, so it will be part of the name of any jobs that run this + Action. + 2. Downstream [`Action`][aeromancy.action.Action]s will be able to refer to + this artifact by this name. + 3.
This is also the name of the corresponding Weights and Biases artifact. +- `local_filenames`: A list of files that should be included in the artifact. +- `s3_destination`: Where to store the artifact in the blob store -- this + includes the bucket and key (a path prefix). This is purely for organizational + purposes -- naming destinations clearly could also aid with debugging but in + general, you won't need to know or use S3 paths. +- `artifact_type`: This is purely for organizational purposes and will be exposed + in Weights and Biases. We recommend a human-readable version of the file type. +- `metadata`: This is an optional property for any extra metadata that you'd + like to associate with the artifact (it will also be exposed in Weights and + Biases). It can also include nested data and store a wide range of types. +- `strip_prefix`: This is the portion of the `local_filenames` paths that we + don't want to include in our artifact names on the blob store. In this + case, this means we'll store `data/example_train_data.txt` as + `dataset/bogus-example_train_data.txt` in the `example-bucket` bucket (the + `dataset/` comes from our `s3_destination` key). + +```python + tracker.declare_output( + name=dataset_artifact_name, + local_filenames=dataset_paths, + s3_destination=S3Object("example-bucket", "dataset/"), + artifact_type="dataset", + metadata=dataset_metadata, + strip_prefix="data/", + ) +``` + +We've created our first [`Action`][aeromancy.action.Action]. Next, let's look at +`ExampleTrainAction` which will use the dataset stored by `ExampleStoreAction`. + +#### Using configuration options and `Artifact`s with `ExampleTrainAction` + +We'll focus on the novel parts of `ExampleTrainAction` (see the generated code +for some additional commentary). First, we'll introduce a configuration +parameter. Parameters can be anything that changes behavior or helps you +organize your experiments -- these include hyperparameters, toggling features, +or your own metadata.
Let's look at `__init__` where `learning_rate` is our +example configuration parameter. Also note that we take a reference to a +`ExampleStoreAction`. This will indicate a dependency and help Aeromancy know +that it needs to run first. You might also be wondering about where +`store_dataset` and `learning_rate` are set -- this will happen later in our +[`ActionBuilder`][aeromancy.action_builder.ActionBuilder]. + +```python + def __init__( + self, + store_dataset: ExampleStoreAction, + learning_rate: float, + ): + self.learning_rate = learning_rate +``` + +We need to call our superconstructor which include `store_dataset` as a parent +Action as well as our configuration parameter: + +```python + Action.__init__(self, parents=[store_dataset], learning_rate=learning_rate) +``` + +In our `run()` method, now we'll be able to use the artifact from our parent: + +```python + @override + def run(self, tracker: Tracker) -> None: + print("Hello world from ExampleTrainAction.") +``` + +This demonstrates `get_io()`, a helper method to simultaneously provide input +and output artifact names. Most [`Action`][aeromancy.action.Action]s include a +call to this. Note that inputs and outputs are each lists which is why we're +using brackets to unpack these. Also note that the order of the input artifact +names will follow the order of parent [`Action`][aeromancy.action.Action]s (see +`ExampleEvaluationAction` for an example of an +[`Action`][aeromancy.action.Action] with multiple parents and thus multiple +input artifacts). + +```python + [dataset_artifact_name], [model_artifact_name] = self.get_io() +``` + +Once we know the name of our input artifact, we need to declare it as a +dependency. This is the counterpart of `tracker.declare_output()` from +`ExampleStoreAction`. It will resolve the artifact to the appropriate version +and return the paths we should use to read the dataset. 
+ +```python + dataset_paths = tracker.declare_input(dataset_artifact_name) + + train_data = dataset_paths[0].read_text() + print(f"Training data: {train_data!r}") +``` + +### ActionBuilder + +An [`ActionBuilder`][aeromancy.action_builder.ActionBuilder] +(`src/aerodemo/action_builder.py`) is responsible for +constructing a dependency graph of [`Action`][aeromancy.action.Action]s. + +TODO: code walkthrough + +### AeroMain + +`src/main.py`, typically referred to as AeroMain is the main entry point to an +Aeromancy project, responsible for determining configuration options, +constructing an [`ActionBuilder`][aeromancy.action_builder.ActionBuilder], and +launching it. + +TODO: code walkthrough + +## Running our first experiments + +TODO: `pdm go` etc. + +## What's next? + +We've gone through all the main components you'll need to define to run +experiments in Aeromancy. Next up, you might want to: + +- [Configure](setup.md) Aeromancy to work with Weights and Biases and + S3-compatible blob stores +- TODO: Developing and Debugging in Aeromancy (`bailout`, `--debug`, common + pitfalls, `aeroset`, `aeroview`, `rerun` commands) +- [Customizing](customizing.md) your Aeromancy project +- TODO: best practices diff --git a/docs/docs/scaffolding.md b/docs/docs/scaffolding.md index b55ebfc..540d0b1 100644 --- a/docs/docs/scaffolding.md +++ b/docs/docs/scaffolding.md @@ -6,7 +6,7 @@ doing the cross-references. --> In order to enable tracking, Aeromancy is rather opinionated about how projects are set up. A "project" in this case means a pipeline of tasks, potentially configurable through CLI flags. This document provides an overview of the -components involved and how to set up a new Aeromancy project. +components involved. This diagram roughly shows the flow: @@ -81,59 +81,3 @@ generated [`Action`][aeromancy.action.Action]s. See [Tasks, Trackers, and Actions](tasks.md) for more information on these objects. 
- -## Creating a new Aeromancy project - -In order to set up a new project, you'll need a Git repository with these -components: - -- Actions (subclasses of [`Action`][aeromancy.action.Action] with specific logic - for your tasks) -- An [`ActionBuilder`][aeromancy.action_builder.ActionBuilder] to instantiate - the [`Action`][aeromancy.action.Action] objects and describe their - dependencies -- An "AeroMain" script to parse any project-specific options and bring it all - together - -To quickly set up an Aeromancy project, we've created a -[Copier](https://copier.readthedocs.io/en/stable/) template. See instructions at -the -[quant-aq/aeromancy-project-template](https://github.com/quant-aq/aeromancy-project-template?tab=readme-ov-file#quick-start). -In the generated Python project setup (`pyproject.toml`), you may also want to -adjust: - -- **Extra Python packages:** Add them with `pdm add `. See [PDM - docs](https://pdm.fming.dev/latest/usage/dependency/) for more information on - this. -- **`pdm` [scripts](https://pdm.fming.dev/latest/usage/scripts/)**: Some of - these are necessary for running Aeromancy (like `pdm go`), but you can add - more if there are common tasks for your project. -- **Extra `docker run` arguments**: E.g., mounting - [volumes](https://docs.docker.com/engine/reference/commandline/run/#mount)). - These can be baked `pdm go` script with `--extra-docker-run-args='...'`. -- **Extra Debian packages:** (outside of those included by Aeromancy), you may - want to bake them into the `pdm go` script with `--extra-debian-package='...'` - (specify the flag once per package name). -- **Development environment (linters, etc.):** Aeromancy encourages the use of - the `ruff` linter and `Black` formatter, but these are customizable. 
- -### Filesystem layout - -Ultimately, the structure of an Aeromancy project should look something like -this: - -```text -/ - pyproject.toml - pdm.lock - main.py # AeroMain - src/ - / - .py - .py -``` - -The structure of the classes containing your -[`Action`][aeromancy.action.Action](s) and -[`ActionBuilder`][aeromancy.action_builder.ActionBuilder] is flexible -- they -just need to be importable in AeroMain. diff --git a/docs/docs/setup.md b/docs/docs/setup.md new file mode 100644 index 0000000..fd3ef95 --- /dev/null +++ b/docs/docs/setup.md @@ -0,0 +1,41 @@ +# Installing and setting up Aeromancy + +The easiest way to set up Aeromancy is to follow the [Quick +Start](quick_start.md) guide. This document includes additional setup +instructions for running Aeromancy in "production" mode. + +- **Python**: Aeromancy works with Python 3.10.5 or higher +- **Python package manager**: Aeromancy currently requires [`pdm`](https://pdm.fming.dev). + + - Install via `pip install --user pdm` + +- **Environment variables**: + + - To use an S3-compatible backend (e.g., + [Ceph](https://github.com/ceph/ceph)), you'll need to set these + environment variables: + + - `AEROMANCY_AWS_ACCESS_KEY_ID` + - `AEROMANCY_AWS_SECRET_ACCESS_KEY` + - `AEROMANCY_AWS_S3_ENDPOINT_URL` + - `AEROMANCY_AWS_REGION` (can be left empty if it doesn't apply) + + - You'll also need to set `WANDB_API_KEY` (from [Weights and Biases](https://wandb.ai)) + +- **SSH Authentication**: You'll want `ssh-agent` set up if you need to access + private GitHub repositories. Check out these + [instructions](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent). + +## Linux + +You'll want to install some packages.
On Debian, you can use: + +- `apt install bat graphviz libopenblas-dev pre-commit docker.io` + +## Mac OS + +- We recommend using [Homebrew](https://brew.sh/) to install the following: + - `brew install apache-arrow@13.0.0_5 bat@0.23.0 graphviz@8.1.0 + openblas@0.3.24 pre-commit@3.3.3` +- Install Docker Desktop from [docker.com](https://www.docker.com/) (not Brew + since it has a trickier upgrade story) diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index cab935a..599d6d5 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -7,8 +7,12 @@ site_dir: "site" nav: - Home: - - Overview: index.md - - Scaffolding and new projects: scaffolding.md + - Introduction: index.md + - Quick Start: quick_start.md + - Setting up Aeromancy: setup.md + - Customizing your project: customizing.md + - Developer Reference: + - Scaffolding: scaffolding.md - Tasks, Trackers, and Actions: tasks.md - Code Reference: reference/ @@ -28,7 +32,6 @@ markdown_extensions: - pymdownx.snippets: base_path: - docs - - ../README.md check_paths: true - pymdownx.superfences: custom_fences: @@ -59,4 +62,3 @@ plugins: watch: - "../src" - - "../README.md" From 975271fc07f951d979951c9e01844bf04a67ce30 Mon Sep 17 00:00:00 2001 From: David McClosky Date: Fri, 23 Feb 2024 11:16:12 -0500 Subject: [PATCH 2/3] Match renames in the template, lots of new text (see https://github.com/quant-aq/aeromancy-project-template/pull/3/commits/eab1af7ebb1125a79a531dfbda558afea3b640af) --- docs/docs/index.md | 5 +- docs/docs/quick_start.md | 291 ++++++++++++++++++++++++++++++++------- docs/mkdocs.yml | 2 + 3 files changed, 247 insertions(+), 51 deletions(-) diff --git a/docs/docs/index.md b/docs/docs/index.md index fa1c06a..7078628 100644 --- a/docs/docs/index.md +++ b/docs/docs/index.md @@ -27,8 +27,9 @@ we said it was opinionated) - **Python Package Manager**: [pdm](https://pdm.fming.dev) - **Revision Control**: [Git](https://git-scm.com/) -**Note:** As is likely obvious, Aeromancy documentation is in a very early -state.
As this is a pre-release support may be limited. +!!! note + Aeromancy documentation is still in a very early state. As this is a + pre-release, support may be limited. ## Documentation overview diff --git a/docs/docs/quick_start.md b/docs/docs/quick_start.md index 881f3ee..8756fa0 100644 --- a/docs/docs/quick_start.md +++ b/docs/docs/quick_start.md @@ -1,8 +1,6 @@ # Quick start -This guide will walk you through some of the basic Aeromancy workflows. We'll be -using Aeromancy in "development" mode which lets us focus on key Aeromancy -concepts. +This guide will walk you through some of the basic Aeromancy workflows. ## Creating a project @@ -25,8 +23,8 @@ Let's start by creating a new project called `aerodemo`: copier copy --trust "gh:quant-aq/aeromancy-project-template" aerodemo ``` - The template will various questions. For the purpose of this Quick Start, - it's fine to fill in `aerodemo` or defaults for all fields. + The template will ask a lot of questions. For the purpose of this Quick + Start, it's fine to fill in `aerodemo` or defaults for all fields. 3. Install project dependencies: @@ -53,14 +51,18 @@ a run on [Weights and Biases](https://docs.wandb.ai/quickstart) (Aeromancy will help you create the runs on the Weights and Biases side). In `src/aerodemo/actions.py`, we include three example -[`Action`][aeromancy.action.Action]s: `ExampleStoreAction`,`ExampleTrainAction`, +[`Action`][aeromancy.action.Action]s: `ExampleIngestAction`,`ExampleTrainAction`, and `ExampleEvaluationAction`. Let's walk through these. -#### Creating `Artifact`s with `ExampleStoreAction` +!!! note + We'll likely be simplifying the [`Action`][aeromancy.action.Action] API in + the near future. We hope to streamline it significantly. 
+ +#### Creating `Artifact`s with `ExampleIngestAction` ```python -class ExampleStoreAction(Action): - """Example Aeromancy `Action` to store an existing dataset.""" +class ExampleIngestAction(Action): + """Example Aeromancy `Action` to ingest an existing dataset.""" ``` [`Action`][aeromancy.action.Action]s have class attributes help you organize @@ -75,13 +77,13 @@ Biases](https://docs.wandb.ai/quickstart). From most general to most specific, h Our example represents a typical ML flow with three [`Action`][aeromancy.action.Action]s: -1. `job_group=model, job_type=store-dataset`: Store the dataset as a tracked +1. `job_group=model, job_type=ingest-dataset`: Store the dataset as a tracked artifact in Aeromancy (more on artifacts soon!) 2. `job_group=model, job_type=train-model`: Train a model from the dataset 3. `job_group=model, job_type=eval-emodel`: Evaluate a model on the dataset ```python - job_type = "store-dataset" + job_type = "ingest-dataset" job_group = "model" ``` @@ -102,26 +104,29 @@ these can be dynamically generated based on the configuration of the transform a dataset, etc.). Within `run()`, we're responsible for declaring input and output artifacts with the provided [`Tracker`][aeromancy.tracker.Tracker]. Much of the work in this example centers -around configuring an output artifact with `tracker.declare_output()`. Why is -this so complicated? Declaring an output artifact has several effects which -Aeromancy will bind together: - -1. It creates a tracked (versioned) artifact from a set of local files. -2. This makes the artifact usable in downstream - [`Action`][aeromancy.action.Action] -- we'll access the files through - Aeromancy rather than directly from disk, in fact, since it will ensure that - we're using the correct version of it. -3. It will store the artifact to an S3-compatible blob store, creating a - permanent and versioned reference to the contents (well, as permanent - as the blob store). -4. 
It will create a corresponding Weights and Biases artifact which will - be associated with the corresponding Weights and Biases run and the - Aeromancy Artifact. +around configuring an output artifact with +[`tracker.declare_output`][aeromancy.Tracker.declare_output]. + +!!! question + Why is this so complicated? Declaring an output artifact has several effects + which Aeromancy will bind together: + + 1. It creates a tracked (versioned) artifact from a set of local files. + 2. This makes the artifact usable in downstream + [`Action`][aeromancy.action.Action] -- we'll access the files through + Aeromancy rather than directly from disk, in fact, since it will ensure that + we're using the correct version of it. + 3. It will store the artifact to an S3-compatible blob store, creating a + permanent and versioned reference to the contents (well, as permanent + as the blob store). + 4. It will create a corresponding Weights and Biases artifact which will + be associated with the corresponding Weights and Biases run and the + Aeromancy Artifact. ```python @override def run(self, tracker: Tracker) -> None: - print("Hello world from ExampleStoreAction.") + print("Hello world from ExampleIngestAction.") ``` Our dataset already exists on disk in a special directory (`data/`) which is @@ -152,7 +157,8 @@ We'll use `outputs()` from above to keep artifact names in sync. ``` Now we're ready to declare `dataset_artifact_name` as an output dependency with -`tracker.declare_output`. We'll go over each argument: +[`tracker.declare_output`][aeromancy.Tracker.declare_output]. We'll go over each +argument: - `name`: This is the name of the artifact we're declaring. This name is used in many places: @@ -191,7 +197,7 @@ Now we're ready to declare `dataset_artifact_name` as an output dependency with ``` We've created our first [`Action`][aeromancy.action.Action]. Next, let's look at -`ExampleTrainAction` which will use the dataset stored by `ExampleStoreAction`. 
+`ExampleTrainAction` which will use the dataset stored by `ExampleIngestAction`. #### Using configuration options and `Artifact`s with `ExampleTrainAction` @@ -201,25 +207,25 @@ parameter. Parameters can be anything that changes behavior or helps you organize your experiments -- these include hyperparameters, toggling features, or your own metadata. Let's look at `__init__` where `learning_rate` is our example configuration parameter. Also note that we take a reference to a -`ExampleStoreAction`. This will indicate a dependency and help Aeromancy know +`ExampleIngestAction`. This will indicate a dependency and help Aeromancy know that it needs to run first. You might also be wondering about where -`store_dataset` and `learning_rate` are set -- this will happen later in our +`ingest_dataset` and `learning_rate` are set -- this will happen later in our [`ActionBuilder`][aeromancy.action_builder.ActionBuilder]. ```python def __init__( self, - store_dataset: ExampleStoreAction, + ingest_dataset: ExampleIngestAction, learning_rate: float, ): self.learning_rate = learning_rate ``` -We need to call our superconstructor which include `store_dataset` as a parent +We need to call our superconstructor, which includes `ingest_dataset` as a parent Action as well as our configuration parameter: ```python - Action.__init__(self, parents=[store_dataset], learning_rate=learning_rate) + Action.__init__(self, parents=[ingest_dataset], learning_rate=learning_rate) ``` In our `run()` method, now we'll be able to use the artifact from our parent: @@ -244,8 +250,9 @@ input artifacts). ``` Once we know the name of our input artifact, we need to declare it as a -dependency. This is the counterpart of `tracker.declare_output()` from -`ExampleStoreAction`. +dependency. This is the counterpart of +[`tracker.declare_output`][aeromancy.Tracker.declare_output] from +`ExampleIngestAction`.
It will resolve the artifact to the appropriate version and return the paths we should use to read the dataset. ```python @@ -255,35 +262,221 @@ and return the paths we should use to read the dataset. print(f"Training data: {train_data!r}") ``` +#### Logging metrics + +As we've already seen, we can associate arbitrary metadata/metrics with +artifacts as part of +[`tracker.declare_output`][aeromancy.Tracker.declare_output]. We can also log +metrics about the status of an `Action` with +[`tracker.log`][aeromancy.Tracker.log]. Returning to the `run()` method in +`ExampleTrainAction`: + +```python + # Now we pretend to train a model. + num_iterations = 10 + # Seeding your RNG is always a good idea for better reproducibility. + rng = random.Random(x=7) + for step in range(num_iterations): + # We can store information about the experiment while it's being + # run. + tracker.log( + { + "step": step, + "train_error": rng.random(), + }, + ) +``` + ### ActionBuilder An [`ActionBuilder`][aeromancy.action_builder.ActionBuilder] -(`src/aerodemo/action_builder.py`) is responsible for -constructing a dependency graph of [`Action`][aeromancy.action.Action]s. +(`src/aerodemo/action_builder.py`) is responsible for constructing a dependency +graph of [`Action`][aeromancy.action.Action]s. It will be able to receive +options from the command-line in `__init__`: -TODO: code walkthrough +```python + def __init__( + self, + learning_rate: float, + ): + """Create an `ActionBuilder` for aerodemo.""" + # The project name is for organizational purposes and will be the + # project name in Weights and Biases. + ActionBuilder.__init__(self, project_name="aerodemo") + + self.learning_rate = learning_rate +``` + +The main logic here happens in +[`build_actions`][aeromancy.ActionBuilder.build_actions], which constructs the +[`Action`][aeromancy.action.Action] objects we defined above. 
When we construct +an [`Action`][aeromancy.action.Action], we need to add it to a list using +[`self.add_action`][aeromancy.ActionBuilder.add_action]: + +!!! note + This API is likely to be simplified in the near future. + +```python + @override + def build_actions(self) -> list[Action]: + actions = [] + + # Build each Action in sequence. Note that we use the helper method + # add_action rather than appending to the list directly, since + # add_action needs to do some work behind the scenes. + ingest_action = self.add_action(actions, ExampleIngestAction(parents=[])) + train_action = self.add_action( + actions, + ExampleTrainAction( + ingest_dataset=ingest_action, + learning_rate=self.learning_rate, + ), + ) + self.add_action( + actions, + ExampleEvaluationAction( + ingest_dataset=ingest_action, + train_model=train_action, + ), + ) + return actions +``` ### AeroMain -`src/main.py`, typically referred to as AeroMain is the main entry point to an -Aeromancy project, responsible for determining configuration options, -constructing an [`ActionBuilder`][aeromancy.action_builder.ActionBuilder], and -launching it. +`src/main.py`, typically referred to as **AeroMain**, is the command-line entry +point to an Aeromancy project, responsible for determining configuration +options, constructing an +[`ActionBuilder`][aeromancy.action_builder.ActionBuilder], and launching it. By +default, Aeromancy will always look for AeroMain in `src/main.py`. -TODO: code walkthrough +It uses [Click](https://click.palletsprojects.com/) for option parsing and +Aeromancy provides a bundle of its own options in +[`@aeromancy_click_options`][aeromancy.click_options.aeromancy_click_options]. +Using [`rich.console`](https://rich.readthedocs.io/en/stable/console.html) for +console logging is optional. 
+ +```python +@click.command() +@click.option( + "-l", + "--learning-rate", + metavar="FLOAT", + default=1e-3, + type=float, + help="Learning rate in optimizer.", +) +# We also need to include a list of standard Aeromancy options. +@aeromancy_click_options +# Make sure to include any new options we created as arguments to aeromain. +def aeromain( + learning_rate: float, + **aeromancy_options, +): + """CLI application for controlling aerodemo.""" +``` + +Within the `aeromain()` function, we construct an +[`ActionBuilder`][aeromancy.action_builder.ActionBuilder] (you can use more than +one if you have several similar pipelines in the same experiment), then convert +it to an [`ActionRunner`][aeromancy.action_runner.ActionRunner] and run the +actions: + +```python + config = {"learning_rate": learning_rate} + console.log("Config parameters from CLI:", config) + + # This builds our Action dependency graph given the configuration passed in. + action_builder = ExampleActionBuilder(**config) + # We create a corresponding runner to execute the dependency graph and kick + # it off. + action_runner = action_builder.to_runner() + action_runner.run_actions(**aeromancy_options) +``` ## Running our first experiments -TODO: `pdm go` etc. +Aeromancy projects all include standard scripts for running Aeromancy. The main +script is called `go`, which runs AeroMain. For the Quick Start, we'll use +development mode with the `--dev` flag. + +!!! info + **Development mode** makes it easy to test and develop pipelines quickly. + It lets you run uncommitted code outside of a Docker container and Weights + and Biases to speed up the developer loop. It will attempt to read artifacts + from S3, so it doesn't work completely offline (unless you already have the + artifacts cached from previous development mode runs). Its behavior is very + close to "production" mode with the main exception that it is not + necessarily using the same artifact versions.
+ +### Listing available [`Action`][aeromancy.action.Action]s + +Let's start by listing all the +[`Action`][aeromancy.action.Action]s with `--list`: + +```bash +pdm go --dev --list +``` + +You should see something like this: + +```bash +[12:00:00] Running 'pdm run python src/main.py --list' +[12:00:01] Config parameters from CLI: + {'learning_rate': 0.001} +[ingest-dataset] example-dataset +[train-model] example-model +[eval-model] example-model-predictions +``` + +We can see the results of our `console.log` statement with the default value for +the learning rate parameter. This is followed by a list of all +[`Action`][aeromancy.action.Action]s our +[`ActionBuilder`][aeromancy.action_builder.ActionBuilder] built. The `job_type` +is shown in brackets, followed by a list of output artifacts. + +### Running the pipeline + +Assuming we're happy with the [`Action`][aeromancy.action.Action]s, we can run +them all by omitting `--list`: + +```bash +pdm go --dev +``` + +You should see it run each [`Action`][aeromancy.action.Action] in sequence. +Don't worry if it's overwhelming at first. Because we're running in development +mode, we're using a [fake tracker][aeromancy.fake_tracker.FakeTracker] instead +of the production Weights and Biases tracker, so you'll see a lot of messages +from it about what would happen if we were running in production mode. + +### Job selection + +Sometimes (in our experience, often) we don't want to run the entire pipeline. +To run just some of the jobs, pass the `--only` flag with a substring. Aeromancy +will then only run jobs whose names include that substring. You can pass it a +comma-separated list. Note that names include the `job_type` as well. + +!!!
example + + - If you pass `--only train`, it will just run `ExampleTrainAction` + + - If you pass `--only model`, it will run `ExampleTrainAction` then + `ExampleEvaluationAction` (since the latter depends on the former) + + - If you pass `--only dataset,train`, it will run `ExampleIngestAction` then + `ExampleTrainAction` ## What's next? We've gone through all the main components you'll need to define to run -experiments in Aeromancy. Next up, you might want to: +experiments in Aeromancy and how to run them in development mode. Next up, you +might want to: - [Configure](setup.md) Aeromancy to work with Weights and Biases and - S3-compatible blob stores -- TODO: Developing and Debugging in Aeromancy (`bailout`, `--debug`, common + S3-compatible blob stores (production mode) +- (To be documented) Developing and Debugging (`bailout`, `--debug`, common pitfalls, `aeroset`, `aeroview`, `rerun` commands) - [Customizing](customizing.md) your Aeromancy project -- TODO: best practices +- (To be documented) Best practices and FAQ +- (To be documented) Debugging Aeromancy itself (for Aeromancy developers) diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml index 599d6d5..bc54f9e 100644 --- a/docs/mkdocs.yml +++ b/docs/mkdocs.yml @@ -24,6 +24,8 @@ theme: font: text: Open Sans code: Fira Code + features: + - content.code.copy markdown_extensions: - admonition From 86bff9f1c86181838d238bfc66864df3940ae83e Mon Sep 17 00:00:00 2001 From: David McClosky Date: Fri, 23 Feb 2024 11:16:26 -0500 Subject: [PATCH 3/3] Allow AEROMANCY_AWS_REGION to be unset --- src/aeromancy/s3.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/aeromancy/s3.py b/src/aeromancy/s3.py index c2dbc63..85094d2 100644 --- a/src/aeromancy/s3.py +++ b/src/aeromancy/s3.py @@ -361,7 +361,7 @@ def from_env_variables(cls): _S3_CLIENT = cls( aws_access_key_id=os.environ["AEROMANCY_AWS_ACCESS_KEY_ID"], aws_secret_access_key=os.environ["AEROMANCY_AWS_SECRET_ACCESS_KEY"], - 
region_name=os.environ["AEROMANCY_AWS_REGION"], + region_name=os.environ.get("AEROMANCY_AWS_REGION", ""), endpoint_url=os.environ["AEROMANCY_AWS_S3_ENDPOINT_URL"], ) return _S3_CLIENT
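Note on the `s3.py` hunk above: replacing `os.environ[...]` with `os.environ.get(..., "")` makes the region optional while the other three variables stay required. The snippet below is a minimal sketch of that difference under stated assumptions, not Aeromancy's actual code. `read_s3_settings` is a hypothetical helper introduced only for illustration, and it takes a plain mapping so it can be tried without touching the real environment.

```python
from collections.abc import Mapping


def read_s3_settings(env: Mapping[str, str]) -> dict[str, str]:
    """Hypothetical helper (not part of Aeromancy) mirroring the lookup pattern."""
    return {
        # Required settings: indexing raises KeyError when the variable is unset.
        "aws_access_key_id": env["AEROMANCY_AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AEROMANCY_AWS_SECRET_ACCESS_KEY"],
        "endpoint_url": env["AEROMANCY_AWS_S3_ENDPOINT_URL"],
        # Optional setting: .get() falls back to an empty string when unset.
        "region_name": env.get("AEROMANCY_AWS_REGION", ""),
    }
```

Passing `os.environ` as `env` would read the real process environment. Whether an empty region string is acceptable depends on the S3-compatible backend in use, so that part is an assumption to verify against your own endpoint.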