Merged
Changes from 5 commits
18 changes: 18 additions & 0 deletions .github/workflows/deploy-book.yml
@@ -165,6 +165,9 @@ jobs:
cat EXTRA_CACHE_VARS.txt

- name: Cache page build
# The id lets later steps check whether a build is required, using: if steps.cache-html.outputs.cache-hit != 'true'
# A build is required if the book contents or the requirements file change, or if the status of
# a branch as archived or preprocessed changes (`BRANCHES_TO_PREPROCESS` or `BRANCHES_ARCHIVED`).
id: cache-html
uses: actions/cache@v4
with:
@@ -183,13 +186,28 @@ jobs:
run: |
echo "WEEK=$(date +%V)" >> $GITHUB_ENV

# If a build is required, load the cached venv, if available.
# A cached environment is loaded if the requirements.txt file is unchanged, which is checked via the file hash in the key.
# Note: no id is needed here because later steps never check whether the cached environment was found; a new environment
# is only needed when a build is required, and that is already checked with id: cache-html, as described above.
- if: ${{ steps.cache-html.outputs.cache-hit != 'true' }}
name: Cache virtualenv
uses: actions/cache@v4
with:
key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('requirements.txt') }}-week${{ env.WEEK }}-${{ github.repository }}
path: .venv

# Dependencies are always checked if a build is required, regardless of whether a cached environment is used.
# Both cache situations are handled by: python -m venv .venv
# - If a cached environment was loaded, .venv is reused;
# - If not, a new venv is created.
# Note that when running pip install -r requirements.txt on an existing cached venv:
# - Packages are checked for compatibility, and a package newly added to requirements.txt is installed.
# - Already-installed packages that still satisfy requirements.txt ARE NOT UPGRADED AUTOMATICALLY,
# EVEN IF A NEWER VERSION EXISTS.
# - As a result, compared to the venv on a newer branch, a frequently used branch (whose cache has
# not expired) may retain outdated packages long after an updated version is available.
# Recommendation: pin version numbers explicitly.
- if: ${{ steps.cache-html.outputs.cache-hit != 'true' }}
name: Install dependencies
run: |
53 changes: 53 additions & 0 deletions README.md
@@ -171,5 +171,58 @@ Here's an example for a summary for the template book:
> BRANCHES_ARCHIVED= (default value used)
> ```

## Book Build

_WIP_

## Caching

The GitHub Action [Cache](https://github.com/actions/cache) is used in the Deploy Book Workflow (DBW) to save time when building the book. This is accomplished by caching two sets of files for each branch: 1) the Python virtual environment, and 2) the build artifact (the HTML files forming each book website).

The reason for doing this stems from the primary purpose of the Deploy Book Workflow, which builds a separate version of the book for each branch and creates a Python virtual environment from the `requirements.txt` file in the repository. A unique environment for each branch is necessary to test changes to book dependencies and configuration in isolation, but it can dramatically increase the time required to build all versions of the book when the DBW runs. Building the book itself also takes time. For this reason, both sets of files are cached (although at the moment, most of the time savings come from caching the build artifact).

This works by hashing a specific set of files and including that in the filename of the cache. Every time the workflow is triggered, a new hashed filename is created and compared to the list of existing caches to see if one with the same name exists; if so, the cache is reused. If not, a new environment and/or build artifact is constructed and a new cache is made.
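
The naming scheme can be sketched in Python as follows. This is only an illustrative model: in the workflow, the key is assembled by `actions/cache` using the `hashFiles` expression, whose exact hashing procedure is not reproduced here. The function names `hash_files`, `venv_cache_key` and `lookup` are hypothetical and exist only for this example.

```python
import hashlib
from pathlib import Path


def hash_files(paths):
    """Return one SHA-256 hex digest covering the contents of all given files.

    Simplified stand-in for the `hashFiles` expression (assumption: any
    content change should produce a different digest).
    """
    digest = hashlib.sha256()
    for path in sorted(Path(p) for p in paths):
        digest.update(path.read_bytes())
    return digest.hexdigest()


def venv_cache_key(runner_os, python_version, requirements_hash, week, repository):
    """Mimic the layout of the venv cache key used in the workflow file."""
    return f"venv-{runner_os}-{python_version}-{requirements_hash}-week{week}-{repository}"


def lookup(key, existing_caches):
    """Reuse a cache only if one with exactly the same key already exists."""
    return "reuse" if key in existing_caches else "rebuild"
```

Because the hash is part of the key, editing `requirements.txt` produces a key that matches no existing cache, so `lookup` reports `"rebuild"`; an unchanged file reproduces the old key and the cache is reused.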

Cached files can be found in GitHub under the Actions tab, in the "Caches" section of the "Management" pane on the left-hand side of the screen. For example, this link shows the [caches for the TeachBooks Manual](https://github.com/TeachBooks/manual/actions/caches). If needed, caches can be deleted manually from this page. A cache will expire if unused for longer than 1 week or if the maximum allowed disk space for all cached files is exceeded (both are GitHub policies).

Files created during a GitHub Action are called [artifacts](https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/storing-and-sharing-data-from-a-workflow). Although an artifact is _not_ the same thing as a cache, the files in a specific cache or artifact are identical. Artifacts have their own expiration period, which can be [customized in the repository settings](https://docs.github.com/en/organizations/managing-organization-settings/configuring-the-retention-period-for-github-actions-artifacts-and-logs-in-your-organization) (default is 90 days). Unexpired artifacts can be viewed at the bottom of any workflow run summary that is available on the Actions tab of a GitHub repository.

To add even more confusion to the situation, if an artifact built during an Action is a set of HTML files (a website) that is used to create a GitHub Pages website, the files served when visiting the URL of that website are stored on a webserver; they are _not_ the artifact files, but are identical to them. Fortunately, webserver files are preserved indefinitely (or until GitHub changes this policy), meaning that even if your cache or artifact is deleted, the website will remain active.

The cache is a key component of the DBW and leads to several considerations for the building and maintenance of a book, which are explained below, after the criteria for each cache type are described.

### Cached Environment

The DBW uses [Python virtual environments](https://docs.python.org/3/library/venv.html) (`python -m venv .venv`) and `python -m pip install -r requirements.txt` to install packages and create the book building environment. The entire `.venv` directory is preserved in the cache with a file name beginning with `venv-...`.

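Once created during a successful build action, an existing cached environment will be found and reused unless two specific criteria are met: 1) a book build is required (replacing the cached build artifact), _and_ 2) the `requirements.txt` file is changed. Requirements for a "book build" are described in the next section. Note that the cache key in the workflow file also contains the Python version and the week number, so when a build is required, a cached environment created with a different Python version or in a different week will not be matched either.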

Note that any change to the `requirements.txt` file will trigger creation of a new environment, not only a change to the version number of a package.

### Cached Build Artifacts

Build artifacts are the HTML files that define the website (i.e., the book) and are typically located in the subdirectory `book/_build/html` of a repository once the book is built. Note that these files are typically only visible if you are building the book locally, as they are not committed to the repository by default (they are excluded via `.gitignore`). For users only using the DBW and the GitHub browser editor, downloading the build cache is the easiest way to view these files; the cache has a filename beginning with `html-build-<branch-name>-...`.

Once created during a successful build action, an existing cached build artifact will be found and reused unless _any_ of three specific criteria are met: 1) a change is made to a file in `book/`, 2) the `requirements.txt` file is changed, or 3) the status of a branch as archived or preprocessed has changed (`BRANCHES_TO_PREPROCESS` or `BRANCHES_ARCHIVED`).
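
The reuse criteria for the two caches, as described in this README, can be summarized in a short Python sketch. It is a simplified model rather than the workflow's actual implementation: the function names are hypothetical, and cache expiry and any other components of the cache keys are ignored.

```python
def build_required(book_changed, requirements_changed, branch_status_changed):
    """A new book build is needed if ANY of the three criteria is met."""
    return book_changed or requirements_changed or branch_status_changed


def new_environment_required(build_needed, requirements_changed):
    """A new venv is needed only if a build is required AND requirements.txt changed."""
    return build_needed and requirements_changed


# Example: only a file in book/ was edited.
build = build_required(book_changed=True, requirements_changed=False,
                       branch_status_changed=False)
new_env = new_environment_required(build, requirements_changed=False)
# Here `build` is True while `new_env` is False:
# the book is rebuilt using the cached environment.
```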

Immediately after a successful build action, the book website files exist in three places: cache, artifact and on a GitHub webserver. However, as described at the beginning of this section, the cache will be automatically deleted after 1 week, if unused, and the artifact will be deleted based on the setting in your repository (90 days by default).
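
As a rough illustration of these lifetimes, the sketch below lists where a copy of the book files can still be expected after a given amount of time. It assumes the default 90-day artifact retention, ignores manual deletion and the eviction of caches when the disk space limit is exceeded, and uses the hypothetical function name `copies_available`.

```python
from datetime import timedelta

CACHE_UNUSED_LIMIT = timedelta(days=7)           # cache expires if unused for longer
DEFAULT_ARTIFACT_RETENTION = timedelta(days=90)  # repository default, configurable


def copies_available(since_build, since_cache_use,
                     artifact_retention=DEFAULT_ARTIFACT_RETENTION):
    """Return the locations that should still hold the book files."""
    locations = ["webserver"]  # kept indefinitely under the current policy
    if since_cache_use <= CACHE_UNUSED_LIMIT:
        locations.append("cache")
    if since_build <= artifact_retention:
        locations.append("artifact")
    return locations


# 30 days after the build, with the cache untouched for those 30 days:
remaining = copies_available(timedelta(days=30), timedelta(days=30))
```

In this example the cache has expired, but the artifact and the published website are still available.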

### Effect of Caching on Book Workflow

The implementation of the DBW has several characteristics that help explain certain behaviors of the book build and website deployment process. These points may be useful to consider if you are experiencing undesired behavior in your Actions builds or your book websites.

- Remember that the DBW action will only run if a commit is made that changes `requirements.txt` or the contents of `book/`.
- When first creating a repository, the action may not run, or may run and fail; occasionally a new commit is needed to get the workflow to run for the first time (remember to modify something in `book/`), or you may try to re-run the job from the Actions tab.
- When using multiple branches, only the branch that is edited will be updated.
- The website for each branch may be built with a different Python environment (different package versions).

Unlike the virtual environments (next subsection), caches of the book build do not influence subsequent builds of the book, as this artifact is replaced whenever a commit to a specific branch triggers a new build process. As described above, old cached build artifacts are used if the build process from a new commit fails, ensuring that the URL of a website does not return a 404 page.

### Effect of Caching on Virtual Environment

In particular, note that older branches may have been built with cached environments that differ from those on a newer branch, leading to (undesired) differences in book appearance or functionality, often without the author being aware! This occurs when a new version of a package listed in `requirements.txt` is released in the time between the creation of environments on different branches. The first time a venv cache is created in the DBW, `pip` downloads the newest version that satisfies `requirements.txt`, but it will not always automatically update a package if `pip install -r requirements.txt` is used on an existing environment (this happens every time an existing cache is used!). This means that, when compared to the venv on a newer branch, a frequently used branch (where the cache has not expired) may retain old and out-of-date packages long after an updated version is available. A best practice to avoid this situation is to pin version numbers explicitly (e.g., `teachbooks==0.2.0`) if you want book builds to remain consistent. In addition, a feature like Dependabot can be used to automatically notify you when an update is made; this is [described in the TeachBooks Manual](https://teachbooks.io/manual/features/update_env.html).
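
To check whether a `requirements.txt` file follows this recommendation, a small helper like the one below can list the entries that are not pinned with `==`. This is a simplified sketch and not part of the DBW: the function name `unpinned_requirements` is hypothetical, and the parsing ignores environment markers, URLs and option lines such as `-r other.txt`.

```python
import re

# Matches a requirement that pins an exact version, e.g. "teachbooks==0.2.0".
PINNED = re.compile(r"^[A-Za-z0-9._\[\],-]+\s*==\s*[^\s;]+")


def unpinned_requirements(requirements_text):
    """Return the requirement lines that do not pin an exact version."""
    unpinned = []
    for raw_line in requirements_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):      # skip blanks and pip options
            continue
        if not PINNED.match(line):
            unpinned.append(line)
    return unpinned


example = "teachbooks==0.2.0\njupyter-book\n# a comment\nsphinx>=5\n"
floating = unpinned_requirements(example)
```

For the example text, `floating` contains `jupyter-book` and `sphinx>=5`: these are the packages whose installed version may differ between a freshly created environment and an older cached one.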

## Contribute
This tool's repository is stored on [GitHub](https://github.com/TeachBooks/deploy-book-workflow). The `README.md` of the branch `manual_docs` is also part of the [TeachBooks manual](https://teachbooks.io/manual/external/deploy-book-workflow/README.html) as a submodule. If you'd like to contribute, you can create a fork and open a pull request on the [GitHub repository](https://github.com/TeachBooks/deploy-book-workflow). To update the `README.md` shown in the TeachBooks manual, create a fork and open a pull request for the [GitHub repository of the manual](https://github.com/TeachBooks/manual). If you intend to clone the manual including its submodules, clone using: `git clone --recurse-submodules git@github.com:TeachBooks/manual.git`.

Future improvements to the DBW may address the ability for a user to specify additional aspects of the build process, for example, the Python version or the specific book build commands (e.g., `teachbooks build` or `jupyter-book build`). If this is something that interests you, please create an Issue in the repository (perhaps with "feature request" in the title).