This GitHub Actions workflow automates building a Python package as a wheel, uploading it to DBFS, and installing it on a Databricks cluster. It can also update a Databricks repo. Each step does the following:
- Checkout: Retrieve the source code from the repository.
- Set up Python 3.9: Prepare the runner environment with Python 3.9, and cache the dependencies to speed up future builds.
- Build wheel: Build the Python package as a wheel file (a distributable package format).
- Install Databricks CLI: Install the Databricks command-line interface (CLI) to interact with the Databricks environment.
- Copy wheel to DBFS: Upload the built wheel file to the Databricks File System (DBFS) under the `/libraries` folder.
- Install wheel on cluster: Install the uploaded wheel file on the specified Databricks cluster.
- Update Databricks repo (optional): Update the Databricks repository with the latest changes from the main branch.
The workflow is triggered by a push event or manually using the workflow_dispatch event.
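
As a rough sketch of what the build, CLI install, upload, and cluster install steps amount to, the snippet below assembles the corresponding commands as argument lists without running them. It assumes the legacy `databricks-cli` package installed with pip and its `fs cp` and `libraries install` subcommands; the newer Databricks CLI uses different syntax. The `deploy_commands` helper, the wheel file name, and the cluster ID are hypothetical placeholders, and `pip wheel` stands in for whatever build command the actual workflow uses.

```python
from pathlib import PurePosixPath


def deploy_commands(wheel_path: str, cluster_id: str,
                    dbfs_dir: str = "dbfs:/libraries") -> list[list[str]]:
    """Return the commands for build, CLI install, upload, and cluster install."""
    wheel_name = PurePosixPath(wheel_path).name
    dbfs_target = f"{dbfs_dir}/{wheel_name}"
    return [
        # Build wheel (assumption: the project builds with `pip wheel`).
        ["python", "-m", "pip", "wheel", ".", "--no-deps", "--wheel-dir", "dist"],
        # Install Databricks CLI (assumption: the legacy pip package).
        ["python", "-m", "pip", "install", "databricks-cli"],
        # Copy wheel to DBFS under the /libraries folder.
        ["databricks", "fs", "cp", wheel_path, dbfs_target, "--overwrite"],
        # Install wheel on the target cluster.
        ["databricks", "libraries", "install",
         "--cluster-id", cluster_id, "--whl", dbfs_target],
    ]


if __name__ == "__main__":
    # Hypothetical wheel name and cluster ID, for illustration only.
    for cmd in deploy_commands("dist/my_package-0.1.0-py3-none-any.whl",
                               "0123-456789-abcde123"):
        print(" ".join(cmd))
```

Returning argument lists rather than shell strings keeps the sketch free of side effects; each list could be passed to `subprocess.run(cmd, check=True)` on a machine where `DATABRICKS_HOST` and `DATABRICKS_TOKEN` are set.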
flowchart TD
subgraph "Databricks Environment"
dbfs[(DBFS)]
cluster{{Compute Cluster}}
end
subgraph "GitHub Environment"
subgraph "Runner"
A[1. Checkout Repository] --> B[2. Set up Python 3.9]
B --> C[3. Build wheel]
C --> D[4. Install Databricks CLI]
D --> E[5. Copy wheel to DBFS]
E --> F[6. Install wheel on cluster]
F --> G[7. End Workflow]
end
end
E -.-> dbfs
F -.-> cluster
- Create a well-formed Python package in a repo on GitHub
- Generate a Databricks personal access token (PAT); this is the `DATABRICKS_TOKEN` env var
- Identify your Databricks workspace URL; this is the `DATABRICKS_HOST` env var (`https://<instance-name>.cloud.databricks.com`)
- Install the Databricks CLI locally
- Identify the relevant Databricks cluster ID (where you want to install the package) using the CLI: `databricks clusters list`
- (Optional) Identify the relevant Databricks repo ID (corresponding to the repo for which you are building the package) using the CLI: `databricks repos list`
- Create the following GitHub repository secrets in your repo:
  - `DATABRICKS_HOST`
  - `DATABRICKS_TOKEN`
  - `DATABRICKS_REPO_ID` (optional; remove the "Update Databricks repo" step if not relevant)
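
Before storing the values as secrets, it can help to sanity-check them locally. The sketch below reads `DATABRICKS_HOST` and `DATABRICKS_TOKEN` from the environment, builds (but does not send) a request to the Clusters API list endpoint, and looks up a cluster ID by name in a response of that endpoint's shape. The helper names and the example cluster name are hypothetical and not part of the workflow itself.

```python
import os
import urllib.request


def read_settings(env=os.environ) -> dict:
    """Collect the two required settings and fail early if one is missing."""
    missing = [k for k in ("DATABRICKS_HOST", "DATABRICKS_TOKEN") if not env.get(k)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    host = env["DATABRICKS_HOST"].rstrip("/")
    if not host.startswith("https://"):
        raise ValueError(
            "DATABRICKS_HOST should look like "
            "https://<instance-name>.cloud.databricks.com"
        )
    return {"host": host, "token": env["DATABRICKS_TOKEN"]}


def clusters_list_request(settings: dict) -> urllib.request.Request:
    """Build a GET request for the Clusters API list endpoint (not sent here)."""
    return urllib.request.Request(
        f"{settings['host']}/api/2.0/clusters/list",
        headers={"Authorization": f"Bearer {settings['token']}"},
    )


def cluster_id_by_name(response: dict, name: str) -> str:
    """Find a cluster ID in a parsed clusters/list response."""
    for cluster in response.get("clusters", []):
        if cluster.get("cluster_name") == name:
            return cluster["cluster_id"]
    raise LookupError(f"no cluster named {name!r}")
```

To query a real workspace, the request can be sent with `urllib.request.urlopen(clusters_list_request(read_settings()))` and the body parsed with `json.load` before passing it to `cluster_id_by_name`.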