Add documentation for deploying packaged Kedro projects on Databricks (#2595)

* Add deployment workflow page

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add table of contents, entry point guide, data and conf upload guide

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add detailed instructions for creating a job on Databricks

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add images and automated deployment resources

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove use of 'allows', add summary

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove link to missing image

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add deployment workflow to toctree

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint and fix missing link

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Minor style, syntax and grammar improvements

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fixes for correctness during validation

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add instructions for creating log output location

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint databricks_run

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Minor wording change in reference to logs

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify reference to Pyspark-Iris

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Fix linter errors to enable docs build for inspection

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update build-docs.sh

* Fix broken link

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Remove spurious word

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Add advantages subheading

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Add alternative ways to upload data to DBFS

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move note on unpackaged config and data

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix broken links

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move databricks back into deployment section

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Remove references to PySpark Iris (pyspark-iris) starter

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Graphics links fixes, revise titles

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Fix broken internal link

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Fix links broken by new folder

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Remove logs directory

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update image of final job configuration

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add full stops in list.

Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Fix conda environment name.

Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Modify wording and image for creating a new job cluster

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify wording in guide to create new job cluster

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove --upgrade option

Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Add both ways of creating a new job

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
3 people authored Jun 1, 2023
1 parent 2fa1478 commit c3c55ac
Showing 16 changed files with 351 additions and 50 deletions.
2 changes: 1 addition & 1 deletion docs/source/contribution/development_for_databricks.md
@@ -5,7 +5,7 @@ Many Kedro users deploy their projects to [Databricks](https://www.databricks.co
## How to deploy a development version of Kedro to Databricks

```{note}
-This page is for **contributors** developing changes to Kedro that need to test them on Databricks. If you are a Kedro user working on an individual or team project and need more information about workflows, consult the [documentation for developing a Kedro project on Databricks](../integrations/databricks_workspace.md).
+This page is for **contributors** developing changes to Kedro that need to test them on Databricks. If you are a Kedro user working on an individual or team project and need more information about workflows, consult the [documentation pages for developing a Kedro project on Databricks](../deployment/databricks/index.md).
```

## Prerequisites
4 changes: 0 additions & 4 deletions docs/source/deployment/databricks.md

This file was deleted.

315 changes: 315 additions & 0 deletions docs/source/deployment/databricks/databricks_deployment_workflow.md

Large diffs are not rendered by default.

docs/source/{integrations → deployment/databricks}/databricks_development_workflow.md
@@ -33,7 +33,7 @@ Note your Databricks **username** and **host** as you will need it for the remai

Find your Databricks username in the top right of the workspace UI and the host in the browser's URL bar, up to the first slash (e.g., `https://adb-123456789123456.1.azuredatabricks.net/`):

-![Find Databricks host and username](../meta/images/find_databricks_host_and_username.png)
+![Find Databricks host and username](../../meta/images/find_databricks_host_and_username.png)

```{note}
Your Databricks host must include the protocol (`https://`).
Expand Down Expand Up @@ -90,7 +90,7 @@ Create a new repo on Databricks by navigating to `New` tab in the Databricks wor

In this guide, you will not sync your project with a remote Git provider, so uncheck `Create repo by cloning a Git repository` and enter `iris-databricks` as the name of your new repository:

-![Create a new repo on Databricks](../meta/images/databricks_repo_creation.png)
+![Create a new repo on Databricks](../../meta/images/databricks_repo_creation.png)

### Sync code with your Databricks repo using dbx

Expand Down Expand Up @@ -128,15 +128,15 @@ Kedro requires your project to have a `conf/local` directory to exist to success

Open the Databricks workspace UI and, using the panel on the left, navigate to `Repos -> <databricks_username> -> iris-databricks -> conf`, then right-click and select `Create -> Folder` as in the image below:

-![Create a conf folder in Databricks repo](../meta/images/databricks_conf_folder_creation.png)
+![Create a conf folder in Databricks repo](../../meta/images/databricks_conf_folder_creation.png)

Name the new folder `local`. In this guide, we have no local credentials to store and so we will leave the newly created folder empty. Your `conf/local` and `local` directories should now look like the following:

-![Final conf folder](../meta/images/final_conf_folder.png)
+![Final conf folder](../../meta/images/final_conf_folder.png)

### Upload project data to DBFS

-When run on Databricks, Kedro cannot access data stored in your project's directory. Therefore, you will need to upload your project's data to an accessible location. In this guide, we will store the data on the Databricks File System (DBFS). The PySpark Iris starter contains an environment that is set up to access data stored in DBFS (`conf/databricks`). To learn more about environments in Kedro configuration, see the [configuration documentation](../configuration/configuration_basics.md#configuration-environments).
+When run on Databricks, Kedro cannot access data stored in your project's directory. Therefore, you will need to upload your project's data to an accessible location. In this guide, we will store the data on the Databricks File System (DBFS). The PySpark Iris starter contains an environment that is set up to access data stored in DBFS (`conf/databricks`). To learn more about environments in Kedro configuration, see the [configuration documentation](../../configuration/configuration_basics.md#configuration-environments).

There are several ways to upload data to DBFS. In this guide, we recommend the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/dbfs-cli.html) for its convenience. At the command line in your local environment, use the following Databricks CLI command to upload your locally stored data to DBFS:
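(The hunk cuts off before the command itself. For reference, a typical invocation of the Databricks CLI looks like the sketch below; the `dbfs:/FileStore/iris-databricks/data` target path is an assumption, not taken from the diff.)

```bash
# Recursively copy the project's local data/ folder to DBFS.
# The target path below is illustrative; adjust it to your project.
databricks fs cp --recursive data/ dbfs:/FileStore/iris-databricks/data
```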

Expand Down Expand Up @@ -169,7 +169,7 @@ Now that your project is available on Databricks, you can run it on a cluster us

To run the Python code from your Databricks repo, [create a new Python notebook](https://docs.databricks.com/notebooks/notebooks-manage.html#create-a-notebook) in your workspace. Name it `iris-databricks` for traceability and attach it to your cluster:

-![Create a new notebook on Databricks](../meta/images/databricks_notebook_creation.png)
+![Create a new notebook on Databricks](../../meta/images/databricks_notebook_creation.png)

### Run your project

Expand Down Expand Up @@ -201,15 +201,15 @@ session.run()

After completing these steps, your notebook should match the following image:

-![Databricks completed notebook](../meta/images/databricks_finished_notebook.png)
+![Databricks completed notebook](../../meta/images/databricks_finished_notebook.png)

Run the completed notebook using the `Run All` button in the top right of the UI:

-![Databricks notebook run all](../meta/images/databricks_run_all.png)
+![Databricks notebook run all](../../meta/images/databricks_run_all.png)

On your first run, you will be prompted to consent to analytics; type `y` or `N` in the field that appears and press `Enter`:

-![Databricks notebook telemetry consent](../meta/images/databricks_telemetry_consent.png)
+![Databricks notebook telemetry consent](../../meta/images/databricks_telemetry_consent.png)

You should see logging output while the cell is running. After execution finishes, you should see output similar to the following:

docs/source/{integrations → deployment/databricks}/databricks_visualisation.md
@@ -1,6 +1,6 @@
# Visualise a Kedro project in Databricks notebooks

-[Kedro-Viz](../visualisation/kedro-viz_visualisation.md) is a tool that enables you to visualise your Kedro pipeline and metrics generated from your data science experiments. It is a standalone web application that runs on a web browser, it can be run on a local machine or in Databricks notebooks.
+[Kedro-Viz](../../visualisation/kedro-viz_visualisation.md) is a tool that enables you to visualise your Kedro pipeline and metrics generated from your data science experiments. It is a standalone web application that runs on a web browser, it can be run on a local machine or in Databricks notebooks.

For Kedro-Viz to run with your Kedro project, you need to ensure that both packages are installed in the same scope (notebook-scoped vs. cluster library). This means that if you `%pip install kedro` from inside your notebook, you should also `%pip install kedro-viz` from inside your notebook.
If your cluster already comes with Kedro installed as a library, you should also add Kedro-Viz as a [cluster library](https://docs.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries).
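For instance, a notebook-scoped setup that keeps both packages in the same scope is simply:

```python
# Install both packages at notebook scope so Kedro-Viz and Kedro share one environment.
%pip install kedro
%pip install kedro-viz
```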
@@ -15,8 +15,8 @@ Kedro-Viz can then be launched in a new browser tab with the `%run_viz` line mag

This command presents you with a link to the Kedro-Viz web application.

-![databricks_viz_link](../meta/images/databricks_viz_link.png)
+![databricks_viz_link](../../meta/images/databricks_viz_link.png)

Clicking this link opens a new browser tab running Kedro-Viz for your project.

-![databricks_viz_demo](../meta/images/databricks_viz_demo.png)
+![databricks_viz_demo](../../meta/images/databricks_viz_demo.png)
docs/source/{integrations → deployment/databricks}/databricks_workspace.md
@@ -1,13 +1,13 @@
-# Develop a project with Databricks Workspace and Notebooks
+# Databricks notebooks workflow

This tutorial uses the [PySpark Iris Kedro Starter](https://github.com/kedro-org/kedro-starters/tree/main/pyspark-iris) to illustrate how to bootstrap a Kedro project using Spark and deploy it to a [Databricks cluster on AWS](https://databricks.com/aws).

```{note}
-If you are using [Databricks Repos](https://docs.databricks.com/repos/index.html) to run a Kedro project then you should [disable file-based logging](../logging/logging.md#disable-file-based-logging). This prevents Kedro from attempting to write to the read-only file system.
+If you are using [Databricks Repos](https://docs.databricks.com/repos/index.html) to run a Kedro project then you should [disable file-based logging](../../logging/logging.md#disable-file-based-logging). This prevents Kedro from attempting to write to the read-only file system.
```

```{note}
-If you are a Kedro contributor looking for information on deploying a custom build of Kedro to Databricks, see the [development guide](../contribution/development_for_databricks.md).
+If you are a Kedro contributor looking for information on deploying a custom build of Kedro to Databricks, see the [development guide](../../contribution/development_for_databricks.md).
```

## Prerequisites
@@ -144,11 +144,11 @@ The project has now been pushed to your private GitHub repository, and in order
3. Press `Edit`
4. Go to the `Advanced Options` and then `Spark`

-![](../meta/images/databricks_cluster_edit.png)
+![](../../meta/images/databricks_cluster_edit.png)

Then, in the `Environment Variables` section, add your `GITHUB_USER` and `GITHUB_TOKEN` as shown in the picture:

-![](../meta/images/databricks_cluster_env_vars.png)
+![](../../meta/images/databricks_cluster_env_vars.png)
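(For reference, the two entries added in that dialog take the following shape; the placeholder values are illustrative.)

```bash
# Spark environment variables on the cluster; substitute your own values.
GITHUB_USER=<your-github-username>
GITHUB_TOKEN=<your-personal-access-token>
```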


```{note}
@@ -227,16 +227,16 @@ You should get a similar output:

Your complete notebook should look similar to this (the results are hidden):

-![](../meta/images/databricks_notebook_example.png)
+![](../../meta/images/databricks_notebook_example.png)


### 9. Using the Kedro IPython Extension

You can interact with Kedro in Databricks through the Kedro [IPython extension](https://ipython.readthedocs.io/en/stable/config/extensions/index.html), `kedro.ipython`.

-The Kedro IPython extension launches a [Kedro session](../kedro_project_setup/session.md) and makes available the useful Kedro variables `catalog`, `context`, `pipelines` and `session`. It also provides the `%reload_kedro` [line magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that reloads these variables (for example, if you need to update `catalog` following changes to your Data Catalog).
+The Kedro IPython extension launches a [Kedro session](../../kedro_project_setup/session.md) and makes available the useful Kedro variables `catalog`, `context`, `pipelines` and `session`. It also provides the `%reload_kedro` [line magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that reloads these variables (for example, if you need to update `catalog` following changes to your Data Catalog).

-The IPython extension can be used in a Databricks notebook in a similar way to how it is used in [Jupyter notebooks](../notebooks_and_ipython/kedro_and_notebooks.md).
+The IPython extension can be used in a Databricks notebook in a similar way to how it is used in [Jupyter notebooks](../../notebooks_and_ipython/kedro_and_notebooks.md).

If you encounter a `ContextualVersionConflictError`, it is likely caused by Databricks using an old version of `pip`. There is therefore one additional step to perform in the Databricks notebook before you can use the IPython extension. After you load the IPython extension using the command below:

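(The command is cut off by the truncated hunk; for the `kedro.ipython` extension named above, it is presumably:)

```python
%load_ext kedro.ipython
```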
11 changes: 11 additions & 0 deletions docs/source/deployment/databricks/index.md
@@ -0,0 +1,11 @@
# Databricks


```{toctree}
:maxdepth: 1
databricks_workspace.md
databricks_visualisation
databricks_development_workflow
databricks_deployment_workflow
```
4 changes: 2 additions & 2 deletions docs/source/deployment/index.md
@@ -30,7 +30,7 @@ The following pages provide information for deployment to, or integration with,
* [AWS Step functions](aws_step_functions.md)
* [Azure](azure.md)
* [Dask](dask.md)
-* [Databricks](../integrations/databricks_workspace.md)
+* [Databricks](./databricks/index.md)
* [Kubeflow Workflows](kubeflow.md)
* [Prefect](prefect.md)
* [Vertex AI](vertexai.md)
@@ -55,7 +55,7 @@ amazon_sagemaker
aws_step_functions
azure
dask
-databricks
+databricks/index
kubeflow
prefect
vertexai
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -125,7 +125,7 @@ Welcome to Kedro's documentation!
.. toctree::
:maxdepth: 2

-integrations/index.md
+integrations/pyspark_integration.md

.. toctree::
:maxdepth: 2
21 changes: 0 additions & 21 deletions docs/source/integrations/index.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/integrations/pyspark_integration.md
@@ -1,4 +1,4 @@
-# Build a Kedro pipeline with PySpark
+# PySpark integration

This page outlines some best practices when building a Kedro pipeline with [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html). It assumes a basic understanding of both Kedro and `PySpark`.

5 binary image files changed (not shown).
