Add documentation for deploying packaged Kedro projects on Databricks (#2595)

* Add deployment workflow page

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add table of contents, entry point guide, data and conf upload guide

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add detailed instructions for creating a job on Databricks

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add images and automated deployment resources

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove use of 'allows', add summary

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove link to missing image

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add deployment workflow to toctree

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint and fix missing link

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Minor style, syntax and grammar improvements

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fixes for correctness during validation

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add instructions for creating log output location

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Lint databricks_run

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Minor wording change in reference to logs

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify reference to Pyspark-Iris

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Fix linter errors to enable docs build for inspection

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update build-docs.sh

* Fix broken link

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Remove spurious word

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Add advantages subheading

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update docs/source/integrations/databricks_deployment_workflow.md

Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Add alternative ways to upload data to DBFS

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move note on unpackaged config and data

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Fix broken links

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Move databricks back into deployment section

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Remove references to PySpark Iris (pyspark-iris) starter

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Graphics links fixes, revise titles

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Fix broken internal link

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Fix links broken by new folder

Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>

* Remove logs directory

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Update image of final job configuration

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Add full stops in list.

Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Fix conda environment name.

Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Modify wording and image for creating a new job cluster

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Modify wording in guide to create new job cluster

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

* Remove --upgrade option

Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>

* Add both ways of creating a new job

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>

---------

Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Signed-off-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Co-authored-by: Juan Luis Cano Rodríguez <juan_luis_cano@mckinsey.com>
3 people authored Jun 1, 2023
1 parent 2fa1478 commit c3c55ac
Showing 16 changed files with 351 additions and 50 deletions.
2 changes: 1 addition & 1 deletion docs/source/contribution/development_for_databricks.md
@@ -5,7 +5,7 @@ Many Kedro users deploy their projects to [Databricks](https://www.databricks.co
## How to deploy a development version of Kedro to Databricks

```{note}
-This page is for **contributors** developing changes to Kedro that need to test them on Databricks. If you are a Kedro user working on an individual or team project and need more information about workflows, consult the [documentation for developing a Kedro project on Databricks](../integrations/databricks_workspace.md).
+This page is for **contributors** developing changes to Kedro that need to test them on Databricks. If you are a Kedro user working on an individual or team project and need more information about workflows, consult the [documentation pages for developing a Kedro project on Databricks](../deployment/databricks/index.md).
```

## Prerequisites
4 changes: 0 additions & 4 deletions docs/source/deployment/databricks.md

This file was deleted.

315 changes: 315 additions & 0 deletions docs/source/deployment/databricks/databricks_deployment_workflow.md

Large diffs are not rendered by default.

docs/source/{integrations → deployment/databricks}/databricks_development_workflow.md
@@ -33,7 +33,7 @@ Note your Databricks **username** and **host** as you will need it for the remai

Find your Databricks username in the top right of the workspace UI and the host in the browser's URL bar, up to the first slash (e.g., `https://adb-123456789123456.1.azuredatabricks.net/`):

-![Find Databricks host and username](../meta/images/find_databricks_host_and_username.png)
+![Find Databricks host and username](../../meta/images/find_databricks_host_and_username.png)

```{note}
Your Databricks host must include the protocol (`https://`).
Expand Down Expand Up @@ -90,7 +90,7 @@ Create a new repo on Databricks by navigating to `New` tab in the Databricks wor

In this guide, you will not sync your project with a remote Git provider, so uncheck `Create repo by cloning a Git repository` and enter `iris-databricks` as the name of your new repository:

-![Create a new repo on Databricks](../meta/images/databricks_repo_creation.png)
+![Create a new repo on Databricks](../../meta/images/databricks_repo_creation.png)

### Sync code with your Databricks repo using dbx

Expand Down Expand Up @@ -128,15 +128,15 @@ Kedro requires your project to have a `conf/local` directory to exist to success

Open the Databricks workspace UI and, using the panel on the left, navigate to `Repos -> <databricks_username> -> iris-databricks -> conf`, then right-click and select `Create -> Folder` as in the image below:

-![Create a conf folder in Databricks repo](../meta/images/databricks_conf_folder_creation.png)
+![Create a conf folder in Databricks repo](../../meta/images/databricks_conf_folder_creation.png)

Name the new folder `local`. In this guide, we have no local credentials to store and so we will leave the newly created folder empty. Your `conf/local` and `local` directories should now look like the following:

-![Final conf folder](../meta/images/final_conf_folder.png)
+![Final conf folder](../../meta/images/final_conf_folder.png)

### Upload project data to DBFS

-When run on Databricks, Kedro cannot access data stored in your project's directory. Therefore, you will need to upload your project's data to an accessible location. In this guide, we will store the data on the Databricks File System (DBFS). The PySpark Iris starter contains an environment that is set up to access data stored in DBFS (`conf/databricks`). To learn more about environments in Kedro configuration, see the [configuration documentation](../configuration/configuration_basics.md#configuration-environments).
+When run on Databricks, Kedro cannot access data stored in your project's directory. Therefore, you will need to upload your project's data to an accessible location. In this guide, we will store the data on the Databricks File System (DBFS). The PySpark Iris starter contains an environment that is set up to access data stored in DBFS (`conf/databricks`). To learn more about environments in Kedro configuration, see the [configuration documentation](../../configuration/configuration_basics.md#configuration-environments).

There are several ways to upload data to DBFS. In this guide, we recommend the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/dbfs-cli.html) for its convenience. At the command line in your local environment, use the following Databricks CLI command to upload your locally stored data to DBFS:
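(The hunk cuts off before the command itself. For reference, a typical invocation of the Databricks CLI looks like the sketch below; the `dbfs:/FileStore/iris-databricks/data` target path is an assumption, not taken from the diff.)

```bash
# Recursively copy the project's local data/ folder to DBFS.
# The target path below is illustrative; adjust it to your project.
databricks fs cp --recursive data/ dbfs:/FileStore/iris-databricks/data
```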

Expand Down Expand Up @@ -169,7 +169,7 @@ Now that your project is available on Databricks, you can run it on a cluster us

To run the Python code from your Databricks repo, [create a new Python notebook](https://docs.databricks.com/notebooks/notebooks-manage.html#create-a-notebook) in your workspace. Name it `iris-databricks` for traceability and attach it to your cluster:

-![Create a new notebook on Databricks](../meta/images/databricks_notebook_creation.png)
+![Create a new notebook on Databricks](../../meta/images/databricks_notebook_creation.png)

### Run your project

Expand Down Expand Up @@ -201,15 +201,15 @@ session.run()

After completing these steps, your notebook should match the following image:

-![Databricks completed notebook](../meta/images/databricks_finished_notebook.png)
+![Databricks completed notebook](../../meta/images/databricks_finished_notebook.png)

Run the completed notebook using the `Run All` button in the top right of the UI:

-![Databricks notebook run all](../meta/images/databricks_run_all.png)
+![Databricks notebook run all](../../meta/images/databricks_run_all.png)

On your first run, you will be prompted to consent to analytics; type `y` or `N` in the field that appears and press `Enter`:

-![Databricks notebook telemetry consent](../meta/images/databricks_telemetry_consent.png)
+![Databricks notebook telemetry consent](../../meta/images/databricks_telemetry_consent.png)

You should see logging output while the cell is running. After execution finishes, you should see output similar to the following:

docs/source/{integrations → deployment/databricks}/databricks_visualisation.md
@@ -1,6 +1,6 @@
# Visualise a Kedro project in Databricks notebooks

-[Kedro-Viz](../visualisation/kedro-viz_visualisation.md) is a tool that enables you to visualise your Kedro pipeline and metrics generated from your data science experiments. It is a standalone web application that runs on a web browser, it can be run on a local machine or in Databricks notebooks.
+[Kedro-Viz](../../visualisation/kedro-viz_visualisation.md) is a tool that enables you to visualise your Kedro pipeline and metrics generated from your data science experiments. It is a standalone web application that runs on a web browser, it can be run on a local machine or in Databricks notebooks.

For Kedro-Viz to run with your Kedro project, you need to ensure that both packages are installed in the same scope (notebook-scoped vs. cluster library). This means that if you `%pip install kedro` from inside your notebook, you should also `%pip install kedro-viz` from inside your notebook.
If your cluster already comes with Kedro installed as a library, you should also add Kedro-Viz as a [cluster library](https://docs.microsoft.com/en-us/azure/databricks/libraries/cluster-libraries).
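For instance, a notebook-scoped setup that keeps both packages in the same scope is simply:

```python
# Install both packages at notebook scope so Kedro-Viz and Kedro share one environment.
%pip install kedro
%pip install kedro-viz
```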
@@ -15,8 +15,8 @@ Kedro-Viz can then be launched in a new browser tab with the `%run_viz` line mag

This command presents you with a link to the Kedro-Viz web application.

-![databricks_viz_link](../meta/images/databricks_viz_link.png)
+![databricks_viz_link](../../meta/images/databricks_viz_link.png)

Clicking this link opens a new browser tab running Kedro-Viz for your project.

-![databricks_viz_demo](../meta/images/databricks_viz_demo.png)
+![databricks_viz_demo](../../meta/images/databricks_viz_demo.png)
docs/source/{integrations → deployment/databricks}/databricks_workspace.md
@@ -1,13 +1,13 @@
-# Develop a project with Databricks Workspace and Notebooks
+# Databricks notebooks workflow

This tutorial uses the [PySpark Iris Kedro Starter](https://github.com/kedro-org/kedro-starters/tree/main/pyspark-iris) to illustrate how to bootstrap a Kedro project using Spark and deploy it to a [Databricks cluster on AWS](https://databricks.com/aws).

```{note}
-If you are using [Databricks Repos](https://docs.databricks.com/repos/index.html) to run a Kedro project then you should [disable file-based logging](../logging/logging.md#disable-file-based-logging). This prevents Kedro from attempting to write to the read-only file system.
+If you are using [Databricks Repos](https://docs.databricks.com/repos/index.html) to run a Kedro project then you should [disable file-based logging](../../logging/logging.md#disable-file-based-logging). This prevents Kedro from attempting to write to the read-only file system.
```

```{note}
-If you are a Kedro contributor looking for information on deploying a custom build of Kedro to Databricks, see the [development guide](../contribution/development_for_databricks.md).
+If you are a Kedro contributor looking for information on deploying a custom build of Kedro to Databricks, see the [development guide](../../contribution/development_for_databricks.md).
```

## Prerequisites
@@ -144,11 +144,11 @@ The project has now been pushed to your private GitHub repository, and in order
3. Press `Edit`
4. Go to the `Advanced Options` and then `Spark`

-![](../meta/images/databricks_cluster_edit.png)
+![](../../meta/images/databricks_cluster_edit.png)

Then, in the `Environment Variables` section, add your `GITHUB_USER` and `GITHUB_TOKEN` as shown in the picture:

-![](../meta/images/databricks_cluster_env_vars.png)
+![](../../meta/images/databricks_cluster_env_vars.png)
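(For reference, the two entries added in that dialog take the following shape; the placeholder values are illustrative.)

```bash
# Spark environment variables on the cluster; substitute your own values.
GITHUB_USER=<your-github-username>
GITHUB_TOKEN=<your-personal-access-token>
```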


```{note}
@@ -227,16 +227,16 @@ You should get a similar output:

Your complete notebook should look similar to this (the results are hidden):

-![](../meta/images/databricks_notebook_example.png)
+![](../../meta/images/databricks_notebook_example.png)


### 9. Using the Kedro IPython Extension

You can interact with Kedro in Databricks through the Kedro [IPython extension](https://ipython.readthedocs.io/en/stable/config/extensions/index.html), `kedro.ipython`.

-The Kedro IPython extension launches a [Kedro session](../kedro_project_setup/session.md) and makes available the useful Kedro variables `catalog`, `context`, `pipelines` and `session`. It also provides the `%reload_kedro` [line magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that reloads these variables (for example, if you need to update `catalog` following changes to your Data Catalog).
+The Kedro IPython extension launches a [Kedro session](../../kedro_project_setup/session.md) and makes available the useful Kedro variables `catalog`, `context`, `pipelines` and `session`. It also provides the `%reload_kedro` [line magic](https://ipython.readthedocs.io/en/stable/interactive/magics.html) that reloads these variables (for example, if you need to update `catalog` following changes to your Data Catalog).

-The IPython extension can be used in a Databricks notebook in a similar way to how it is used in [Jupyter notebooks](../notebooks_and_ipython/kedro_and_notebooks.md).
+The IPython extension can be used in a Databricks notebook in a similar way to how it is used in [Jupyter notebooks](../../notebooks_and_ipython/kedro_and_notebooks.md).

If you encounter a `ContextualVersionConflictError`, it is likely caused by Databricks using an old version of `pip`. There is therefore one additional step to perform in the Databricks notebook before you can use the IPython extension. After you load the IPython extension using the command below:

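(The command is cut off by the truncated hunk; for the `kedro.ipython` extension named above, it is presumably:)

```python
%load_ext kedro.ipython
```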
11 changes: 11 additions & 0 deletions docs/source/deployment/databricks/index.md
@@ -0,0 +1,11 @@
# Databricks


```{toctree}
:maxdepth: 1
databricks_workspace.md
databricks_visualisation
databricks_development_workflow
databricks_deployment_workflow
```
4 changes: 2 additions & 2 deletions docs/source/deployment/index.md
@@ -30,7 +30,7 @@ The following pages provide information for deployment to, or integration with,
* [AWS Step functions](aws_step_functions.md)
* [Azure](azure.md)
* [Dask](dask.md)
-* [Databricks](../integrations/databricks_workspace.md)
+* [Databricks](./databricks/index.md)
* [Kubeflow Workflows](kubeflow.md)
* [Prefect](prefect.md)
* [Vertex AI](vertexai.md)
@@ -55,7 +55,7 @@ amazon_sagemaker
aws_step_functions
azure
dask
-databricks
+databricks/index
kubeflow
prefect
vertexai
2 changes: 1 addition & 1 deletion docs/source/index.rst
@@ -125,7 +125,7 @@ Welcome to Kedro's documentation!
.. toctree::
:maxdepth: 2

-integrations/index.md
+integrations/pyspark_integration.md

.. toctree::
:maxdepth: 2
21 changes: 0 additions & 21 deletions docs/source/integrations/index.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/source/integrations/pyspark_integration.md
@@ -1,4 +1,4 @@
-# Build a Kedro pipeline with PySpark
+# PySpark integration

This page outlines some best practices when building a Kedro pipeline with [`PySpark`](https://spark.apache.org/docs/latest/api/python/index.html). It assumes a basic understanding of both Kedro and `PySpark`.

5 binary image files changed (not shown).
