Add automated tooling and a guide for building and installing Kedro on a Databricks cluster #2369
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
1. Python **version >=3.8** installed.
2. A Python virtual environment activated in which the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html) is installed with [authentication for your workspace](https://docs.databricks.com/dev-tools/cli/index.html#set-up-the-cli).
3. Access to a Databricks workspace with an [existing cluster](https://docs.databricks.com/clusters/create-cluster.html).
3. Access to a Databricks workspace with an [existing cluster](https://docs.databricks.com/clusters/create-cluster.html).
3. Access to a Databricks workspace on an [existing cluster](https://docs.databricks.com/clusters/create-cluster.html).
No mention of type of cluster, e.g. AWS? We have a deployment guide and it seems to only support AWS at present, so is that a prerequisite or can any cloud provider cluster be used? Probably worth saying or pointing to a relevant databricks link?
This is a good point, though creating a cluster on Databricks is actually a cloud-agnostic process. The linked documentation covers the topic really well, and the user can switch the cloud provider in the top right to view the (very minor) differences for their specific cloud. For this reason I don't think it's important that we specify a cloud here.
…ro-org/kedro into feat/add-databricks-build-automation Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Awesome, thanks so much for the feedback @stichbury. I have addressed the comments you left in my latest round of commits.
This page is for people developing Kedro; users working on their own Kedro projects should see the documentation for [deploying Kedro projects on Databricks](../deployment/databricks.md).
``` | ||
This guide describes how to efficiently develop features and fixes for Kedro on Databricks. Using this guide, you will be able to quickly test your locally modified version of Kedro on Databricks as part of a build-and-test development cycle.
I'd be tempted to move these two sentences to precede the note rather than follow it, but will leave it up to you.
That's fair, it's probably bad form to start a section with a note. Fixed this.
4. Queues your updated wheel file for installation
5. Restarts your cluster to apply the changes.

Note that your cluster will be unavailable while it restarts. You can poll the status of the cluster using the Databricks CLI command `databricks clusters get --cluster-id <your-cluster-id> | grep state` (macOS, Linux).
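A platform-independent way to do the same polling could look like the sketch below. It is only an illustration, not part of this PR: the function names are hypothetical, and it assumes the `databricks clusters get --cluster-id` command quoted above prints a JSON document with a top-level `state` field whose value is `RUNNING` once the restart has finished.

```python
import json
import subprocess
import time


def parse_cluster_state(cli_output: str) -> str:
    """Extract the `state` field from the JSON printed by the CLI (assumed format)."""
    return json.loads(cli_output)["state"]


def wait_until_running(
    cluster_id: str, interval_seconds: float = 15.0, max_attempts: int = 40
) -> bool:
    """Poll the cluster until it reports RUNNING; return False if we give up."""
    for _ in range(max_attempts):
        # Same command as in the guide, without the shell-specific `grep`.
        result = subprocess.run(
            ["databricks", "clusters", "get", "--cluster-id", cluster_id],
            capture_output=True,
            text=True,
            check=True,
        )
        if parse_cluster_state(result.stdout) == "RUNNING":
            return True
        time.sleep(interval_seconds)
    return False
```

Because the JSON is parsed in Python rather than filtered with `grep`, the same call would work unchanged on macOS, Linux and Windows, provided the CLI is installed and authenticated.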
Still not clear what Windows users do
You're right about this, I'm sorry I missed it. I updated both this command and the previous Windows one with PowerShell commands that I verified as working.
Nice! Thanks for the updates. Just a couple of minor comments, but this looks great to me.
This looks awesome @jmholzer! 🌟 Left some minor comments, and have tested it manually and it all seems to work. Haven't gone into the Databricks API code that deeply but it all looks good and will look into it further during the session.
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
} | ||
``` | ||
Any runs of a Kedro project on this cluster will now reflect your latest local changes to Kedro. You can now test your changes to Kedro by using your cluster to run a Kedro project.
I'm guessing that with any new change you'd have to run `make databricks-build` again?
That's right, this is a one-off run. The alternative would be to create a process which watches `/dist`, using e.g. `dbx sync`.
I think this makes sense for now. A continuous sync process might be handy if we're building the kedro-databricks plugin, but by that time there might already be another way to do that.
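For reference, the core of such a watch process can be small. The sketch below is hypothetical and not part of this PR (it does not use `dbx sync`): it compares two snapshots of the wheel files in a `dist` directory by modification time and reports which ones are new or rebuilt. The helper names `snapshot` and `changed_wheels` are made up for illustration.

```python
from pathlib import Path
from typing import Dict, List


def snapshot(dist_path: Path) -> Dict[str, float]:
    """Map each wheel file name in dist_path to its last-modified time."""
    return {path.name: path.stat().st_mtime for path in dist_path.glob("*.whl")}


def changed_wheels(before: Dict[str, float], after: Dict[str, float]) -> List[str]:
    """Return the names of wheels that are new or modified since `before`."""
    return sorted(
        name for name, mtime in after.items() if before.get(name) != mtime
    )
```

A watcher would take a snapshot, sleep for a few seconds, take another, and trigger the upload step whenever `changed_wheels` returns a non-empty list.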
"""Get the path of the whl file to install. If multiple whl files are found, | ||
return the file with the highest version number.
"""
dist_path = Path(__file__).resolve().parent.parent / "dist"
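To illustrate what "highest version number" can mean for wheel files, here is a minimal sketch. It is not the code from `tools/databricks_build.py`: the names `version_key` and `latest_wheel` are hypothetical, and it assumes file names follow the usual `name-version-...whl` pattern with purely numeric release segments (pre-release and local version tags are not ordered correctly by this simplification).

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def version_key(wheel_name: str) -> Tuple[int, ...]:
    """Turn a name like 'kedro-0.18.6-py3-none-any.whl' into (0, 18, 6)."""
    version = wheel_name.split("-")[1]
    return tuple(int(part) for part in re.findall(r"\d+", version))


def latest_wheel(dist_path: Path) -> Optional[Path]:
    """Return the wheel in dist_path with the highest version, or None if empty."""
    wheels = list(dist_path.glob("*.whl"))
    if not wheels:
        return None
    # Compare versions numerically so that 0.18.10 ranks above 0.18.6.
    return max(wheels, key=lambda path: version_key(path.name))
```

Comparing tuples of integers avoids the pitfall of plain string sorting, where `0.18.10` would sort before `0.18.6`.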
Maybe it would be better to clean `dist` after each run of `make databricks-build`
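One possible shape for that cleanup, as a sketch only: the hypothetical helper below deletes the wheel files in a given `dist` directory and leaves everything else in place. Whether to run it before or after the build, and whether to wire it into the Makefile target, is left open here.

```python
from pathlib import Path


def clean_dist(dist_path: Path) -> int:
    """Delete every .whl file in dist_path and return how many were removed."""
    removed = 0
    for wheel in dist_path.glob("*.whl"):
        wheel.unlink()
        removed += 1
    return removed
```

With stale wheels removed on every run, the build script would only ever find the wheel it has just built.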
Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Awesome job with this, thank you! 🌟
…n a Databricks cluster (kedro-org#2369)

* Add databricks_build.py Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Add databricks-build target to Makefile Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Add new documentation page Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Complete first draft of the documentation page for databricks-build Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Add verification step Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Remove spurious newlines at end of file Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Modify subtitle capitalisation Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Lint Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Add databricks guide to index Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Add release note Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Add note directing Kedro users to Databricks deployment docs Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Add note for Kedro developers to project deployment guide Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Remove 'prerequisites' Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Modify numbers to bullet points Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Remove spurious reference to 'version' Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Remove spurious 'installed' Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Modify git repo requirement Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Add Windows env variable instructions and cluster ID example Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Add 'In your terminal,' to instruction Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Add final sentence Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Reformulate final sentence Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Reformulate python requirement Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Change order of introductory note Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
* Add PowerShell commands Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
* Update docs/source/contribution/development_for_databricks.md Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
* Update tools/databricks_build.py Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>

---------
Signed-off-by: Jannic Holzer <jannic.holzer@quantumblack.com>
Co-authored-by: Jo Stichbury <jo_stichbury@mckinsey.com>
Co-authored-by: Sajid Alam <90610031+SajidAlamQB@users.noreply.github.com>
Signed-off-by: Danny Farah <danny.farah@quantumblack.com>
Description
Resolves #2318.
This PR adds tooling for building a local version of Kedro and uploading it to the developer's Databricks cluster. This greatly shortens the development cycle and is an important step towards improving Kedro for Databricks.
It also adds documentation for using this tooling.
Checklist
`RELEASE.md` file