add uc volumes support #82

Merged
merged 5 commits into main
Oct 7, 2024

Conversation

hsiehkl
Contributor

@hsiehkl hsiehkl commented Oct 2, 2024

Description

This PR adds support for uploading to Unity Catalog Volumes.

Issue:
After migrating to tropic3.5, we cannot correctly install projects on the cluster due to permission issues (see here and here). The release notes indicate that storing libraries in the DBFS root is deprecated and disabled by default, and that Unity Catalog Volumes are recommended instead.

We could resolve this issue by uploading our projects to /Volumes. However, only Databricks CLI versions above 0.214.0 support interaction with Volumes; older versions raise an "Authorization failed. Your token may be expired or lack the valid scope" error. The current databricks-cli is version 0.18 and won't be updated further. One option is to use the databricks-sdk, which is still in beta. During integration I found the SDK's functionality to be lacking in places, but it still unblocks us from using a cluster with Unity Catalog enabled.
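
For illustration, a minimal sketch of such an SDK-based upload, assuming credentials are picked up from environment variables or ~/.databrickscfg; the volume path and wheel name are placeholders, not db-rocket's actual defaults:

from io import BytesIO

from databricks.sdk import WorkspaceClient

# Sketch only: replace the placeholder volume path with a real Unity Catalog volume.
client = WorkspaceClient()  # reads host/token from the environment or ~/.databrickscfg
with open("dist/my_project-0.1.0-py3-none-any.whl", "rb") as wheel:
    client.files.upload(
        "/Volumes/<catalog>/<schema>/<volume>/temp/my_project-0.1.0-py3-none-any.whl",
        BytesIO(wheel.read()),
        overwrite=True,
    )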

The default path of upload is here.

notebook of installing ranking in tropic3.5

Added tests?

  • 👍 yes
  • 🙅 no, because they aren't needed

Added to documentation?

  • 👍 README.md
  • 👍 CHANGELOG.md
  • 👍 Additional documentation in /docs
  • 👍 Relevant code documentation
  • 🙅 no, because they aren’t needed

@hsiehkl hsiehkl marked this pull request as ready for review October 2, 2024 14:10
@hsiehkl hsiehkl requested a review from a team as a code owner October 2, 2024 14:10
@hsiehkl hsiehkl requested a review from oliviagyg October 2, 2024 14:10
setup.py Outdated
Contributor

@oliviagyg oliviagyg left a comment

LGTM!

Contributor

@steven-mi steven-mi left a comment

LGTM. It would be great if you could make sure it works with both the old and the new cluster.

-    glob_path: Optional[Union[str, List[str]]] = None
+    glob_path: Optional[Union[str, List[str]]] = None,
+    use_volumes: Optional[bool] = False,
+    dst_path: Optional[str] = None,
Contributor

Maybe we should deprecate dbfs_path, since its use case can be covered with dst_path.

Contributor Author

I added a warning there, so people can still use it for now, and we can deprecate it in the next version.
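
For illustration only, such a warning could look roughly like this; the helper name and message are hypothetical, not taken from the PR:

import warnings

def _warn_if_dbfs_path_used(dbfs_path):
    # Hypothetical sketch: keep `dbfs_path` working for now while nudging users toward `dst_path`.
    if dbfs_path is not None:
        warnings.warn(
            "`dbfs_path` is deprecated and may be removed in a future version; use `dst_path` instead.",
            DeprecationWarning,
        )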

@@ -54,49 +53,53 @@ def launch(
     project_location: str = ".",
     dbfs_path: Optional[str] = None,
     watch: bool = True,
-    glob_path: Optional[Union[str, List[str]]] = None
+    glob_path: Optional[Union[str, List[str]]] = None,
+    use_volumes: Optional[bool] = False,
Contributor

Is this needed?

Contributor Author

Do you mean use_volumes? We don't want to change the current behaviour, only add an option for using Volumes.

rocket/rocket.py Outdated

if dbfs_path is not None and not dbfs_path.startswith("dbfs:/"):
    raise Exception("`dbfs_path` must start with dbfs:/")

try:
    execute_shell_command(f"databricks fs ls dbfs:/")
Contributor

Can we add some validation here that the Databricks workspace API works?

Contributor

Good point, I've added some :)
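
The exact check added in the PR is not quoted in this thread; as an assumption, a minimal SDK-based validation could look something like this:

from databricks.sdk import WorkspaceClient

def _validate_workspace_access() -> WorkspaceClient:
    # Hypothetical sketch: fail early if the SDK cannot reach the workspace.
    try:
        client = WorkspaceClient()  # credentials from the environment or ~/.databrickscfg
        client.current_user.me()  # cheap call that fails fast on bad or missing credentials
        return client
    except Exception as e:
        raise Exception(
            "Error accessing the Databricks workspace via databricks-sdk. "
            f"Please check that your Databricks host and token are configured. Error details: {e}"
        )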

try:
    execute_shell_command(f"databricks fs ls dbfs:/")
except Exception as e:
    raise Exception(
        f"Error accessing DBFS via databricks-cli. Please check if your databricks token is set and valid? Try to generate a new token and update existing one with `databricks configure --token`. Error details: {e}"
    )

if not dbfs_path:
    dbfs_path = f"dbfs:/temp/{os.environ['USER']}"
if use_volumes:
Contributor

Maybe something like if is_dbfs_path

Contributor

If we did that, we'd need to choose some default behavior in case no path is provided (since dst_path is optional), so it would have to be something like:

if dst_path is not None and is_dbfs_path(dst_path)

And so you wouldn't be able to upload to dbfs without providing some path, which would be backwards incompatible.
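
To make the trade-off concrete, here is a hypothetical sketch of how an opt-in flag keeps the old per-user DBFS default while still allowing Volumes uploads; the names and the placeholder volume path are illustrative, not taken from the PR:

import os
from typing import Optional

def resolve_destination(dbfs_path: Optional[str], dst_path: Optional[str], use_volumes: bool) -> str:
    # Hypothetical helper: the opt-in flag decides between Volumes and the old DBFS default.
    if use_volumes:
        return dst_path or "/Volumes/<catalog>/<schema>/<volume>/temp"
    return dst_path or dbfs_path or f"dbfs:/temp/{os.environ['USER']}"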

setup.py Outdated
@@ -9,15 +9,15 @@

 setuptools.setup(
     name="databricks-rocket",
-    version="2.1.0",
+    version="2.2.0",
Contributor

Let's set it as a major version update

Contributor Author

so bump it to 3.0.0?

Contributor

It's backwards compatible, so I would think it should be a minor version update.

author="GetYourGuide",
author_email="engineering.data-products@getyourguide.com",
description="Keep your local python scripts installed and in sync with a databricks notebook. Shortens the feedback loop to develop projects using a hybrid enviroment",
long_description=readme,
long_description_content_type="text/markdown",
url="https://github.com/getyourguide/db-rocket",
packages=setuptools.find_packages(),
install_requires=["fire", "watchdog~=2.1.9", "build", "databricks_cli"],
install_requires=["fire", "watchdog~=2.1.9", "build", "databricks_cli", "databricks-sdk==0.33.0"],
Contributor

Do we need to pin the version?

Contributor Author

As you mentioned, this SDK is still in beta. While integrating it, I looked into the project and a lot of things seem to be under development, so pinning the version guarantees things don't break.

@oliviagyg oliviagyg merged commit 751449d into main Oct 7, 2024
3 checks passed