add uc volumes support #82
Conversation
LGTM!
LGTM. Would be great if you could make sure it works with both the old and the new clusters.
-    glob_path: Optional[Union[str, List[str]]] = None
+    glob_path: Optional[Union[str, List[str]]] = None,
+    use_volumes: Optional[bool] = False,
+    dst_path: Optional[str] = None,
Maybe we deprecate `dbfs_path`, since you can cover the use case with `dst_path`.
I added a warning there so people can still use it now but we can deprecate it in the next version.
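Such a warning could be a standard `DeprecationWarning`. A minimal sketch, assuming a hypothetical helper (`resolve_dst_path` and its parameters are illustrative, not db-rocket's actual code):

```python
import warnings


def resolve_dst_path(dst_path=None, dbfs_path=None, user="unknown"):
    """Hypothetical helper: keep `dbfs_path` working for now, but nudge users to `dst_path`."""
    if dbfs_path is not None:
        warnings.warn(
            "`dbfs_path` is deprecated and may be removed in a future version; use `dst_path` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        return dbfs_path
    # Preserve the existing default when neither path is given.
    return dst_path or f"dbfs:/temp/{user}"
```

This keeps old call sites working while signalling the upcoming removal.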
@@ -54,49 +53,53 @@ def launch(
     project_location: str = ".",
     dbfs_path: Optional[str] = None,
     watch: bool = True,
-    glob_path: Optional[Union[str, List[str]]] = None
+    glob_path: Optional[Union[str, List[str]]] = None,
+    use_volumes: Optional[bool] = False,
Is this needed?
Do you mean `use_volumes`? We don't want to change the current behaviour, only add an option for using volumes.
rocket/rocket.py
Outdated
    if dbfs_path is not None and not dbfs_path.startswith("dbfs:/"):
        raise Exception("`dbfs_path` must start with dbfs:/")

    try:
        execute_shell_command(f"databricks fs ls dbfs:/")
Can we add some validation here that the Databricks workspace API works?
Good point, I've added some :)
    try:
        execute_shell_command(f"databricks fs ls dbfs:/")
    except Exception as e:
        raise Exception(
            f"Error accessing DBFS via databricks-cli. Please check if your databricks token is set and valid? Try to generate a new token and update existing one with `databricks configure --token`. Error details: {e}"
        )

    if not dbfs_path:
        dbfs_path = f"dbfs:/temp/{os.environ['USER']}"
    if use_volumes:
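The branching in the diff above could be factored roughly as follows. This is an illustrative sketch, not the PR's actual implementation; `pick_upload_path` is a hypothetical name:

```python
import os


def pick_upload_path(use_volumes=False, dst_path=None, dbfs_path=None):
    """Illustrative sketch of the use_volumes branching (not db-rocket's actual code)."""
    if use_volumes:
        # Volumes uploads need an explicit destination under /Volumes/.
        if not dst_path or not dst_path.startswith("/Volumes/"):
            raise ValueError("`dst_path` must start with /Volumes/ when use_volumes=True")
        return dst_path
    # Otherwise keep the pre-existing DBFS behaviour and its default.
    return dbfs_path or f"dbfs:/temp/{os.environ.get('USER', 'unknown')}"
```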
Maybe something like if is_dbfs_path
If we did that, we'd need to choose some default behavior in case no path is provided (since `dst_path` is optional), so it would have to be something like `if dst_path is not None and is_dbfs_path(dst_path)`. And then you wouldn't be able to upload to DBFS without providing some path, which would be backwards incompatible.
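For reference, a minimal sketch of the helper under discussion (`is_dbfs_path` is the reviewer's proposed name, not an existing function in db-rocket):

```python
def is_dbfs_path(path):
    """Hypothetical helper matching the suggestion above: detect DBFS destinations."""
    return path is not None and path.startswith("dbfs:/")


def targets_dbfs(dst_path):
    # The condition from the thread: only treat dst_path as DBFS when it is
    # actually provided, so omitting it can keep the old default behaviour.
    return dst_path is not None and is_dbfs_path(dst_path)
```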
setup.py
Outdated
@@ -9,15 +9,15 @@
 setuptools.setup(
     name="databricks-rocket",
-    version="2.1.0",
+    version="2.2.0",
Let's set it as a major version update
so bump it to 3.0.0?
It's backwards compatible, so I would think it should be a minor version update.
     author="GetYourGuide",
     author_email="engineering.data-products@getyourguide.com",
     description="Keep your local python scripts installed and in sync with a databricks notebook. Shortens the feedback loop to develop projects using a hybrid enviroment",
     long_description=readme,
     long_description_content_type="text/markdown",
     url="https://github.com/getyourguide/db-rocket",
     packages=setuptools.find_packages(),
-    install_requires=["fire", "watchdog~=2.1.9", "build", "databricks_cli"],
+    install_requires=["fire", "watchdog~=2.1.9", "build", "databricks_cli", "databricks-sdk==0.33.0"],
Do we need to pin the version?
As you mentioned, this SDK is still in beta. When integrating it, I looked into the project, and a lot of things seem to be under development, so pinning the version guarantees things don't break.
Description
This PR enables support for uploading to Volumes.
Issue:
After migrating to tropic3.5, we cannot correctly install projects in the cluster due to permission issues (see here and here). The release notes indicate that storing libraries in the DBFS root is deprecated and disabled by default, and it is recommended to use Unity Catalog Volumes instead.
We could resolve this issue by uploading our projects to /Volumes. However, only Databricks CLI versions above 0.214.0 support interaction with Volumes; older versions raise an
Authorization failed. Your token may be expired or lack the valid scope
error. The current databricks-cli is version 0.18 and won't be updated further. One option is to use the databricks-sdk, which is still in beta. During integration, I found the SDK's functionality to be lacking, but it can still unblock us from using a cluster with Unity Catalog enabled. The default upload path is here.
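The CLI version threshold mentioned above could be checked programmatically. A small sketch, assuming `major.minor.patch` version strings and treating 0.214.0 itself as the cutoff (`cli_supports_volumes` is a hypothetical helper, not part of db-rocket):

```python
def cli_supports_volumes(version):
    """Per the description above, Databricks CLI releases from 0.214.0 onward
    can interact with Volumes. Illustrative only."""
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= (0, 214, 0)
```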
notebook of installing ranking in tropic3.5
Added tests?
Added to documentation?