-
Notifications
You must be signed in to change notification settings - Fork 601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[HfFileSystem] Support quoted revisions in path #1888
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
@@ -27,7 +27,7 @@ | |||
# Regex used to match special revisions with "/" in them (see #1710) | |||
SPECIAL_REFS_REVISION_REGEX = re.compile( | |||
r""" | |||
(^refs\/convert\/parquet) # `refs/convert/parquet` revisions | |||
(^refs\/convert\/\w+) # `refs/convert/parquet` revisions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(kinda unrelated change)
just in case we add other convert branches 🙃
(we're currently discussing moving the duckdb indexes - could be in the parquet branch but named differently or in another convert/ branch)
@@ -379,7 +387,7 @@ def _ls_tree( | |||
def glob(self, path, **kwargs): | |||
# Set expand_info=False by default to get a x10 speed boost | |||
kwargs = {"expand_info": kwargs.get("detail", False), **kwargs} | |||
path = self.resolve_path(path).unresolve() | |||
path = self.resolve_path(path, revision=kwargs.get("revision")).unresolve() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this ensures all the returned paths will start with path
(glob filters out paths that don't start with the input path)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Two nits:
@@ -43,10 +43,17 @@ class HfFileSystemResolvedPath: | |||
repo_id: str | |||
revision: str | |||
path_in_repo: str | |||
# The part placer after '@' in the path (can even be a quoted or unquoted special refs revision) | |||
_revision_in_path: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_revision_in_path: Optional[str] = None | |
_revision_in_path: Optional[str] = filed(repr=False, default=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
field?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it was fixed to "field" ;)
tests/test_hf_file_system.py
Outdated
@@ -310,14 +340,14 @@ def test_find_root_directory_no_revision_with_incomplete_cache(self): | |||
repo_type="dataset", | |||
) | |||
|
|||
files = self.hffs.find(self.hf_path, detail=True) | |||
files = copy.deepcopy(self.hffs.find(self.hf_path, detail=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need copy.deepcopy
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the next lines we .pop()
the cache, which also modifies the files
dictionary and make the test fail - maybe I can fix find
to not have to do that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @lhoestq! As discussed on slack, this is indeed the way it should be to ensure robustness. I left a couple of comments straightforward to address otherwise looks good to me.
# The part placer after '@' in the path (can even be a quoted or unquoted special refs revision) | ||
_revision_in_path: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename the attribute if that's fine with you to really emphasize it is the revision as provided by the user. I found revision_in_path
to be slightly vague. I suggested _raw_revision
but happy for other suggestions.
(would have to be updated in the PR)
# The part placer after '@' in the path (can even be a quoted or unquoted special refs revision) | |
_revision_in_path: Optional[str] = None | |
# The part placed after '@' in the initial path. It can be a quoted or unquoted refs revision. | |
# Used to reconstruct the unresolved path to return to the user. | |
_raw_revision: Optional[str] = None |
return f"{repo_path}@{safe_revision(self.revision)}/{self.path_in_repo}".rstrip("/") | ||
if self._revision_in_path: | ||
return f"{repo_path}@{self._revision_in_path}/{self.path_in_repo}".rstrip("/") | ||
elif self.revision != "main": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit)
elif self.revision != "main": | |
elif self.revision != DEFAULT_REVISION: |
resolved_path.repo_type, | ||
resolved_path.repo_id, | ||
resolved_path.revision, | ||
"", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) I know it's not related to your PR but let's be explicit about what the empty string means
"", | |
path_in_repo="", |
resolved_path.repo_type, | ||
resolved_path.repo_id, | ||
resolved_path.revision, | ||
"", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) same here
"", | |
path_in_repo="", |
tests/test_hf_file_system.py
Outdated
@@ -310,14 +340,14 @@ def test_find_root_directory_no_revision_with_incomplete_cache(self): | |||
repo_type="dataset", | |||
) | |||
|
|||
files = self.hffs.find(self.hf_path, detail=True) | |||
files = copy.deepcopy(self.hffs.find(self.hf_path, detail=True)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need a deep copy? Is it because files
is a reference to an internal object of self.hffs.find
? If that's the case it would be a bit worrying 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I can fix find
to return a copy of the cache instead of the cache content itself (that users could modify)
I ended up moving the copy.deepcopy to the return statements of I also renamed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes @lhoestq!
(failing tests are unrelated so feel free to merge)
Added support quoted revisions in paths in HfFileSystem, and return paths formatted the same way.
This potentially supersedes #1889 (I find it simpler)
Now this code works (otherwise returns empty directory for quoted revision)
prints
One consequence is that revisions are not unquoted in the returned paths, but I feel it is the expected behavior to return paths in the same format (except if revision is overridden explicitly using a keyword argument)