Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HfFileSystem] Support quoted revisions in path #1888

Merged
merged 7 commits into from
Dec 6, 2023

Conversation

lhoestq
Copy link
Member

@lhoestq lhoestq commented Dec 5, 2023

Added support quoted revisions in paths in HfFileSystem, and return paths formatted the same way.

This potentially supersedes #1889 (I find it simpler)

Now this code works (otherwise returns empty directory for quoted revision)

from huggingface_hub import HfFileSystem

fs = HfFileSystem()
path = "hf://datasets/squad/*"
print(fs.glob(path))
path = "hf://datasets/squad@refs%2Fconvert%2Fparquet/plain_text/*"
print(fs.glob(path))
path = "hf://datasets/squad@refs/convert/parquet/plain_text/*"
print(fs.glob(path))
path = "hf://datasets/squad/plain_text/*"
print(fs.glob(path, revision="refs/convert/parquet"))

prints

['datasets/squad/.gitattributes', 'datasets/squad/README.md', 'datasets/squad/dataset_infos.json', 'datasets/squad/squad.py']
['datasets/squad@refs%2Fconvert%2Fparquet/plain_text/train', 'datasets/squad@refs%2Fconvert%2Fparquet/plain_text/validation']
['datasets/squad@refs/convert/parquet/plain_text/train', 'datasets/squad@refs/convert/parquet/plain_text/validation']
['datasets/squad@refs/convert/parquet/plain_text/train', 'datasets/squad@refs/convert/parquet/plain_text/validation']

One consequence is that revisions are not unquoted in the returned paths, but I feel it is the expected behavior to return paths in the same format (except if revision is overridden explicitly using a keyword argument)

@lhoestq lhoestq changed the title Support quoted revisions [HfFileSystem] Support quoted revisions Dec 5, 2023
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@lhoestq lhoestq changed the title [HfFileSystem] Support quoted revisions [HfFileSystem] Support quoted revisions in path Dec 5, 2023
@lhoestq lhoestq marked this pull request as ready for review December 5, 2023 15:56
@@ -27,7 +27,7 @@
# Regex used to match special revisions with "/" in them (see #1710)
SPECIAL_REFS_REVISION_REGEX = re.compile(
r"""
(^refs\/convert\/parquet) # `refs/convert/parquet` revisions
(^refs\/convert\/\w+) # `refs/convert/parquet` revisions
Copy link
Member Author

@lhoestq lhoestq Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(kinda unrelated change)

just in case we add other convert branches 🙃

(we're currently discussing moving the duckdb indexes - could be in the parquet branch but named differently or in another convert/ branch)

@@ -379,7 +387,7 @@ def _ls_tree(
def glob(self, path, **kwargs):
# Set expand_info=False by default to get a x10 speed boost
kwargs = {"expand_info": kwargs.get("detail", False), **kwargs}
path = self.resolve_path(path).unresolve()
path = self.resolve_path(path, revision=kwargs.get("revision")).unresolve()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this ensures all the returned paths will start with path (glob filters out paths that don't start with the input path)

@lhoestq lhoestq requested review from Wauplin and mariosasko December 5, 2023 16:09
Copy link
Contributor

@mariosasko mariosasko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Two nits:

@@ -43,10 +43,17 @@ class HfFileSystemResolvedPath:
repo_id: str
revision: str
path_in_repo: str
# The part placer after '@' in the path (can even be a quoted or unquoted special refs revision)
_revision_in_path: Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_revision_in_path: Optional[str] = None
_revision_in_path: Optional[str] = filed(repr=False, default=None)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it was fixed to "field" ;)

@@ -310,14 +340,14 @@ def test_find_root_directory_no_revision_with_incomplete_cache(self):
repo_type="dataset",
)

files = self.hffs.find(self.hf_path, detail=True)
files = copy.deepcopy(self.hffs.find(self.hf_path, detail=True))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need copy.deepcopy here?

Copy link
Member Author

@lhoestq lhoestq Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the next lines we .pop() the cache, which also modifies the files dictionary and make the test fail - maybe I can fix find to not have to do that

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @lhoestq! As discussed on slack, this is indeed the way it should be to ensure robustness. I left a couple of comments straightforward to address otherwise looks good to me.

Comment on lines 46 to 47
# The part placer after '@' in the path (can even be a quoted or unquoted special refs revision)
_revision_in_path: Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rename the attribute if that's fine with you to really emphasize it is the revision as provided by the user. I found revision_in_path to be slightly vague. I suggested _raw_revision but happy for other suggestions.

(would have to be updated in the PR)

Suggested change
# The part placer after '@' in the path (can even be a quoted or unquoted special refs revision)
_revision_in_path: Optional[str] = None
# The part placed after '@' in the initial path. It can be a quoted or unquoted refs revision.
# Used to reconstruct the unresolved path to return to the user.
_raw_revision: Optional[str] = None

return f"{repo_path}@{safe_revision(self.revision)}/{self.path_in_repo}".rstrip("/")
if self._revision_in_path:
return f"{repo_path}@{self._revision_in_path}/{self.path_in_repo}".rstrip("/")
elif self.revision != "main":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit)

Suggested change
elif self.revision != "main":
elif self.revision != DEFAULT_REVISION:

resolved_path.repo_type,
resolved_path.repo_id,
resolved_path.revision,
"",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) I know it's not related to your PR but let's be explicit about what the empty string means

Suggested change
"",
path_in_repo="",

resolved_path.repo_type,
resolved_path.repo_id,
resolved_path.revision,
"",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) same here

Suggested change
"",
path_in_repo="",

@@ -310,14 +340,14 @@ def test_find_root_directory_no_revision_with_incomplete_cache(self):
repo_type="dataset",
)

files = self.hffs.find(self.hf_path, detail=True)
files = copy.deepcopy(self.hffs.find(self.hf_path, detail=True))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a deep copy? Is it because files is a reference to an internal object of self.hffs.find? If that's the case it would be a bit worrying 😕

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I can fix find to return a copy of the cache instead of the cache content itself (that users could modify)

@lhoestq
Copy link
Member Author

lhoestq commented Dec 6, 2023

I ended up moving the copy.deepcopy to the return statements of HfFileSystem._ls_tree and HfFileSystem.info (where we use the dircache) - let me know if it sounds good to you and I can merge.

I also renamed the HfFileSystemResolvedPath attribute _raw_revision

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes @lhoestq!

(failing tests are unrelated so feel free to merge)

@lhoestq lhoestq merged commit 082c1ed into main Dec 6, 2023
13 of 16 checks passed
@lhoestq lhoestq deleted the support-quoted-revisions branch December 6, 2023 12:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants