-
Notifications
You must be signed in to change notification settings - Fork 2.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
No need for dataset_info #7234
No need for dataset_info #7234
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
hf_api = HfApi(config.HF_ENDPOINT) | ||
# Get the Dataset Card + get the revision + check authentication all at in one call | ||
# We fix the commit_hash in case there are new commits in the meantime | ||
api = HfApi( | ||
endpoint=config.HF_ENDPOINT, | ||
token=download_config.token, | ||
library_name="datasets", | ||
library_version=__version__, | ||
user_agent=get_datasets_user_agent(download_config.user_agent), | ||
) | ||
try: | ||
dataset_info = hf_api.dataset_info( | ||
_raise_if_offline_mode_is_enabled() | ||
dataset_readme_path = api.hf_hub_download( | ||
repo_id=path, | ||
filename=config.REPOCARD_FILENAME, | ||
repo_type="dataset", | ||
revision=revision, | ||
token=download_config.token, | ||
timeout=100.0, | ||
proxies=download_config.proxies, | ||
) | ||
commit_hash = os.path.basename(os.path.dirname(dataset_readme_path)) | ||
except EntryNotFoundError as e: | ||
if "internet connection" in str(e).lower(): | ||
raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({e.__class__.__name__})") from e | ||
commit_hash = api.dataset_info( | ||
path, | ||
revision=revision, | ||
timeout=100.0, | ||
).sha |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
main change is here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
except EntryNotFoundError as e:
if "internet connection" in str(e).lower():
raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({e.__class__.__name__})") from e
Do you have an example of EntryNotFoundError
where the body contains "internet connection"
and the initial error is not explicit enough? I'm a bit wary of relying on exception message as it can be changed without prior notice in huggingface_hub
. Also if the problem is that the exception message is not good enough, it's preferable to fix it in huggingface_hub
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this once is when the connection to HF fails, it raises LocalEntryNotFoundError because it can't find the cached file when there is no connection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And isn't the raise LocalEntryNotFoundError from ...
enough for the users?
except EntryNotFoundError: | ||
# Use the infos from the parquet export except in some cases: | ||
if data_dir or data_files or (revision and revision != "main"): | ||
use_exported_dataset_infos = False | ||
else: | ||
use_exported_dataset_infos = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(I also moved this from HubDatasetModuleFactoryWithoutScript directly here since I don't pass the requestedrevision
anymore but the fixed commit_hash
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the ping @lhoestq!
The overall logic looks good. I've left a few nits/questions below but feel free to move forward
try: | ||
dataset_info = hf_api.dataset_info( | ||
_raise_if_offline_mode_is_enabled() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if this line is not introduced? huggingface_hub
is supposed to raise if HF_HUB_OFFLINE
is set.
If it's really needed, I would move it outside of the try/except, to make it clear that it's not related.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this line has just been moved and handles HF_DATASETS_OFFLINE
merging this one for now, let me know if you'd like to see additional changes for error handling (I'll take care of them before doing a release) |
save a useless call to /api/datasets/repo_id