Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No need for dataset_info #7234

Merged
merged 7 commits into from
Oct 21, 2024
Merged

No need for dataset_info #7234

merged 7 commits into from
Oct 21, 2024

Conversation

lhoestq
Copy link
Member

@lhoestq lhoestq commented Oct 17, 2024

save a useless call to /api/datasets/repo_id

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@lhoestq lhoestq marked this pull request as ready for review October 18, 2024 15:42
Comment on lines -1578 to +1613
hf_api = HfApi(config.HF_ENDPOINT)
# Get the Dataset Card + get the revision + check authentication all at in one call
# We fix the commit_hash in case there are new commits in the meantime
api = HfApi(
endpoint=config.HF_ENDPOINT,
token=download_config.token,
library_name="datasets",
library_version=__version__,
user_agent=get_datasets_user_agent(download_config.user_agent),
)
try:
dataset_info = hf_api.dataset_info(
_raise_if_offline_mode_is_enabled()
dataset_readme_path = api.hf_hub_download(
repo_id=path,
filename=config.REPOCARD_FILENAME,
repo_type="dataset",
revision=revision,
token=download_config.token,
timeout=100.0,
proxies=download_config.proxies,
)
commit_hash = os.path.basename(os.path.dirname(dataset_readme_path))
except EntryNotFoundError as e:
if "internet connection" in str(e).lower():
raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({e.__class__.__name__})") from e
commit_hash = api.dataset_info(
path,
revision=revision,
timeout=100.0,
).sha
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main change is here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

            except EntryNotFoundError as e:
                if "internet connection" in str(e).lower():
                    raise ConnectionError(f"Couldn't reach '{path}' on the Hub ({e.__class__.__name__})") from e

Do you have an example of EntryNotFoundError where the body contains "internet connection" and the initial error is not explicit enough? I'm a bit wary of relying on exception message as it can be changed without prior notice in huggingface_hub. Also if the problem is that the exception message is not good enough, it's preferable to fix it in huggingface_hub directly.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this once is when the connection to HF fails, it raises LocalEntryNotFoundError because it can't find the cached file when there is no connection

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And isn't the raise LocalEntryNotFoundError from ... enough for the users?

@lhoestq lhoestq requested a review from Wauplin October 18, 2024 15:45
Comment on lines +1669 to +1674
except EntryNotFoundError:
# Use the infos from the parquet export except in some cases:
if data_dir or data_files or (revision and revision != "main"):
use_exported_dataset_infos = False
else:
use_exported_dataset_infos = True
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I also moved this from HubDatasetModuleFactoryWithoutScript directly here since I don't pass the requestedrevision anymore but the fixed commit_hash)

Copy link
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping @lhoestq!
The overall logic looks good. I've left a few nits/questions below but feel free to move forward

try:
dataset_info = hf_api.dataset_info(
_raise_if_offline_mode_is_enabled()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if this line is not introduced? huggingface_hub is supposed to raise if HF_HUB_OFFLINE is set.

If it's really needed, I would move it outside of the try/except, to make it clear that it's not related.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this line has just been moved and handles HF_DATASETS_OFFLINE

src/datasets/load.py Outdated Show resolved Hide resolved
src/datasets/load.py Outdated Show resolved Hide resolved
@lhoestq
Copy link
Member Author

lhoestq commented Oct 21, 2024

merging this one for now, let me know if you'd like to see additional changes for error handling (I'll take care of them before doing a release)

@lhoestq lhoestq merged commit 59738d7 into main Oct 21, 2024
15 checks passed
@lhoestq lhoestq deleted the no-need-for-dataset_info branch October 21, 2024 16:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants