Allow downloading/processing/caching only specific splits #2249
Conversation
if self._download_config.only_splits:
    if isinstance(url_or_urls, dict) and all(
        split in url_or_urls for split in self._download_config.only_splits
    ):
        url_or_urls = {split: url_or_urls[split] for split in self._download_config.only_splits}
That's a great trick! However, as you know, this won't work in all cases, unfortunately.
For example, some datasets use a dictionary with keys that are not related to the splits, since we allow any nested structure to be passed for downloading.
What do you think about adding an argument to _split_generators to tell it to return only a subset of the split generators? In this case it would download only the files related to the requested splits.
Then we can just edit the datasets that can benefit from this feature the most.
The problem is that _split_generators is already implemented in ALL canonical datasets without that parameter...
Maybe we can pass the parameter only if it's part of the signature of _split_generators?
We're already doing that for other optional parameters, like pipeline in BeamBasedBuilder._split_generators:
datasets/src/datasets/builder.py
Lines 1065 to 1067 in 909c58f
split_generators_arg_names = inspect.signature(self._split_generators).parameters.keys()
if "pipeline" in split_generators_arg_names:
    split_generators_kwargs["pipeline"] = prepare_split_kwargs["pipeline"]
I'm fine with this option but let me know if you have other ideas
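The same signature-inspection pattern could carry the proposed splits parameter. A minimal sketch — the helper name `_make_split_generators_kwargs` and the toy `_split_generators` body are illustrative, not the PR's actual code:

```python
import inspect


class DatasetBuilder:
    def _make_split_generators_kwargs(self, requested_splits):
        """Pass `splits` to _split_generators only if its signature accepts it,
        so existing dataset scripts keep working unchanged."""
        kwargs = {}
        arg_names = inspect.signature(self._split_generators).parameters.keys()
        if "splits" in arg_names:
            kwargs["splits"] = requested_splits
        return kwargs

    def _split_generators(self, dl_manager, splits=None):
        # A dataset script that opts in can restrict its work to `splits`.
        all_splits = ["train", "validation", "test"]
        return [s for s in all_splits if splits is None or s in splits]


builder = DatasetBuilder()
kwargs = builder._make_split_generators_kwargs(["train"])
print(builder._split_generators(None, **kwargs))  # ['train']
```

Scripts whose `_split_generators` lacks the parameter simply get an empty kwargs dict and behave as before.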
Now I pass the parameter splits to _split_generators. However, it must be handled properly by the method implementations.
Note that with my previous trick in download, it already works for natural_questions without even touching its current implementation of _split_generators. 😉
That is because in natural_questions:
_DOWNLOAD_URLS = {
"train": ["%s/train/nq-train-%02d.jsonl.gz" % (_BASE_DOWNLOAD_URL, i) for i in range(50)],
"validation": ["%s/dev/nq-dev-%02d.jsonl.gz" % (_BASE_DOWNLOAD_URL, i) for i in range(5)],
}
Love it! Thank you :)
I added some comments:
        url_or_urls.
        """
        if self._download_config.splits:
            if isinstance(url_or_urls, dict) and all(split in url_or_urls for split in self._download_config.splits):
I would add an additional check just to avoid unwanted behaviors.
For example, right now if a dataset script passes the dl_manager a dict like this:
{"main_data": url_to_main_data, "secondary_data": url_to_sec_data}
then this trick here would use url_or_urls = {}, since the keys of the dict are not split names.
Maybe you could check that sorted(url_or_urls.keys()) == sorted(self._download_config.splits) before filtering?
Edit: I meant a dict like this:
{"main_metadata": url_to_main_data, "secondary_metadata": url_to_sec_data, "train": url_train_data, "test": url_test_data}
Hi @lhoestq. Sorry, I'm not sure I understand what you mean... 😅
What I am checking here is that all the keys in self._download_config.splits are a subset of the keys in url_or_urls.
If you pass a dictionary like this:

{"main_metadata": url_to_main_data,
 "secondary_metadata": url_to_sec_data,
 "train": url_train_data,
 "test": url_test_data}

then only the train or test keys will be kept, which I feel is not intuitive.
For example, if the user asks to load the "train" split, then the main and secondary metadata won't be downloaded.
You can fix that by keeping all the keys except the splits to ignore.
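That suggestion — drop only the keys that are known split names but were not requested, and keep everything else (metadata, etc.) — could look like this sketch; the helper name `filter_requested_splits` and the `KNOWN_SPLITS` set are hypothetical:

```python
KNOWN_SPLITS = {"train", "validation", "test"}


def filter_requested_splits(url_or_urls, requested_splits):
    """Keep non-split keys (e.g. metadata) untouched; drop only known split
    keys that were not requested."""
    if not isinstance(url_or_urls, dict) or not requested_splits:
        return url_or_urls
    return {
        key: value
        for key, value in url_or_urls.items()
        if key not in KNOWN_SPLITS or key in requested_splits
    }


urls = {
    "main_metadata": "https://example.com/main.json",
    "secondary_metadata": "https://example.com/sec.json",
    "train": "https://example.com/train.jsonl.gz",
    "test": "https://example.com/test.jsonl.gz",
}
# Requesting only "train" keeps both metadata entries and drops only "test".
print(sorted(filter_requested_splits(urls, ["train"])))
```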
Thanks @lhoestq, I understand it now.
See my comment below.
    extracted_paths.data
    if not isinstance(extracted_paths.data, dict)
    else defaultdict(str, extracted_paths.data)
)
Why do you need this?
Because the user implementation of the method _split_generators will access all possible keys in the extracted-paths dict, and will most likely access them through __getitem__. If our downloaded paths do not contain all possible splits (but only a subset), accessing the missing ones through __getitem__ will raise a KeyError exception.
If a defaultdict is returned instead, trying to access a missing (not downloaded) split will be silently ignored.
Makes sense! Maybe use a more explicit placeholder than an empty string, in case users experience errors using this?
I've set "<NOT_DOWNLOADED>" instead.
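The behavior under discussion can be sketched as follows — a defaultdict whose factory returns an explicit placeholder keeps a split-unaware _split_generators from crashing on a missing key, while making accidental use of the bogus path easy to diagnose (paths are illustrative):

```python
from collections import defaultdict

downloaded = {"train": "/cache/train.jsonl"}  # only "train" was requested

# Plain dict: a split-unaware dataset script crashes on the missing key.
try:
    downloaded["test"]
except KeyError:
    pass  # KeyError: 'test'

# defaultdict with an explicit placeholder: the access succeeds, and the
# placeholder string is self-describing if it ever surfaces in an error.
paths = defaultdict(lambda: "<NOT_DOWNLOADED>", downloaded)
print(paths["train"])  # /cache/train.jsonl
print(paths["test"])   # <NOT_DOWNLOADED>
```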
src/datasets/utils/file_utils.py
force_download: Optional[bool] = None  # default False
_is_force_download_default: bool = False
Why do you need this?
This has to do with the current over-use of DownloadConfig for many different tasks/concerns...
Note that DownloadConfig is passed to: prepare_module, cached_path, download_and_prepare, and DownloadManager (which uses it for download_custom, download and extract)...
Anyway, in download_and_prepare, if download_config is None, then a new instance is built AND some of its attributes are set to specific non-default values (for example, force_download=bool(download_mode == GenerateMode.FORCE_REDOWNLOAD)). What if download_config is not None (as in the use case I am implementing, where an already created instance of DownloadConfig is passed)? Then, taking the attribute force_download as an example, I have to consider 2 possibilities:
- either the user instantiates DownloadConfig with an explicit value for force_download: then I want to preserve the value passed by the user
- or the user instantiates DownloadConfig without an explicit value for force_download (so its default value is set): here I wanted to preserve the current behavior of the function, that is, override the non-explicitly-set value with bool(download_mode == GenerateMode.FORCE_REDOWNLOAD). This way, people passing DownloadConfig(splits=["train"]) will not get any other side effect from the parameter force_download.
I just assumed that there is a reason for having some default values in DownloadConfig that differ from the ones set by default by download_and_prepare. Another option could be to change the current default values in DownloadConfig, but this might raise backward-compatibility issues and impact the current user experience.
I see!
What if the value of force_download is updated during download_and_prepare? In this case _is_force_download_default would still be True even though it's not the default value None anymore.
I find the name of the attribute confusing in this case. Maybe rename it _is_force_download_specified_by_user?
I have renamed the attributes, e.g. _is_force_download_set_by_user.
num_proc: Optional[int] = None
max_retries: int = 1
use_auth_token: Optional[Union[str, bool]] = None
splits: Optional[list] = None
The DownloadConfig class was used until now as the class that defines the parameters we pass to the cached_path function, and it had no logic related to the DownloadManager or to datasets specifically.
Therefore I'm not sure this is the best place to put this argument. Let me know what you think.
Maybe this could be an argument of the DownloadManager itself.
I don't have a clear opinion on this yet (I'm going to think about it), but I can tell you that, for readability, a data class called DownloadConfig should clearly be used by a DownloadManager class.
In my opinion, it is quite sensible to include a download configuration setting (whether to download all or only some of the files) in a DownloadConfig object. And in order to avoid passing lots of parameters to load_dataset, it makes sense to pass the parameters related to download stuff in a DownloadConfig class.
As a side note, I am also planning to refactor cached_path, because it contains too many different coupled functionalities: download, extract, load from cache... For the moment, I have extracted all the Extract functionalities, but I am planning further refactoring...
The separation between the two configs makes sense to me, and I totally agree with you on the naming.
Maybe we could have DownloadConfig as a subclass of CachedPathConfig or something like that (or another relation between the two).
Regarding cached_path, note that this is a function we have in common with the other libraries (transformers and huggingface_hub), so it could be worth discussing the changes we want to make with the other maintainers.
assert (dataset["train"].dataset_size < max_in_memory_dataset_size) is expected_in_memory


class TestLoadDatasetOnlySplits:
We should also check this scenario:
- the user loads a dataset with the "train" split only
- the user reloads this dataset, but this time asking for the "test" split only
The caching mechanism should notice that the "test" split is missing, and download and prepare it. If I'm not wrong, currently this would fail because of these lines:
datasets/src/datasets/builder.py
Lines 504 to 511 in 097129d
data_exists = os.path.exists(self._cache_dir)
if data_exists and download_mode == GenerateMode.REUSE_DATASET_IF_EXISTS:
    logger.warning("Reusing dataset %s (%s)", self.name, self._cache_dir)
    # We need to update the info in case some splits were added in the meantime
    # for example when calling load_dataset from multiple workers.
    self.info = self._load_info()
    self.download_post_processing_resources(dl_manager)
    return
Instead of checking only that cache_dir exists, we should check that the actual arrow file of the requested split exists inside cache_dir.
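That suggested check could be sketched as follows — instead of testing only that the cache directory exists, test that each requested split's arrow file is present. The helper name and the `{name}-{split}.arrow` file-naming scheme are illustrative assumptions:

```python
import os


def requested_splits_cached(cache_dir, dataset_name, requested_splits):
    """Return True only if every requested split's arrow file is in the cache."""
    if not os.path.exists(cache_dir):
        return False
    return all(
        os.path.exists(os.path.join(cache_dir, f"{dataset_name}-{split}.arrow"))
        for split in requested_splits
    )
```

With this, reloading with a split that was never prepared would fall through to download-and-prepare instead of silently reusing an incomplete cache.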
The current implementation of load_dataset assumes that if there is something in the cache, then it is all you can get, unless you pass force_download.
I assumed that if the user passes a parameter to download only one split, then they are aware that only that split is in the cache and only that split can be loaded. In order to force a subsequent download of another split, they should pass force_download.
Not a big fan of that. In my opinion, it should detect that the requested split is missing and generate it.
This will avoid some confusion for users.
This is also a breaking change we can't afford IMO
Hi @lhoestq, I have been thinking about this and I think it is worth discussing. When I created this PR, my first idea was to create a "hack" inside the download manager that would be able to filter some split(s) without touching any dataset script. Of course, the download manager does not know about splits logic, and thus this trick would only work for some very specific datasets: only the ones that pass a dict to the download manager containing only the keys "train", "validation", "test" (or the ones passed by the user, for advanced users who know they can do it), e.g. natural_questions. The big inconvenience of this approach is that it is not applicable to many datasets (or worse, it would have to be constantly tweaked to cope with exceptional cases). One exceptional case is the one you pointed out, but I see others.
Then, after discussing, another idea emerged: pass a splits parameter to _split_generators. My point is that I don't think it is a good idea to implement both approaches. They could even interfere with each other! If you agree, I would implement ONLY the second one, which is simpler, more consistent and stable, and will avoid future problems.
Hi @albertvillanova!
Allow downloading/processing/caching only specific splits without downloading/processing/caching the other splits.
This PR implements two steps to handle only specific splits:
- download only the data files of the requested splits (when possible)
- process and cache only the requested splits (via _split_generators)

This PR makes several assumptions:
- DownloadConfig contains the configuration settings for downloading
- the split passed to load_dataset is just a parameter for loading (from cache), not for downloading