
Conversation


@albertvillanova albertvillanova commented Apr 22, 2021

Allow downloading/processing/caching only specific splits without downloading/processing/caching the other splits.

This PR implements two steps to handle only specific splits:

  • it allows processing/caching only specific splits into Arrow files
  • for some simple cases, it allows downloading only specific splits (which is more intricate as it depends on the user-defined method _split_generators)

This PR makes several assumptions:

  • DownloadConfig contains the configuration settings for downloading
  • the parameter split passed to load_dataset is just a parameter for loading (from cache), not for downloading

@albertvillanova albertvillanova added the enhancement New feature or request label Apr 22, 2021
@albertvillanova albertvillanova added this to the 1.7 milestone Apr 22, 2021
@albertvillanova albertvillanova changed the title Allow loading only specific splits Allow downloading/processing/caching only specific splits Apr 23, 2021
@albertvillanova albertvillanova changed the title Allow downloading/processing/caching only specific splits Allow processing/caching only specific splits Apr 23, 2021
@albertvillanova albertvillanova changed the title Allow processing/caching only specific splits Allow downloading/processing/caching only specific splits Apr 23, 2021
Comment on lines 186 to 190
if self._download_config.only_splits:
if isinstance(url_or_urls, dict) and all(
split in url_or_urls for split in self._download_config.only_splits
):
url_or_urls = {split: url_or_urls[split] for split in self._download_config.only_splits}
Member

That's a great trick! However, as you know, this won't work in all cases, unfortunately.
For example, some datasets use a dictionary with keys that are not related to the splits, since we allow any nested structure to be passed for downloading.

What do you think about adding an argument to _split_generators to tell it to return only a subset of the split generators? In that case it would download only the files related to the requested splits.

Then we can just edit the datasets that can benefit the most from this feature.
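A minimal sketch of what such an opt-in parameter could look like (the `splits` argument, the URL scheme, and the returned structure are assumptions for illustration, not the final API):

```python
from typing import List, Optional

# Hypothetical sketch of a _split_generators that accepts an optional
# `splits` argument; names and the URL scheme are illustrative only.
ALL_SPLITS = ["train", "validation", "test"]

def _split_generators(dl_manager, splits: Optional[List[str]] = None):
    requested = splits if splits is not None else ALL_SPLITS
    # Only the URLs of the requested splits reach the download manager,
    # so the other files are never downloaded.
    urls = {s: f"https://example.com/{s}.jsonl.gz" for s in requested}
    # In a real builder: paths = dl_manager.download_and_extract(urls)
    return [{"name": s, "url": urls[s]} for s in requested]

gens = _split_generators(dl_manager=None, splits=["train"])
```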

Member Author
@albertvillanova Apr 26, 2021

The problem is that _split_generators is already implemented in ALL canonical datasets without that parameter...

Member

Maybe we can pass the parameter only if it's part of the signature of _split_generators?
We're already doing that for other optional parameters, like pipeline in BeamBasedBuilder._split_generators:

split_generators_arg_names = inspect.signature(self._split_generators).parameters.keys()
if "pipeline" in split_generators_arg_names:
split_generators_kwargs["pipeline"] = prepare_split_kwargs["pipeline"]

I'm fine with this option but let me know if you have other ideas
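Applied to the proposed splits parameter, the same signature-inspection pattern could look like this standalone sketch (the helper name and kwargs are assumptions, not code from the PR):

```python
import inspect

# Hypothetical builder method used only to demo the signature check.
def _split_generators(dl_manager, splits=None):
    return {"dl_manager": dl_manager, "splits": splits}

def call_split_generators(split_generators_fn, dl_manager, requested_splits):
    # Pass `splits` only when the method declares it, so legacy
    # implementations without the parameter keep working unchanged.
    kwargs = {}
    arg_names = inspect.signature(split_generators_fn).parameters.keys()
    if "splits" in arg_names:
        kwargs["splits"] = requested_splits
    return split_generators_fn(dl_manager, **kwargs)

result = call_split_generators(_split_generators, dl_manager=None, requested_splits=["train"])
```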

Member Author
@albertvillanova Apr 28, 2021


Now I pass the parameter splits to _split_generators. However, it must be handled properly by the method implementations.

Note that with my previous trick in download, it already works for natural_questions without even touching its current implementation of _split_generators. 😉

That is because in natural_questions:

_DOWNLOAD_URLS = {
    "train": ["%s/train/nq-train-%02d.jsonl.gz" % (_BASE_DOWNLOAD_URL, i) for i in range(50)],
    "validation": ["%s/dev/nq-dev-%02d.jsonl.gz" % (_BASE_DOWNLOAD_URL, i) for i in range(5)],
}
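Since both top-level keys are split names, the filtering trick reduces the dict before any download happens. A standalone sketch of that behavior (the helper name is an assumption; URLs are shortened for illustration):

```python
# natural_questions-style URL dict: top-level keys are split names.
download_urls = {
    "train": ["train/nq-train-%02d.jsonl.gz" % i for i in range(50)],
    "validation": ["dev/nq-dev-%02d.jsonl.gz" % i for i in range(5)],
}

def filter_splits(url_or_urls, only_splits):
    # Keep only the requested splits when every requested split is a key;
    # otherwise leave the structure untouched.
    if only_splits and isinstance(url_or_urls, dict) and all(
        s in url_or_urls for s in only_splits
    ):
        return {s: url_or_urls[s] for s in only_splits}
    return url_or_urls

filtered = filter_splits(download_urls, ["train"])
```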

Member
@lhoestq left a comment


Love it! Thank you :)

I added some comments:

if self._download_config.splits:
if isinstance(url_or_urls, dict) and all(split in url_or_urls for split in self._download_config.splits):
Member

I would add an additional check just to avoid unwanted behaviors.
For example right now if a dataset script passes to the dl_manager a dict like this:

{"main_data": url_to_main_data, "secondary_data": url_to_sec_data}

then this trick here would use url_or_urls = {} since the keys of the dict are not split names.

Maybe you could check that sorted(url_or_urls.keys()) == sorted(self._download_config.splits) before filtering ?

Member
@lhoestq May 3, 2021


Edit:

I meant a dict like this

{"main_metadata": url_to_main_data, "secondary_metadata": url_to_sec_data, "train": url_train_data, "test": url_test_data}

Member Author
@albertvillanova Jun 23, 2021


Hi @lhoestq. Sorry, I'm not sure I understand what you mean... 😅

What I am checking here is that the keys in self._download_config.splits are a subset of the keys in url_or_urls.

Member
@lhoestq Jun 23, 2021


If you pass a dictionary like this:

{"main_metadata": url_to_main_data,
"secondary_metadata": url_to_sec_data,
"train": url_train_data,
"test": url_test_data}

then only the train or test keys will be kept, which I find unintuitive.

For example, if the user asks to load the "train" split, then the main and secondary metadata won't be downloaded.
You can fix that by keeping all the keys except the splits to ignore.
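One way to express that idea is a sketch like the following (the helper name and the set of known split names are assumptions for illustration):

```python
KNOWN_SPLITS = {"train", "validation", "test"}

def filter_urls(url_or_urls, requested_splits):
    """Drop only the split keys that were NOT requested; keep every
    non-split key (e.g. metadata URLs) untouched."""
    if not (requested_splits and isinstance(url_or_urls, dict)):
        return url_or_urls
    ignored = KNOWN_SPLITS - set(requested_splits)
    return {k: v for k, v in url_or_urls.items() if k not in ignored}

urls = {
    "main_metadata": "url_to_main_data",
    "secondary_metadata": "url_to_sec_data",
    "train": "url_train_data",
    "test": "url_test_data",
}
filtered = filter_urls(urls, ["train"])
```

With this inversion, asking for "train" keeps both metadata entries and drops only the unrequested "test" split.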

Member Author

Thanks @lhoestq, I understand it now.

Member Author

See my comment below.

extracted_paths.data
if not isinstance(extracted_paths.data, dict)
else defaultdict(str, extracted_paths.data)
)
Member

Why do you need this ?

Member Author
@albertvillanova May 3, 2021


Because the user implementation of the _split_generators method may access any of the possible keys of the extracted paths dict, most likely through __getitem__. If the downloaded paths don't contain all the possible splits (but only a subset), accessing a missing one through __getitem__ raises a KeyError.

If a defaultdict is returned instead, accessing a missing (not downloaded) split is silently ignored.
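A quick standalone illustration of the difference (the paths are made up for the example):

```python
from collections import defaultdict

# Only the "train" split was downloaded.
extracted = {"train": "/cache/train.arrow"}

# A plain dict raises on missing splits:
try:
    extracted["validation"]
    raised = False
except KeyError:
    raised = True  # this is what would break user _split_generators code

# A defaultdict(str) returns a falsy placeholder instead of raising:
paths = defaultdict(str, extracted)
missing = paths["validation"]  # empty string, no exception
```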

Member

Makes sense! Maybe use a more explicit placeholder than an empty string, in case users experience errors using this?

Member Author

I've set "<NOT_DOWNLOADED>" instead.

Comment on lines 232 to 233
force_download: Optional[bool] = None # default False
_is_force_download_default: bool = False
Member

Why do you need this ?

Member Author
@albertvillanova May 3, 2021


This has to do with the current over-use of DownloadConfig for many different tasks/concerns...

Note that DownloadConfig is passed to: prepare_module, cached_path, download_and_prepare, DownloadManager (which uses it for download_custom, download and extract)...

Anyway, in download_and_prepare, if download_config is None, a new instance is built AND some of its attributes are set to specific non-default values (for example, force_download=bool(download_mode == GenerateMode.FORCE_REDOWNLOAD)). What if download_config is not None (as in the use case I am implementing, where an already created DownloadConfig instance is passed)? Then, for the force_download attribute, I have to consider 2 possibilities:

  • either the user instantiates DownloadConfig with an explicit value for force_download: then I want to preserve the value passed by the user
  • or the user instantiates DownloadConfig without an explicit value for force_download (so its default value is used): here I wanted to preserve the current behavior of the function, that is, to override the non-explicitly set value with bool(download_mode == GenerateMode.FORCE_REDOWNLOAD). This way, people passing DownloadConfig(splits=["train"]) will not get any other side effect from the force_download parameter.

I just assumed that there is a reason for having some default values in DownloadConfig that are different from the ones set by default by download_and_prepare. Another option would be to change the current default values in DownloadConfig, but this may raise backward-compatibility issues and impact the current user experience.
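The mechanism can be sketched with a small dataclass; the class name and the __post_init__ logic below are assumed for illustration, mirroring the attributes in the diff:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DownloadConfigSketch:
    # None means "not explicitly set by the user"; the companion flag
    # records that fact so download_and_prepare knows it may override it.
    force_download: Optional[bool] = None
    _is_force_download_default: bool = field(init=False, default=False)

    def __post_init__(self):
        if self.force_download is None:
            self.force_download = False  # fall back to the default
            self._is_force_download_default = True

explicit = DownloadConfigSketch(force_download=True)  # user-set value
implicit = DownloadConfigSketch()                     # default applied
```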

Member
@lhoestq May 10, 2021


I see!
What if the value of force_download is updated during download_and_prepare? In that case, _is_force_download_default would still be True even though the value is no longer the default None.
I find the name of the attribute confusing in this case. Maybe rename it _is_force_download_specified_by_user?

Member Author

I have renamed the attributes, e.g. _is_force_download_set_by_user.

num_proc: Optional[int] = None
max_retries: int = 1
use_auth_token: Optional[Union[str, bool]] = None
splits: Optional[list] = None
Member

The DownloadConfig class was used until now as the class that defines the parameters we pass to the cached_path function, and it had no logic related to the DownloadManager or to datasets specifically.
Therefore I'm not sure this is the best place to put this argument. Let me know what you think.

Maybe this could be an argument of the DownloadManager itself.

Member Author
@albertvillanova May 3, 2021


I don't have a clear opinion on this yet (I'm going to think about it), but I can say that, for readability, a data class called DownloadConfig should clearly be used by a DownloadManager class.

In my opinion, it is quite sensible to include a download configuration setting (whether to download all the files or only some of them) in a DownloadConfig object. And in order to avoid passing lots of parameters to load_dataset, it makes sense to group the download-related parameters in a DownloadConfig class.

As a side note, I am also planning to refactor cached_path because it couples too many different functionalities: download, extract, load from cache... For the moment, I have extracted all the Extract functionalities, but I am planning further refactoring...

Member

The separation between the two configs makes sense to me, and I totally agree with you with the naming.
Maybe we could have DownloadConfig as a subclass of CachedPathConfig or something like that (or another relation between the two).

Regarding cached_path, note that this is a function that we have in common with the other libraries (transformers and huggingface_hub) so it could be worth discussing with the other maintainers about the changes we want to do.

assert (dataset["train"].dataset_size < max_in_memory_dataset_size) is expected_in_memory


class TestLoadDatasetOnlySplits:
Member

We should also check this scenario:

  1. the user loads a dataset with the "train" split only
  2. the user reloads this dataset, but this time asking for the "test" split only

The caching mechanism should notice that the "test" split is missing and download and prepare it. If I'm not wrong, this currently would fail because of these lines:

data_exists = os.path.exists(self._cache_dir)
if data_exists and download_mode == GenerateMode.REUSE_DATASET_IF_EXISTS:
logger.warning("Reusing dataset %s (%s)", self.name, self._cache_dir)
# We need to update the info in case some splits were added in the meantime
# for example when calling load_dataset from multiple workers.
self.info = self._load_info()
self.download_post_processing_resources(dl_manager)
return

Instead of checking for the cache_dir, we should check for the actual arrow split file to exist inside cache_dir
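A sketch of that finer-grained check; the helper name and the '<dataset_name>-<split>.arrow' file naming scheme here are assumptions for illustration, not the library's actual cache layout:

```python
import os
import tempfile

def missing_splits(cache_dir, dataset_name, requested_splits):
    # Report the requested splits whose arrow file is not yet in cache_dir.
    # Assumed naming scheme '<dataset_name>-<split>.arrow' (illustrative).
    return [
        split
        for split in requested_splits
        if not os.path.exists(os.path.join(cache_dir, f"{dataset_name}-{split}.arrow"))
    ]

# Demo: only the "train" arrow file is present in the cache.
with tempfile.TemporaryDirectory() as cache_dir:
    open(os.path.join(cache_dir, "squad-train.arrow"), "w").close()
    to_generate = missing_splits(cache_dir, "squad", ["train", "test"])
```

download_and_prepare could then regenerate only the splits returned by such a check instead of returning early whenever the cache directory exists.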

Member Author

The current implementation of load_dataset assumes that if there is something in the cache, then it is all you can get, unless you pass force_download.

I assumed that if the user passes a parameter to download only one split, then they are aware that only that split is in the cache and only that split can be loaded. In order to force a subsequent download of another split, they should pass force_download.

Member

Not a big fan of that. In my opinion, it should detect that the requested split is missing and generate it.
This will avoid some confusion for users.

Member

This is also a breaking change we can't afford IMO

@albertvillanova albertvillanova modified the milestones: 1.7, 1.8 May 31, 2021
@albertvillanova albertvillanova modified the milestones: 1.8, 1.9 Jun 8, 2021
@albertvillanova
Member Author

If you pass a dictionary like this:

{"main_metadata": url_to_main_data,
"secondary_metadata": url_to_sec_data,
"train": url_train_data,
"test": url_test_data}

then only the train or test keys will be kept, which I find unintuitive.

For example, if the user asks to load the "train" split, then the main and secondary metadata won't be downloaded.
You can fix that by keeping all the keys except the splits to ignore.

Hi @lhoestq, I have been thinking about this and I think it is worth that we discuss about it.

When I created this PR, my first idea was to create a "hack" inside the download manager that would be able to filter out some split(s) without touching any dataset script. Of course, the download manager does not know about the splits logic, and thus this trick only works for some very specific datasets: the ones that pass a dict to the download manager containing only the keys "train", "validation", "test" (or the ones passed by the user, for advanced users who know they can do it), e.g. the natural_questions dataset (which was one of the targets).

The big drawback of this approach is that it is not applicable to many datasets (or worse, it would have to be constantly tweaked to cope with exceptional cases). One exceptional case is the one you pointed out, but I see others:

  • the split keys can be different: train, test, dev, val, validation, eval, ...
  • in the hope_edi dataset, the split keys are: TRAIN_DOWNLOAD_URL, VALIDATION_DOWNLOAD_URL
  • in the few_rel dataset, the split keys are: train_wiki, val_nyt, val_pubmed, ..., pid2name
  • in curiosity_dialogs, the split keys are: train, val, test, test_zero; this means that for every split we pass, we will also get test_zero
  • in deal_or_no_dialog, each split URL is passed separately to the download manager, so all splits would always be downloaded
  • etc.

Then, after discussing, another idea emerged: pass a splits parameter to _split_generators, which knows about the splits logic, so that it can control which splits are passed to the download manager. This approach is more accurate and can be tweaked so that it works with all the datasets we want. The only drawback is that, for every target dataset, we must modify its _split_generators method.

My point is that I don't think it is a good idea to implement both approaches. They could even interfere with each other!

If you agree, I would implement ONLY the second one, which is simpler, more consistent and stable, and will avoid future problems.

@lhoestq
Member

lhoestq commented Jun 25, 2021

Hi @albertvillanova !
Yup I agree with you, implementing the 2nd approach seems to be the right solution
