Allow downloading/processing/caching only specific splits #2249
Conversation
if self._download_config.only_splits:
    if isinstance(url_or_urls, dict) and all(
        split in url_or_urls for split in self._download_config.only_splits
    ):
        url_or_urls = {split: url_or_urls[split] for split in self._download_config.only_splits}
That's a great trick! However, as you know, this won't work in all cases, unfortunately.
For example, some datasets use a dictionary with keys that are not related to the splits, since we allow any nested structure to be passed for downloading.
What do you think about adding an argument to _split_generators to tell it to return only a subset of the split generators? In this case it would download only the files related to the requested splits.
Then we can just edit the datasets that can benefit from this feature the most.
The problem is that _split_generators is already implemented in ALL canonical datasets without that parameter...
Maybe we can pass the parameter only if it's part of the signature of _split_generators?
We're already doing that for other optional parameters, like pipeline in BeamBasedBuilder._split_generators:
datasets/src/datasets/builder.py
Lines 1065 to 1067 in 909c58f
split_generators_arg_names = inspect.signature(self._split_generators).parameters.keys()
if "pipeline" in split_generators_arg_names:
    split_generators_kwargs["pipeline"] = prepare_split_kwargs["pipeline"]
I'm fine with this option but let me know if you have other ideas
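The same signature-inspection pattern could carry the proposed splits parameter. A minimal sketch — the helper name `_make_split_generators_kwargs` and the toy `_split_generators` body are illustrative, not the PR's actual code:

```python
import inspect


class DatasetBuilder:
    def _make_split_generators_kwargs(self, requested_splits):
        """Pass `splits` to _split_generators only if its signature accepts it,
        so existing dataset scripts keep working unchanged."""
        kwargs = {}
        arg_names = inspect.signature(self._split_generators).parameters.keys()
        if "splits" in arg_names:
            kwargs["splits"] = requested_splits
        return kwargs

    def _split_generators(self, dl_manager, splits=None):
        # A dataset script that opts in can restrict its work to `splits`.
        all_splits = ["train", "validation", "test"]
        return [s for s in all_splits if splits is None or s in splits]


builder = DatasetBuilder()
kwargs = builder._make_split_generators_kwargs(["train"])
print(builder._split_generators(None, **kwargs))  # ['train']
```

Scripts whose `_split_generators` lacks the parameter simply get an empty kwargs dict and behave as before.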
Now I pass the parameter splits to _split_generators. However, it must be handled properly by the method implementations.
Note that with my previous trick in download, it already works for natural_questions without even touching its current implementation of _split_generators. 😉
That is because in natural_questions:
_DOWNLOAD_URLS = {
"train": ["%s/train/nq-train-%02d.jsonl.gz" % (_BASE_DOWNLOAD_URL, i) for i in range(50)],
"validation": ["%s/dev/nq-dev-%02d.jsonl.gz" % (_BASE_DOWNLOAD_URL, i) for i in range(5)],
}
Love it! Thank you :)
I added some comments:
        url_or_urls.
        """
        if self._download_config.splits:
            if isinstance(url_or_urls, dict) and all(split in url_or_urls for split in self._download_config.splits):
I would add an additional check just to avoid unwanted behaviors.
For example, right now if a dataset script passes the dl_manager a dict like this:
{"main_data": url_to_main_data, "secondary_data": url_to_sec_data}
then this trick here would use url_or_urls = {}, since the keys of the dict are not split names.
Maybe you could check that sorted(url_or_urls.keys()) == sorted(self._download_config.splits) before filtering?
Edit: I meant a dict like this:
{"main_metadata": url_to_main_data, "secondary_metadata": url_to_sec_data, "train": url_train_data, "test": url_test_data}
Hi @lhoestq. Sorry, I'm not sure I understand what you mean... 😅
What I am checking here is that all the keys in self._download_config.splits are a subset of the keys in url_or_urls.
If you pass a dictionary like this:

{"main_metadata": url_to_main_data,
 "secondary_metadata": url_to_sec_data,
 "train": url_train_data,
 "test": url_test_data}

then only the train or test keys will be kept, which I feel is not intuitive.
For example, if the user asks to load the "train" split, then the main and secondary metadata won't be downloaded.
You can fix that by keeping all the keys except the splits to ignore.
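That suggestion — drop only the keys that are known split names but were not requested, and keep everything else (metadata, etc.) — could look like this sketch; the helper name `filter_requested_splits` and the `KNOWN_SPLITS` set are hypothetical:

```python
KNOWN_SPLITS = {"train", "validation", "test"}


def filter_requested_splits(url_or_urls, requested_splits):
    """Keep non-split keys (e.g. metadata) untouched; drop only known split
    keys that were not requested."""
    if not isinstance(url_or_urls, dict) or not requested_splits:
        return url_or_urls
    return {
        key: value
        for key, value in url_or_urls.items()
        if key not in KNOWN_SPLITS or key in requested_splits
    }


urls = {
    "main_metadata": "https://example.com/main.json",
    "secondary_metadata": "https://example.com/sec.json",
    "train": "https://example.com/train.jsonl.gz",
    "test": "https://example.com/test.jsonl.gz",
}
# Requesting only "train" keeps both metadata entries and drops only "test".
print(sorted(filter_requested_splits(urls, ["train"])))
```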
Thanks @lhoestq, I understand it now.
See my comment below.
    extracted_paths.data
    if not isinstance(extracted_paths.data, dict)
    else defaultdict(str, extracted_paths.data)
)
Why do you need this?
Because the user implementation of the method _split_generators will access all possible keys in the extracted-paths dict, and will most likely access them through __getitem__. If our downloaded paths do not contain all possible splits (but only a subset), accessing the missing ones through __getitem__ will raise a KeyError exception.
If a defaultdict is returned instead, trying to access a missing (not downloaded) split will be silently ignored.
Makes sense! Maybe use a more explicit placeholder than an empty string, in case users experience errors using this?
I've set "<NOT_DOWNLOADED>" instead.
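The behavior under discussion can be sketched as follows — a defaultdict whose factory returns an explicit placeholder keeps a split-unaware _split_generators from crashing on a missing key, while making accidental use of the bogus path easy to diagnose (paths are illustrative):

```python
from collections import defaultdict

downloaded = {"train": "/cache/train.jsonl"}  # only "train" was requested

# Plain dict: a split-unaware dataset script crashes on the missing key.
try:
    downloaded["test"]
except KeyError:
    pass  # KeyError: 'test'

# defaultdict with an explicit placeholder: the access succeeds, and the
# placeholder string is self-describing if it ever surfaces in an error.
paths = defaultdict(lambda: "<NOT_DOWNLOADED>", downloaded)
print(paths["train"])  # /cache/train.jsonl
print(paths["test"])   # <NOT_DOWNLOADED>
```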
src/datasets/utils/file_utils.py
force_download: Optional[bool] = None  # default False
_is_force_download_default: bool = False
Why do you need this?
This has to do with the current over-use of DownloadConfig for many different tasks/concerns...
Note that DownloadConfig is passed to: prepare_module, cached_path, download_and_prepare, and DownloadManager (which uses it for download_custom, download and extract)...
Anyway, in download_and_prepare, if download_config is None, then a new instance is built AND some of its attributes are set to specific non-default values (for example, force_download=bool(download_mode == GenerateMode.FORCE_REDOWNLOAD)). What if download_config is not None (as in the use case I am implementing, where an already created instance of DownloadConfig is passed)? Then, taking the attribute force_download as an example, I have to consider 2 possibilities:
- either the user instantiates DownloadConfig with an explicit value for force_download: then I want to preserve the value passed by the user
- or the user instantiates DownloadConfig without an explicit value for force_download (so its default value is set): here I wanted to preserve the current behavior of the function, that is, override the non-explicitly-set value with bool(download_mode == GenerateMode.FORCE_REDOWNLOAD). This way, people passing DownloadConfig(splits=["train"]) will not get any other side effect from the parameter force_download.
I just assumed that there is a reason for having some default values in DownloadConfig that differ from the ones set by default by download_and_prepare. Another option could be to change the current default values in DownloadConfig, but this might raise backward-compatibility issues and impact the current user experience.
I see!
What if the value of force_download is updated during download_and_prepare? In this case _is_force_download_default would still be True even though it's not the default value None anymore.
I find the name of the attribute confusing in this case. Maybe rename it _is_force_download_specified_by_user?
I have renamed the attributes, e.g. _is_force_download_set_by_user.
num_proc: Optional[int] = None
max_retries: int = 1
use_auth_token: Optional[Union[str, bool]] = None
splits: Optional[list] = None
The DownloadConfig class was used until now as the class that defines the parameters we pass to the cached_path function, and it had no logic related to the DownloadManager or to datasets specifically.
Therefore I'm not sure this is the best place to put this argument. Let me know what you think.
Maybe this could be an argument of the DownloadManager itself.
I don't have a clear opinion on this yet (I'm going to think about it), but I can tell you that, for readability, a data class called DownloadConfig should clearly be used by a DownloadManager class.
In my opinion, it is quite sensible to include a download configuration setting (whether to download all or only some of the files) in a DownloadConfig object. And in order to avoid passing lots of parameters to load_dataset, it makes sense to pass the parameters related to download stuff in a DownloadConfig class.
As a side note, I am also planning to refactor cached_path, because it contains too many different coupled functionalities: download, extract, load from cache... For the moment, I have extracted all the Extract functionalities, but I am planning further refactoring...
The separation between the two configs makes sense to me, and I totally agree with you on the naming.
Maybe we could have DownloadConfig as a subclass of CachedPathConfig or something like that (or another relation between the two).
Regarding cached_path, note that this is a function we have in common with the other libraries (transformers and huggingface_hub), so it could be worth discussing the changes we want to make with the other maintainers.
assert (dataset["train"].dataset_size < max_in_memory_dataset_size) is expected_in_memory


class TestLoadDatasetOnlySplits:
We should also check this scenario:
- the user loads a dataset with the "train" split only
- the user reloads this dataset, but this time asking for the "test" split only
The caching mechanism should notice that the "test" split is missing, and download and prepare it. If I'm not wrong, currently this would fail because of these lines:
datasets/src/datasets/builder.py
Lines 504 to 511 in 097129d
data_exists = os.path.exists(self._cache_dir)
if data_exists and download_mode == GenerateMode.REUSE_DATASET_IF_EXISTS:
    logger.warning("Reusing dataset %s (%s)", self.name, self._cache_dir)
    # We need to update the info in case some splits were added in the meantime
    # for example when calling load_dataset from multiple workers.
    self.info = self._load_info()
    self.download_post_processing_resources(dl_manager)
    return
Instead of checking only that cache_dir exists, we should check that the actual arrow file of the requested split exists inside cache_dir.
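That suggested check could be sketched as follows — instead of testing only that the cache directory exists, test that each requested split's arrow file is present. The helper name and the `{name}-{split}.arrow` file-naming scheme are illustrative assumptions:

```python
import os


def requested_splits_cached(cache_dir, dataset_name, requested_splits):
    """Return True only if every requested split's arrow file is in the cache."""
    if not os.path.exists(cache_dir):
        return False
    return all(
        os.path.exists(os.path.join(cache_dir, f"{dataset_name}-{split}.arrow"))
        for split in requested_splits
    )
```

With this, reloading with a split that was never prepared would fall through to download-and-prepare instead of silently reusing an incomplete cache.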
The current implementation of load_dataset assumes that if there is something in the cache, then it is all you can get, unless you pass force_download.
I assumed that if the user passes a parameter to download only one split, then they are aware that only that split is in the cache and only that split can be loaded. In order to force a subsequent download of another split, they should pass force_download.
Not a big fan of that. In my opinion, it should detect that the requested split is missing and generate it.
This will avoid some confusion for users.
This is also a breaking change we can't afford IMO
Hi @lhoestq, I have been thinking about this and I think it is worth discussing. When I created this PR, my first idea was to create a "hack" inside the download manager that would be able to filter some split(s) without touching any dataset script. Of course, the download manager does not know about splits logic, and thus this trick would only work for some very specific datasets: only the ones that pass a dict to the download manager containing only the keys "train", "validation", "test" (or the ones passed by the user, for advanced users who know they can do it), e.g. natural_questions. The big inconvenience of this approach is that it is not applicable to many datasets (or worse, it would have to be constantly tweaked to cope with exceptional cases). One exceptional case is the one you pointed out, but I see others.
Then, after discussing, another idea emerged: pass a splits parameter to _split_generators. My point is that I don't think it is a good idea to implement both approaches. They could even interfere with each other! If you agree, I would implement ONLY the second one, which is simpler, more consistent and stable, and will avoid future problems.
Hi @albertvillanova!
Allow downloading/processing/caching only specific splits without downloading/processing/caching the other splits.
This PR implements two steps to handle only specific splits:
- download only the data files of the requested splits (when possible)
- process and cache only the requested splits (via _split_generators)

This PR makes several assumptions:
- DownloadConfig contains the configuration settings for downloading
- the split passed to load_dataset is just a parameter for loading (from cache), not for downloading