datasets/tests at e4dc7c74443e2322996292802bd4030eb20689f6 · huggingface/datasets

commands
distributed_scripts
features
fixtures
io
packaged_modules
README.md
__init__.py
_test_patching.py
conftest.py
test_arrow_dataset.py
test_arrow_reader.py
test_arrow_writer.py
test_beam.py
test_builder.py
test_data_files.py
test_dataset_dict.py
test_dataset_list.py
test_distributed.py
test_download_manager.py
test_experimental.py
test_extract.py
test_file_utils.py
test_filelock.py
test_filesystem.py
test_fingerprint.py
test_formatting.py
test_hf_gcp.py
test_hub.py
test_info.py
test_info_utils.py
test_inspect.py
test_iterable_dataset.py
test_load.py
test_metadata_util.py
test_metric.py
test_metric_common.py
test_offline_util.py
test_parallel.py
test_patching.py
test_py_utils.py
test_readme_util.py
test_search.py
test_sharding_utils.py
test_splits.py
test_streaming_download_manager.py
test_table.py
test_tasks.py
test_tqdm.py
test_upstream_hub.py
test_version.py
test_warnings.py
utils.py

README.md

Add Dummy data test

Important: To pass the load_dataset_<dataset_name> test, dummy data is required for every possible config name.

First we distinguish between dataset scripts that

A) have no config class and
B) have a config class

For A) the dummy data folder structure always looks as follows:

dummy/<version>/dummy_data.zip, e.g. cosmos_qa/dummy/0.1.0/dummy_data.zip.

For B) the dummy data folder structure always looks as follows:

dummy/<config_name>/<version>/dummy_data.zip, e.g. squad/dummy/plain-text/1.0.0/dummy_data.zip.
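
The two layouts can be sketched as a small path helper. This is only an illustration of the naming convention described above; the function name dummy_zip_path and its parameters are hypothetical and not part of the datasets library.

```python
from pathlib import PurePosixPath
from typing import Optional


def dummy_zip_path(dataset: str, version: str, config_name: Optional[str] = None) -> str:
    """Build the expected location of dummy_data.zip (hypothetical helper)."""
    parts = [dataset, "dummy"]
    if config_name is not None:
        # Case B: scripts with a config class get one folder per config name.
        parts.append(config_name)
    parts += [version, "dummy_data.zip"]
    return str(PurePosixPath(*parts))


# Case A: no config class.
print(dummy_zip_path("cosmos_qa", "0.1.0"))
# Case B: config class with the config name "plain-text".
print(dummy_zip_path("squad", "1.0.0", config_name="plain-text"))
```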

Now the difficult part is to create the correct dummy_data.zip file.

Important: When checking the dummy folder structure of already added datasets, always unzip dummy_data.zip. If a folder named dummy_data is found next to dummy_data.zip, it is probably an old version and should be deleted. The tests only take the dummy_data.zip file into account.
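
One way to follow this advice without unpacking anything by hand is to list the archive members and check for a leftover folder. The sketch below uses only the Python standard library; inspect_dummy_dir is a hypothetical helper, and the demo builds a throwaway directory with placeholder content rather than touching a real dataset.

```python
import tempfile
import zipfile
from pathlib import Path


def inspect_dummy_dir(version_dir):
    """Return (archive members, whether a stale dummy_data folder exists)."""
    version_dir = Path(version_dir)
    with zipfile.ZipFile(version_dir / "dummy_data.zip") as zf:
        members = sorted(zf.namelist())
    # A dummy_data folder next to the zip is ignored by the tests.
    stale = (version_dir / "dummy_data").is_dir()
    return members, stale


# Demo on a temporary directory with placeholder content.
with tempfile.TemporaryDirectory() as tmp:
    version_dir = Path(tmp) / "0.1.0"
    version_dir.mkdir()
    with zipfile.ZipFile(version_dir / "dummy_data.zip", "w") as zf:
        zf.writestr("dummy_data/train.csv", "a,b\n1,2\n")
    (version_dir / "dummy_data").mkdir()  # simulate an old unzipped copy
    members, stale = inspect_dummy_dir(version_dir)

print(members, stale)
```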

Here we have to pay close attention to the _split_generators(self, dl_manager) function of the dataset script in question. There are three general possibilities:

1) dl_manager.download_and_extract() is given a single path variable of type str as its argument. In this case dummy_data.zip should unzip to the following structure: os.path.join("dummy_data", <additional-paths-as-defined-in-split-generations>). For example, for sentiment140 the unzipped dummy_data.zip contains dummy_data/testdata.manual.2009.06.14.csv and dummy_data/training.1600000.processed.noemoticon.csv.

Note: if there are no <additional-paths-as-defined-in-split-generations>, then dummy_data should be the name of the single file. An example of this is the crime-and-punishment dataset script.

2) dl_manager.download_and_extract() is given a dictionary of paths of type str as its argument. In this case dummy_data.zip should unzip to the following structure: os.path.join("dummy_data", <value_of_dict>.split('/')[-1], <additional-paths-as-defined-in-split-generations>). For example, for squad the unzipped dummy_data.zip contains dummy_data/dev-v1.1.json, etc.

Note: if <value_of_dict> is a zipped file, the dummy data folder structure should contain a folder with the exact name of the zipped file, followed by the extracted folder structure. dummy_data.zip should never itself contain a zipped file, since the MockDownloadManager does not unzip the dummy data during testing. For example, check the dummy folder structure of hansards, where the folders have to be named *.tar, or of wiki_split, where the folders have to be named *.zip.

3) dl_manager.download_and_extract() is given a dictionary of lists of paths of type str as its argument. This is a very special case and has been seen only for the dataset esnli. In this case the values are simply flattened and the dummy folder structure is the same as in 2).
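
The naming rule across the three cases can be sketched as follows. This mirrors the description above and is not the MockDownloadManager's actual code; expected_dummy_entries is a hypothetical helper, the example.com URLs are placeholders, and any additional paths from the split generators would still be appended below each returned entry.

```python
from typing import Dict, List, Union

Urls = Union[str, Dict[str, str], Dict[str, List[str]]]


def expected_dummy_entries(urls: Urls) -> List[str]:
    """Top-level entries expected inside the unzipped dummy_data.zip."""
    if isinstance(urls, str):
        # Case 1: a single path, so everything sits directly under dummy_data.
        return ["dummy_data"]
    flat: List[str] = []
    for value in urls.values():
        # Case 3: lists of paths are flattened and then handled like case 2.
        flat.extend(value if isinstance(value, list) else [value])
    # Case 2: one entry per path, named after its last component.
    return ["dummy_data/" + url.split("/")[-1] for url in flat]


print(expected_dummy_entries("https://example.com/corpus.zip"))
print(expected_dummy_entries({"dev": "https://example.com/data/dev-v1.1.json"}))
print(expected_dummy_entries({"train": ["https://example.com/a/x.csv",
                                        "https://example.com/a/y.csv"]}))
```

If a value ends in an archive extension such as .zip or .tar, the returned entry is a folder carrying that exact name, not a nested archive.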
