Add cross-platform support for datasets-cli (#1951)
mariosasko authored Feb 26, 2021
1 parent cadd48e commit 987df6b
Showing 7 changed files with 44 additions and 29 deletions.
19 changes: 12 additions & 7 deletions ADD_NEW_DATASET.md
@@ -144,12 +144,12 @@ Sometimes you need to use several *configurations* and/or *splits* (usually at l
**Last step:** To check that your dataset works correctly and to create its `dataset_infos.json` file, run the command:

```bash
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```

**Note:** If your dataset requires manually downloading the data and having the user provide the path to the dataset, you can run the following command:
```bash
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs --data_dir your/manual/dir
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs --data_dir your/manual/dir
```
This makes the configs use the path from `--data_dir` when generating them.

@@ -164,19 +164,19 @@ Now that your dataset script runs and creates a dataset with the format you expec
If the extensions of the raw data files of your dataset are in this list, then you can automatically generate your dummy data with:

```bash
python datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
```

Example:

```bash
python datasets-cli dummy_data ./datasets/snli --auto_generate
datasets-cli dummy_data ./datasets/snli --auto_generate
```

If your data files are not in the supported format, you can run the same command without the `--auto_generate` flag. It should give you instructions on the files to manually create (basically, the same ones as for the real dataset but with only five items).

```bash
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
```

If this doesn't work, more information on how to add dummy data can be found in the documentation [here](https://huggingface.co/docs/datasets/share_dataset.html#adding-dummy-data).
@@ -208,14 +208,19 @@ Go to the next step (open a Pull Request) and we'll help you cross the finish li
3. If all tests pass, your dataset works correctly. You can finally create the metadata JSON by running the command:

```bash
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```

This first command should create a `dataset_infos.json` file in your dataset folder.


You have now finished the coding part, congratulations! 🎉 You are awesome! 😎

Note: You can use the CLI tool from the root of the repository with the following command:
```bash
python src/datasets/commands/datasets_cli.py <command>
```
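For illustration, the same commands should also be available directly through the `datasets-cli` console script that this commit registers as an entry point, once the package is installed (a minimal sketch, assuming an editable install such as `pip install -e .` from the repository root):

```bash
# Assumes `pip install -e .` has been run from the repository root
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```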

### Open a Pull Request on the main HuggingFace repo and share your work!!

Here are the steps to open the Pull Request on the main repo.
13 changes: 9 additions & 4 deletions CONTRIBUTING.md
@@ -70,20 +70,20 @@ A [more complete guide](https://github.com/huggingface/datasets/blob/master/ADD_
3. **Make sure you run all of the following commands from the root of your `datasets` git clone.** To check that your dataset works correctly and to create its `dataset_infos.json` file, run the command:

```bash
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```

4. If the command was successful, you should now create some dummy data. Use the following command to get in-detail instructions on how to create the dummy data:

```bash
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
```

There is a tool that automatically generates dummy data for you. At the moment it supports data files in the following formats: txt, csv, tsv, jsonl, json, xml.
If the extensions of the raw data files of your dataset are in this list, then you can automatically generate your dummy data with:

```bash
python datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
```

5. Now test that both the real data and the dummy data work correctly using the following commands:
@@ -110,7 +110,7 @@ Follow these steps in case the dummy data test keeps failing:

- Verify that all filenames are spelled correctly. Rerun the command
```bash
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
```
and make sure you follow the exact instructions provided by the command in step 5.

@@ -120,6 +120,11 @@ Follow these steps in case the dummy data test keeps failing:

If you're looking for more details about dataset scripts creation, please refer to the [documentation](https://huggingface.co/docs/datasets/add_dataset.html).

Note: You can use the CLI tool from the root of the repository with the following command:
```bash
python src/datasets/commands/datasets_cli.py <command>
```
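Because the CLI entry point now lives inside the package and exposes a `main()` function guarded by `if __name__ == "__main__"`, the module form should be an equivalent invocation (a sketch, assuming the `datasets` package is importable in the current environment, for example after an editable install):

```bash
# Assumes `datasets` is importable, e.g. after `pip install -e .`
python -m datasets.commands.datasets_cli test datasets/<your-dataset-folder> --save_infos --all_configs
```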

## How to contribute to the dataset cards

Improving the documentation of datasets is an ever increasing effort and we invite users to contribute by sharing their insights with the community in the `README.md` dataset cards provided for each dataset.
6 changes: 3 additions & 3 deletions convert_dataset.sh
@@ -30,7 +30,7 @@ if [ -f "${pathToFolder}/${datasetName}.py" ]; then
echo "### STEP 1 ### ${datasetName} is already converted. To convert it again remove ${pathToFolder}/${datasetName}."
else
echo "### STEP 1 ### Converting ${datasetName} dataset ..."
eval "python datasets-cli convert --tfds_path ${pathToFile} --datasets_directory datasets/"
eval "datasets-cli convert --tfds_path ${pathToFile} --datasets_directory datasets/"
fi

if [ -f "${pathToFolder}/${datasetName}.py" ]; then
@@ -51,9 +51,9 @@ if [ -f "${pathToFolder}/dataset_infos.json" ]; then
else
echo "### STEP 2 ### Create infos ..."
if [ -z "${manual_dir}" ]; then
eval "python datasets-cli test ${pathToFolder} --save_infos --all_configs"
eval "datasets-cli test ${pathToFolder} --save_infos --all_configs"
else
eval "python datasets-cli test ${pathToFolder} --data_dir ${manual_dir} --save_infos --all_configs"
eval "datasets-cli test ${pathToFolder} --data_dir ${manual_dir} --save_infos --all_configs"
fi
fi

2 changes: 1 addition & 1 deletion docs/source/beam_dataset.rst
@@ -46,7 +46,7 @@ If you want to run the Beam pipeline of a dataset anyway, here are the different

.. code::
python -mdatasets-cli run_beam datasets/$DATASET_NAME \
datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_infos \
--cache_dir gs://$BUCKET/cache/datasets \
14 changes: 7 additions & 7 deletions docs/source/share_dataset.rst
@@ -308,7 +308,7 @@ You can check that the new dataset loading script works correctly and create the

.. code-block::
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
If the command was successful, you should now have a ``dataset_infos.json`` file created in the folder of your dataset loading script. Here is a dummy example of the content for a dataset with a single configuration:

@@ -379,7 +379,7 @@ Now that we have the metadata prepared we can also create some dummy data for au

.. code-block::
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
This command will output instructions specifically tailored to your dataset and will look like:

@@ -408,15 +408,15 @@ If the extensions of the raw data files of your dataset are in this list, then y

.. code-block::
python datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
Examples:

.. code-block::
python datasets-cli dummy_data ./datasets/snli --auto_generate
python datasets-cli dummy_data ./datasets/squad --auto_generate --json_field data
python datasets-cli dummy_data ./datasets/iwslt2017 --auto_generate --xml_tag seg --match_text_files "train*" --n_lines 15
datasets-cli dummy_data ./datasets/snli --auto_generate
datasets-cli dummy_data ./datasets/squad --auto_generate --json_field data
datasets-cli dummy_data ./datasets/iwslt2017 --auto_generate --xml_tag seg --match_text_files "train*" --n_lines 15
# --xml_tag seg => each sample corresponds to a "seg" tag in the xml tree
# --match_text_files "train*" => also match text files that don't have a proper text file extension (no suffix like ".txt" for example)
# --n_lines 15 => some text files have headers so we have to use at least 15 lines
@@ -489,7 +489,7 @@ If all tests pass, your dataset works correctly. Awesome! You can now follow the

.. code-block::
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
and make sure you follow the exact instructions provided by the command.

2 changes: 1 addition & 1 deletion setup.py
@@ -204,7 +204,7 @@
"scripts/templates/*",
],
},
scripts=["datasets-cli"],
entry_points={"console_scripts": ["datasets-cli=datasets.commands.datasets_cli:main"]},
install_requires=REQUIRED_PKGS,
extras_require=EXTRAS_REQUIRE,
classifiers=[
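For context on this change: a `console_scripts` entry point asks setuptools/pip to generate a platform-appropriate launcher at install time (a shim script on Unix, an `.exe` wrapper on Windows) that imports and calls the named function, which is what makes the bare `datasets-cli` command work across platforms. A minimal, generic sketch of the mechanism, using a hypothetical package name rather than the real `datasets` setup:

```python
# setup.py for a hypothetical package "mytool" (illustrative sketch only)
from setuptools import find_packages, setup

setup(
    name="mytool",
    version="0.1.0",
    packages=find_packages(),
    # After `pip install .`, typing `mytool` on the command line
    # resolves to the function main() in the module mytool/cli.py
    entry_points={"console_scripts": ["mytool=mytool.cli:main"]},
)
```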
17 changes: 11 additions & 6 deletions datasets-cli → src/datasets/commands/datasets_cli.py
@@ -3,15 +3,16 @@

from datasets.commands.convert import ConvertCommand
from datasets.commands.download import DownloadCommand
from datasets.commands.dummy_data import DummyDataCommand
from datasets.commands.env import EnvironmentCommand
from datasets.commands.test import TestCommand
from datasets.commands.run_beam import RunBeamCommand
from datasets.commands.dummy_data import DummyDataCommand
from datasets.commands.test import TestCommand
from datasets.utils.logging import set_verbosity_info

if __name__ == '__main__':
parser = ArgumentParser('HuggingFace Datasets CLI tool', usage='datasets-cli <command> [<args>]')
commands_parser = parser.add_subparsers(help='datasets-cli command helpers')

def main():
parser = ArgumentParser("HuggingFace Datasets CLI tool", usage="datasets-cli <command> [<args>]")
commands_parser = parser.add_subparsers(help="datasets-cli command helpers")
set_verbosity_info()

# Register commands
@@ -25,10 +26,14 @@
# Let's go
args = parser.parse_args()

if not hasattr(args, 'func'):
if not hasattr(args, "func"):
parser.print_help()
exit(1)

# Run
service = args.func(args)
service.run()


if __name__ == "__main__":
main()
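The `main()` function above assumes that each command class registers itself on the subparser and that `args.func(args)` returns an object with a `run()` method. A minimal, self-contained sketch of that register/run pattern follows; the class and command names are invented for illustration and are not the actual `datasets.commands` API:

```python
from argparse import ArgumentParser


class HelloCommand:
    """Toy command illustrating the register/run pattern used by the CLI."""

    @staticmethod
    def register_subcommand(commands_parser):
        parser = commands_parser.add_parser("hello", help="Print a greeting")
        parser.add_argument("--name", default="world")
        # main() later calls args.func(args) to build the command instance
        parser.set_defaults(func=lambda args: HelloCommand(args.name))

    def __init__(self, name):
        self.name = name

    def run(self):
        print(f"Hello, {self.name}!")


def main():
    parser = ArgumentParser("toy-cli", usage="toy-cli <command> [<args>]")
    commands_parser = parser.add_subparsers(help="toy-cli command helpers")
    HelloCommand.register_subcommand(commands_parser)

    args = parser.parse_args()
    if not hasattr(args, "func"):
        parser.print_help()
        exit(1)

    service = args.func(args)
    service.run()


if __name__ == "__main__":
    main()
```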

1 comment on commit 987df6b

@github-actions

PyArrow==0.17.1


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.017567 / 0.011353 (0.006214) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.015238 / 0.011008 (0.004230) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.050593 / 0.038508 (0.012085) |
| read_batch_unformated after write_array2d | 0.039841 / 0.023109 (0.016732) |
| read_batch_unformated after write_flattened_sequence | 0.211628 / 0.275898 (-0.064270) |
| read_batch_unformated after write_nested_sequence | 0.265909 / 0.323480 (-0.057571) |
| read_col_formatted_as_numpy after write_array2d | 0.005319 / 0.007986 (-0.002667) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004608 / 0.004328 (0.000279) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008271 / 0.004250 (0.004020) |
| read_col_unformated after write_array2d | 0.055659 / 0.037052 (0.018606) |
| read_col_unformated after write_flattened_sequence | 0.211829 / 0.258489 (-0.046660) |
| read_col_unformated after write_nested_sequence | 0.253500 / 0.293841 (-0.040341) |
| read_formatted_as_numpy after write_array2d | 0.156957 / 0.128546 (0.028411) |
| read_formatted_as_numpy after write_flattened_sequence | 0.114068 / 0.075646 (0.038421) |
| read_formatted_as_numpy after write_nested_sequence | 0.446682 / 0.419271 (0.027411) |
| read_unformated after write_array2d | 0.423945 / 0.043533 (0.380412) |
| read_unformated after write_flattened_sequence | 0.210210 / 0.255139 (-0.044929) |
| read_unformated after write_nested_sequence | 0.238418 / 0.283200 (-0.044782) |
| write_array2d | 1.739540 / 0.141683 (1.597857) |
| write_flattened_sequence | 1.875183 / 1.452155 (0.423029) |
| write_nested_sequence | 2.017085 / 1.492716 (0.524369) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.042487 / 0.037411 (0.005076) |
| shard | 0.019896 / 0.014526 (0.005371) |
| shuffle | 0.026964 / 0.176557 (-0.149592) |
| sort | 0.048420 / 0.737135 (-0.688716) |
| train_test_split | 0.055410 / 0.296338 (-0.240929) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.227567 / 0.215209 (0.012358) |
| read 50000 | 2.274789 / 2.077655 (0.197134) |
| read_batch 50000 10 | 1.286020 / 1.504120 (-0.218100) |
| read_batch 50000 100 | 1.163024 / 1.541195 (-0.378171) |
| read_batch 50000 1000 | 1.211820 / 1.468490 (-0.256670) |
| read_formatted numpy 5000 | 6.582974 / 4.584777 (1.998197) |
| read_formatted pandas 5000 | 5.993825 / 3.745712 (2.248113) |
| read_formatted tensorflow 5000 | 8.206963 / 5.269862 (2.937102) |
| read_formatted torch 5000 | 7.228790 / 4.565676 (2.663113) |
| read_formatted_batch numpy 5000 10 | 0.608703 / 0.424275 (0.184428) |
| read_formatted_batch numpy 5000 1000 | 0.010694 / 0.007607 (0.003087) |
| shuffled read 5000 | 0.259782 / 0.226044 (0.033738) |
| shuffled read 50000 | 2.703082 / 2.268929 (0.434153) |
| shuffled read_batch 50000 10 | 1.739875 / 55.444624 (-53.704749) |
| shuffled read_batch 50000 100 | 1.547534 / 6.876477 (-5.328943) |
| shuffled read_batch 50000 1000 | 1.607167 / 2.142072 (-0.534905) |
| shuffled read_formatted numpy 5000 | 6.656406 / 4.805227 (1.851178) |
| shuffled read_formatted_batch numpy 5000 10 | 6.331402 / 6.500664 (-0.169262) |
| shuffled read_formatted_batch numpy 5000 1000 | 8.245505 / 0.075469 (8.170036) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 10.309891 / 1.841788 (8.468103) |
| map fast-tokenizer batched | 15.517243 / 8.074308 (7.442935) |
| map identity | 17.330151 / 10.191392 (7.138759) |
| map identity batched | 0.480750 / 0.680424 (-0.199674) |
| map no-op batched | 0.290721 / 0.534201 (-0.243480) |
| map no-op batched numpy | 0.893870 / 0.579283 (0.314587) |
| map no-op batched pandas | 0.566911 / 0.434364 (0.132547) |
| map no-op batched pytorch | 0.670112 / 0.540337 (0.129774) |
| map no-op batched tensorflow | 1.498878 / 1.386936 (0.111942) |
PyArrow==1.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.016369 / 0.011353 (0.005016) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.014744 / 0.011008 (0.003736) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.045443 / 0.038508 (0.006935) |
| read_batch_unformated after write_array2d | 0.036267 / 0.023109 (0.013158) |
| read_batch_unformated after write_flattened_sequence | 0.338517 / 0.275898 (0.062619) |
| read_batch_unformated after write_nested_sequence | 0.374811 / 0.323480 (0.051331) |
| read_col_formatted_as_numpy after write_array2d | 0.004749 / 0.007986 (-0.003236) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004568 / 0.004328 (0.000239) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007345 / 0.004250 (0.003095) |
| read_col_unformated after write_array2d | 0.055569 / 0.037052 (0.018517) |
| read_col_unformated after write_flattened_sequence | 0.337889 / 0.258489 (0.079400) |
| read_col_unformated after write_nested_sequence | 0.383008 / 0.293841 (0.089167) |
| read_formatted_as_numpy after write_array2d | 0.140213 / 0.128546 (0.011667) |
| read_formatted_as_numpy after write_flattened_sequence | 0.117444 / 0.075646 (0.041797) |
| read_formatted_as_numpy after write_nested_sequence | 0.442361 / 0.419271 (0.023089) |
| read_unformated after write_array2d | 0.405459 / 0.043533 (0.361926) |
| read_unformated after write_flattened_sequence | 0.336251 / 0.255139 (0.081112) |
| read_unformated after write_nested_sequence | 0.368280 / 0.283200 (0.085080) |
| write_array2d | 1.711196 / 0.141683 (1.569514) |
| write_flattened_sequence | 1.859199 / 1.452155 (0.407044) |
| write_nested_sequence | 1.913589 / 1.492716 (0.420873) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.042923 / 0.037411 (0.005512) |
| shard | 0.021655 / 0.014526 (0.007129) |
| shuffle | 0.090128 / 0.176557 (-0.086429) |
| sort | 0.049087 / 0.737135 (-0.688048) |
| train_test_split | 0.032439 / 0.296338 (-0.263899) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.293530 / 0.215209 (0.078321) |
| read 50000 | 2.952296 / 2.077655 (0.874641) |
| read_batch 50000 10 | 1.936740 / 1.504120 (0.432620) |
| read_batch 50000 100 | 1.827431 / 1.541195 (0.286236) |
| read_batch 50000 1000 | 1.879376 / 1.468490 (0.410886) |
| read_formatted numpy 5000 | 6.286498 / 4.584777 (1.701722) |
| read_formatted pandas 5000 | 5.555648 / 3.745712 (1.809936) |
| read_formatted tensorflow 5000 | 7.847458 / 5.269862 (2.577597) |
| read_formatted torch 5000 | 7.045829 / 4.565676 (2.480152) |
| read_formatted_batch numpy 5000 10 | 0.650825 / 0.424275 (0.226550) |
| read_formatted_batch numpy 5000 1000 | 0.010249 / 0.007607 (0.002642) |
| shuffled read 5000 | 0.335669 / 0.226044 (0.109624) |
| shuffled read 50000 | 3.347644 / 2.268929 (1.078715) |
| shuffled read_batch 50000 10 | 2.323193 / 55.444624 (-53.121432) |
| shuffled read_batch 50000 100 | 2.139879 / 6.876477 (-4.736598) |
| shuffled read_batch 50000 1000 | 2.210068 / 2.142072 (0.067995) |
| shuffled read_formatted numpy 5000 | 6.378095 / 4.805227 (1.572868) |
| shuffled read_formatted_batch numpy 5000 10 | 3.947257 / 6.500664 (-2.553407) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.683651 / 0.075469 (7.608182) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 10.681646 / 1.841788 (8.839858) |
| map fast-tokenizer batched | 14.479535 / 8.074308 (6.405227) |
| map identity | 18.273333 / 10.191392 (8.081941) |
| map identity batched | 0.780357 / 0.680424 (0.099934) |
| map no-op batched | 0.595953 / 0.534201 (0.061752) |
| map no-op batched numpy | 0.709879 / 0.579283 (0.130596) |
| map no-op batched pandas | 0.549540 / 0.434364 (0.115176) |
| map no-op batched pytorch | 0.639795 / 0.540337 (0.099458) |
| map no-op batched tensorflow | 1.560044 / 1.386936 (0.173108) |
