Add cross-platform support for datasets-cli (#1951)
mariosasko authored Feb 26, 2021
1 parent cadd48e commit 987df6b
Showing 7 changed files with 44 additions and 29 deletions.
19 changes: 12 additions & 7 deletions ADD_NEW_DATASET.md
@@ -144,12 +144,12 @@ Sometimes you need to use several *configurations* and/or *splits* (usually at l
**Last step:** To check that your dataset works correctly and to create its `dataset_infos.json` file, run the command:

```bash
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```

**Note:** If your dataset requires manually downloading the data and having the user provide the path to the dataset, you can run the following command:
```bash
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs --data_dir your/manual/dir
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs --data_dir your/manual/dir
```
This makes the configs use the path from `--data_dir` when generating them.

@@ -164,19 +164,19 @@ Now that your dataset script runs and creates a dataset with the format you expec
If the extensions of the raw data files of your dataset are in this list, then you can automatically generate your dummy data with:

```bash
python datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
```

Example:

```bash
python datasets-cli dummy_data ./datasets/snli --auto_generate
datasets-cli dummy_data ./datasets/snli --auto_generate
```

If your data files are not in the supported format, you can run the same command without the `--auto_generate` flag. It should give you instructions on the files to manually create (basically, the same ones as for the real dataset but with only five items).

```bash
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
```

If this doesn't work, more information on how to add dummy data can be found in the documentation [here](https://huggingface.co/docs/datasets/share_dataset.html#adding-dummy-data).
@@ -208,14 +208,19 @@ Go to the next step (open a Pull Request) and we'll help you cross the finish li
3. If all tests pass, your dataset works correctly. You can finally create the metadata JSON by running the command:

```bash
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```

This first command should create a `dataset_infos.json` file in your dataset folder.


You have now finished the coding part, congratulations! 🎉 You are awesome! 😎

Note: You can use the CLI tool from the root of the repository with the following command:
```bash
python src/datasets/commands/datasets_cli.py <command>
```
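For illustration, the same commands should also be available directly through the `datasets-cli` console script that this commit registers as an entry point, once the package is installed (a minimal sketch, assuming an editable install such as `pip install -e .` from the repository root):

```bash
# Assumes `pip install -e .` has been run from the repository root
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```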

### Open a Pull Request on the main HuggingFace repo and share your work!!

Here are the steps to open the Pull Request on the main repo.
13 changes: 9 additions & 4 deletions CONTRIBUTING.md
@@ -70,20 +70,20 @@ A [more complete guide](https://github.com/huggingface/datasets/blob/master/ADD_
3. **Make sure you run all of the following commands from the root of your `datasets` git clone.** To check that your dataset works correctly and to create its `dataset_infos.json` file, run the command:

```bash
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```

4. If the command was successful, you should now create some dummy data. Use the following command to get in-detail instructions on how to create the dummy data:

```bash
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
```

There is a tool that automatically generates dummy data for you. At the moment it supports data files in the following formats: txt, csv, tsv, jsonl, json, xml.
If the extensions of the raw data files of your dataset are in this list, then you can automatically generate your dummy data with:

```bash
python datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
```

5. Now test that both the real data and the dummy data work correctly using the following commands:
@@ -110,7 +110,7 @@ Follow these steps in case the dummy data test keeps failing:

- Verify that all filenames are spelled correctly. Rerun the command
```bash
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
```
and make sure you follow the exact instructions provided by the command in step 5.

@@ -120,6 +120,11 @@ Follow these steps in case the dummy data test keeps failing:

If you're looking for more details about dataset scripts creation, please refer to the [documentation](https://huggingface.co/docs/datasets/add_dataset.html).

Note: You can use the CLI tool from the root of the repository with the following command:
```bash
python src/datasets/commands/datasets_cli.py <command>
```
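Because the CLI entry point now lives inside the package and exposes a `main()` function guarded by `if __name__ == "__main__"`, the module form should be an equivalent invocation (a sketch, assuming the `datasets` package is importable in the current environment, for example after an editable install):

```bash
# Assumes `datasets` is importable, e.g. after `pip install -e .`
python -m datasets.commands.datasets_cli test datasets/<your-dataset-folder> --save_infos --all_configs
```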

## How to contribute to the dataset cards

Improving the documentation of datasets is an ever increasing effort and we invite users to contribute by sharing their insights with the community in the `README.md` dataset cards provided for each dataset.
6 changes: 3 additions & 3 deletions convert_dataset.sh
@@ -30,7 +30,7 @@ if [ -f "${pathToFolder}/${datasetName}.py" ]; then
echo "### STEP 1 ### ${datasetName} is already converted. To convert it again remove ${pathToFolder}/${datasetName}."
else
echo "### STEP 1 ### Converting ${datasetName} dataset ..."
eval "python datasets-cli convert --tfds_path ${pathToFile} --datasets_directory datasets/"
eval "datasets-cli convert --tfds_path ${pathToFile} --datasets_directory datasets/"
fi

if [ -f "${pathToFolder}/${datasetName}.py" ]; then
@@ -51,9 +51,9 @@ if [ -f "${pathToFolder}/dataset_infos.json" ]; then
else
echo "### STEP 2 ### Create infos ..."
if [ -z "${manual_dir}" ]; then
eval "python datasets-cli test ${pathToFolder} --save_infos --all_configs"
eval "datasets-cli test ${pathToFolder} --save_infos --all_configs"
else
eval "python datasets-cli test ${pathToFolder} --data_dir ${manual_dir} --save_infos --all_configs"
eval "datasets-cli test ${pathToFolder} --data_dir ${manual_dir} --save_infos --all_configs"
fi
fi

2 changes: 1 addition & 1 deletion docs/source/beam_dataset.rst
@@ -46,7 +46,7 @@ If you want to run the Beam pipeline of a dataset anyway, here are the different

.. code::
python -mdatasets-cli run_beam datasets/$DATASET_NAME \
datasets-cli run_beam datasets/$DATASET_NAME \
--name $CONFIG_NAME \
--save_infos \
--cache_dir gs://$BUCKET/cache/datasets \
14 changes: 7 additions & 7 deletions docs/source/share_dataset.rst
@@ -308,7 +308,7 @@ You can check that the new dataset loading script works correctly and create the

.. code-block::
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
If the command was successful, you should now have a ``dataset_infos.json`` file created in the folder of your dataset loading script. Here is a dummy example of the content for a dataset with a single configuration:

@@ -379,7 +379,7 @@ Now that we have the metadata prepared we can also create some dummy data for au

.. code-block::
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
This command will output instructions specifically tailored to your dataset and will look like:

@@ -408,15 +408,15 @@ If the extensions of the raw data files of your dataset are in this list, then y

.. code-block::
python datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
Examples:

.. code-block::
python datasets-cli dummy_data ./datasets/snli --auto_generate
python datasets-cli dummy_data ./datasets/squad --auto_generate --json_field data
python datasets-cli dummy_data ./datasets/iwslt2017 --auto_generate --xml_tag seg --match_text_files "train*" --n_lines 15
datasets-cli dummy_data ./datasets/snli --auto_generate
datasets-cli dummy_data ./datasets/squad --auto_generate --json_field data
datasets-cli dummy_data ./datasets/iwslt2017 --auto_generate --xml_tag seg --match_text_files "train*" --n_lines 15
# --xml_tag seg => each sample corresponds to a "seg" tag in the xml tree
# --match_text_files "train*" => also match text files that don't have a proper text file extension (no suffix like ".txt" for example)
# --n_lines 15 => some text files have headers so we have to use at least 15 lines
@@ -489,7 +489,7 @@ If all tests pass, your dataset works correctly. Awesome! You can now follow the

.. code-block::
python datasets-cli dummy_data datasets/<your-dataset-folder>
datasets-cli dummy_data datasets/<your-dataset-folder>
and make sure you follow the exact instructions provided by the command.

2 changes: 1 addition & 1 deletion setup.py
@@ -204,7 +204,7 @@
"scripts/templates/*",
],
},
scripts=["datasets-cli"],
entry_points={"console_scripts": ["datasets-cli=datasets.commands.datasets_cli:main"]},
install_requires=REQUIRED_PKGS,
extras_require=EXTRAS_REQUIRE,
classifiers=[
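For context on this change: a `console_scripts` entry point asks setuptools/pip to generate a platform-appropriate launcher at install time (a shim script on Unix, an `.exe` wrapper on Windows) that imports and calls the named function, which is what makes the bare `datasets-cli` command work across platforms. A minimal, generic sketch of the mechanism, using a hypothetical package name rather than the real `datasets` setup:

```python
# setup.py for a hypothetical package "mytool" (illustrative sketch only)
from setuptools import find_packages, setup

setup(
    name="mytool",
    version="0.1.0",
    packages=find_packages(),
    # After `pip install .`, typing `mytool` on the command line
    # resolves to the function main() in the module mytool/cli.py
    entry_points={"console_scripts": ["mytool=mytool.cli:main"]},
)
```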
17 changes: 11 additions & 6 deletions datasets-cli → src/datasets/commands/datasets_cli.py
@@ -3,15 +3,16 @@

from datasets.commands.convert import ConvertCommand
from datasets.commands.download import DownloadCommand
from datasets.commands.dummy_data import DummyDataCommand
from datasets.commands.env import EnvironmentCommand
from datasets.commands.test import TestCommand
from datasets.commands.run_beam import RunBeamCommand
from datasets.commands.dummy_data import DummyDataCommand
from datasets.commands.test import TestCommand
from datasets.utils.logging import set_verbosity_info

if __name__ == '__main__':
parser = ArgumentParser('HuggingFace Datasets CLI tool', usage='datasets-cli <command> [<args>]')
commands_parser = parser.add_subparsers(help='datasets-cli command helpers')

def main():
parser = ArgumentParser("HuggingFace Datasets CLI tool", usage="datasets-cli <command> [<args>]")
commands_parser = parser.add_subparsers(help="datasets-cli command helpers")
set_verbosity_info()

# Register commands
@@ -25,10 +26,14 @@
# Let's go
args = parser.parse_args()

if not hasattr(args, 'func'):
if not hasattr(args, "func"):
parser.print_help()
exit(1)

# Run
service = args.func(args)
service.run()


if __name__ == "__main__":
main()
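The `main()` function above assumes that each command class registers itself on the subparser and that `args.func(args)` returns an object with a `run()` method. A minimal, self-contained sketch of that register/run pattern follows; the class and command names are invented for illustration and are not the actual `datasets.commands` API:

```python
from argparse import ArgumentParser


class HelloCommand:
    """Toy command illustrating the register/run pattern used by the CLI."""

    @staticmethod
    def register_subcommand(commands_parser):
        parser = commands_parser.add_parser("hello", help="Print a greeting")
        parser.add_argument("--name", default="world")
        # main() later calls args.func(args) to build the command instance
        parser.set_defaults(func=lambda args: HelloCommand(args.name))

    def __init__(self, name):
        self.name = name

    def run(self):
        print(f"Hello, {self.name}!")


def main():
    parser = ArgumentParser("toy-cli", usage="toy-cli <command> [<args>]")
    commands_parser = parser.add_subparsers(help="toy-cli command helpers")
    HelloCommand.register_subcommand(commands_parser)

    args = parser.parse_args()
    if not hasattr(args, "func"):
        parser.print_help()
        exit(1)

    service = args.func(args)
    service.run()


if __name__ == "__main__":
    main()
```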

1 comment on commit 987df6b

@github-actions

PyArrow==0.17.1


Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.017567 / 0.011353 (0.006214) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.015238 / 0.011008 (0.004230) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.050593 / 0.038508 (0.012085) |
| read_batch_unformated after write_array2d | 0.039841 / 0.023109 (0.016732) |
| read_batch_unformated after write_flattened_sequence | 0.211628 / 0.275898 (-0.064270) |
| read_batch_unformated after write_nested_sequence | 0.265909 / 0.323480 (-0.057571) |
| read_col_formatted_as_numpy after write_array2d | 0.005319 / 0.007986 (-0.002667) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004608 / 0.004328 (0.000279) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.008271 / 0.004250 (0.004020) |
| read_col_unformated after write_array2d | 0.055659 / 0.037052 (0.018606) |
| read_col_unformated after write_flattened_sequence | 0.211829 / 0.258489 (-0.046660) |
| read_col_unformated after write_nested_sequence | 0.253500 / 0.293841 (-0.040341) |
| read_formatted_as_numpy after write_array2d | 0.156957 / 0.128546 (0.028411) |
| read_formatted_as_numpy after write_flattened_sequence | 0.114068 / 0.075646 (0.038421) |
| read_formatted_as_numpy after write_nested_sequence | 0.446682 / 0.419271 (0.027411) |
| read_unformated after write_array2d | 0.423945 / 0.043533 (0.380412) |
| read_unformated after write_flattened_sequence | 0.210210 / 0.255139 (-0.044929) |
| read_unformated after write_nested_sequence | 0.238418 / 0.283200 (-0.044782) |
| write_array2d | 1.739540 / 0.141683 (1.597857) |
| write_flattened_sequence | 1.875183 / 1.452155 (0.423029) |
| write_nested_sequence | 2.017085 / 1.492716 (0.524369) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.042487 / 0.037411 (0.005076) |
| shard | 0.019896 / 0.014526 (0.005371) |
| shuffle | 0.026964 / 0.176557 (-0.149592) |
| sort | 0.048420 / 0.737135 (-0.688716) |
| train_test_split | 0.055410 / 0.296338 (-0.240929) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.227567 / 0.215209 (0.012358) |
| read 50000 | 2.274789 / 2.077655 (0.197134) |
| read_batch 50000 10 | 1.286020 / 1.504120 (-0.218100) |
| read_batch 50000 100 | 1.163024 / 1.541195 (-0.378171) |
| read_batch 50000 1000 | 1.211820 / 1.468490 (-0.256670) |
| read_formatted numpy 5000 | 6.582974 / 4.584777 (1.998197) |
| read_formatted pandas 5000 | 5.993825 / 3.745712 (2.248113) |
| read_formatted tensorflow 5000 | 8.206963 / 5.269862 (2.937102) |
| read_formatted torch 5000 | 7.228790 / 4.565676 (2.663113) |
| read_formatted_batch numpy 5000 10 | 0.608703 / 0.424275 (0.184428) |
| read_formatted_batch numpy 5000 1000 | 0.010694 / 0.007607 (0.003087) |
| shuffled read 5000 | 0.259782 / 0.226044 (0.033738) |
| shuffled read 50000 | 2.703082 / 2.268929 (0.434153) |
| shuffled read_batch 50000 10 | 1.739875 / 55.444624 (-53.704749) |
| shuffled read_batch 50000 100 | 1.547534 / 6.876477 (-5.328943) |
| shuffled read_batch 50000 1000 | 1.607167 / 2.142072 (-0.534905) |
| shuffled read_formatted numpy 5000 | 6.656406 / 4.805227 (1.851178) |
| shuffled read_formatted_batch numpy 5000 10 | 6.331402 / 6.500664 (-0.169262) |
| shuffled read_formatted_batch numpy 5000 1000 | 8.245505 / 0.075469 (8.170036) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 10.309891 / 1.841788 (8.468103) |
| map fast-tokenizer batched | 15.517243 / 8.074308 (7.442935) |
| map identity | 17.330151 / 10.191392 (7.138759) |
| map identity batched | 0.480750 / 0.680424 (-0.199674) |
| map no-op batched | 0.290721 / 0.534201 (-0.243480) |
| map no-op batched numpy | 0.893870 / 0.579283 (0.314587) |
| map no-op batched pandas | 0.566911 / 0.434364 (0.132547) |
| map no-op batched pytorch | 0.670112 / 0.540337 (0.129774) |
| map no-op batched tensorflow | 1.498878 / 1.386936 (0.111942) |
PyArrow==1.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.016369 / 0.011353 (0.005016) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.014744 / 0.011008 (0.003736) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.045443 / 0.038508 (0.006935) |
| read_batch_unformated after write_array2d | 0.036267 / 0.023109 (0.013158) |
| read_batch_unformated after write_flattened_sequence | 0.338517 / 0.275898 (0.062619) |
| read_batch_unformated after write_nested_sequence | 0.374811 / 0.323480 (0.051331) |
| read_col_formatted_as_numpy after write_array2d | 0.004749 / 0.007986 (-0.003236) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004568 / 0.004328 (0.000239) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007345 / 0.004250 (0.003095) |
| read_col_unformated after write_array2d | 0.055569 / 0.037052 (0.018517) |
| read_col_unformated after write_flattened_sequence | 0.337889 / 0.258489 (0.079400) |
| read_col_unformated after write_nested_sequence | 0.383008 / 0.293841 (0.089167) |
| read_formatted_as_numpy after write_array2d | 0.140213 / 0.128546 (0.011667) |
| read_formatted_as_numpy after write_flattened_sequence | 0.117444 / 0.075646 (0.041797) |
| read_formatted_as_numpy after write_nested_sequence | 0.442361 / 0.419271 (0.023089) |
| read_unformated after write_array2d | 0.405459 / 0.043533 (0.361926) |
| read_unformated after write_flattened_sequence | 0.336251 / 0.255139 (0.081112) |
| read_unformated after write_nested_sequence | 0.368280 / 0.283200 (0.085080) |
| write_array2d | 1.711196 / 0.141683 (1.569514) |
| write_flattened_sequence | 1.859199 / 1.452155 (0.407044) |
| write_nested_sequence | 1.913589 / 1.492716 (0.420873) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.042923 / 0.037411 (0.005512) |
| shard | 0.021655 / 0.014526 (0.007129) |
| shuffle | 0.090128 / 0.176557 (-0.086429) |
| sort | 0.049087 / 0.737135 (-0.688048) |
| train_test_split | 0.032439 / 0.296338 (-0.263899) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.293530 / 0.215209 (0.078321) |
| read 50000 | 2.952296 / 2.077655 (0.874641) |
| read_batch 50000 10 | 1.936740 / 1.504120 (0.432620) |
| read_batch 50000 100 | 1.827431 / 1.541195 (0.286236) |
| read_batch 50000 1000 | 1.879376 / 1.468490 (0.410886) |
| read_formatted numpy 5000 | 6.286498 / 4.584777 (1.701722) |
| read_formatted pandas 5000 | 5.555648 / 3.745712 (1.809936) |
| read_formatted tensorflow 5000 | 7.847458 / 5.269862 (2.577597) |
| read_formatted torch 5000 | 7.045829 / 4.565676 (2.480152) |
| read_formatted_batch numpy 5000 10 | 0.650825 / 0.424275 (0.226550) |
| read_formatted_batch numpy 5000 1000 | 0.010249 / 0.007607 (0.002642) |
| shuffled read 5000 | 0.335669 / 0.226044 (0.109624) |
| shuffled read 50000 | 3.347644 / 2.268929 (1.078715) |
| shuffled read_batch 50000 10 | 2.323193 / 55.444624 (-53.121432) |
| shuffled read_batch 50000 100 | 2.139879 / 6.876477 (-4.736598) |
| shuffled read_batch 50000 1000 | 2.210068 / 2.142072 (0.067995) |
| shuffled read_formatted numpy 5000 | 6.378095 / 4.805227 (1.572868) |
| shuffled read_formatted_batch numpy 5000 10 | 3.947257 / 6.500664 (-2.553407) |
| shuffled read_formatted_batch numpy 5000 1000 | 7.683651 / 0.075469 (7.608182) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 10.681646 / 1.841788 (8.839858) |
| map fast-tokenizer batched | 14.479535 / 8.074308 (6.405227) |
| map identity | 18.273333 / 10.191392 (8.081941) |
| map identity batched | 0.780357 / 0.680424 (0.099934) |
| map no-op batched | 0.595953 / 0.534201 (0.061752) |
| map no-op batched numpy | 0.709879 / 0.579283 (0.130596) |
| map no-op batched pandas | 0.549540 / 0.434364 (0.115176) |
| map no-op batched pytorch | 0.639795 / 0.540337 (0.099458) |
| map no-op batched tensorflow | 1.560044 / 1.386936 (0.173108) |
