Add docs about the CLI #6831

Merged 7 commits on Apr 25, 2024
2 changes: 2 additions & 0 deletions docs/source/_toctree.yml
@@ -46,6 +46,8 @@
     title: Cloud storage
   - local: faiss_es
     title: Search index
+  - local: cli
+    title: CLI
   - local: how_to_metrics
     title: Metrics
   - local: beam
53 changes: 53 additions & 0 deletions docs/source/cli.mdx
@@ -0,0 +1,53 @@
# Command Line Interface (CLI)

🤗 Datasets provides a command line interface (CLI) with useful shell commands to interact with your dataset.

You can check the available commands:
```bash
>>> datasets-cli --help
usage: datasets-cli <command> [<args>]

positional arguments:
  {convert,env,test,run_beam,dummy_data,convert_to_parquet}
                        datasets-cli command helpers
    convert             Convert a TensorFlow Datasets dataset to a HuggingFace Datasets dataset.
    env                 Print relevant system environment info.
    test                Test dataset implementation.
    run_beam            Run a Beam dataset processing pipeline
    dummy_data          Generate dummy data.
    convert_to_parquet  Convert dataset to Parquet

optional arguments:
  -h, --help            show this help message and exit
```

## Convert to Parquet

Convert your Hub [script-based dataset](dataset_script) to a Parquet [data-only dataset](repository_structure) so that the dataset viewer can support it.

```bash
>>> datasets-cli convert_to_parquet --help
usage: datasets-cli <command> [<args>] convert_to_parquet [-h] [--token TOKEN] [--revision REVISION] [--trust_remote_code] dataset_id

positional arguments:
  dataset_id           source dataset ID, e.g. USERNAME/DATASET_NAME or ORGANIZATION/DATASET_NAME

optional arguments:
  -h, --help           show this help message and exit
  --token TOKEN        access token to the Hugging Face Hub
  --revision REVISION  source revision
  --trust_remote_code  whether to trust the code execution of the load script
```

This command:
- makes a copy of the script on the "main" branch into a dedicated branch called "script" (if it does not already exist)
- creates a pull request to the Hub dataset to convert it to Parquet files (and deletes the script from the main branch)

If in the future you need to recreate the Parquet files from the "script" branch, pass the `--revision script` argument.

Note that you should pass the `--trust_remote_code` argument only if you trust the remote code to be executed locally on your machine.
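
The options described above can also be assembled programmatically. The following is a minimal sketch, not part of the library: `build_convert_command` is a hypothetical helper that only builds the argument list from the flags documented here (`--revision`, `--trust_remote_code`) and prints the resulting command without running it.

```python
import shlex
from typing import List, Optional


def build_convert_command(
    dataset_id: str,
    revision: Optional[str] = None,
    trust_remote_code: bool = False,
) -> List[str]:
    """Hypothetical helper: build the argv for `datasets-cli convert_to_parquet`."""
    argv = ["datasets-cli", "convert_to_parquet", dataset_id]
    if revision is not None:
        # e.g. "script" to recreate the Parquet files from the saved "script" branch
        argv += ["--revision", revision]
    if trust_remote_code:
        # Opt in only when the dataset's loading script is trusted
        argv.append("--trust_remote_code")
    return argv


# Recreate the Parquet files from the "script" branch.
# The command is only printed here, not executed.
command = build_convert_command(
    "USERNAME/DATASET_NAME", revision="script", trust_remote_code=True
)
print(shlex.join(command))
```

Keeping the command as a list rather than a single string avoids shell-quoting issues if it is later passed to something like `subprocess.run`.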

For example:
```bash
>>> datasets-cli convert_to_parquet USERNAME/DATASET_NAME --token USER_ACCESS_TOKEN
```

Review comment (Contributor), suggested change: replace `>>> datasets-cli convert_to_parquet USERNAME/DATASET_NAME --token USER_ACCESS_TOKEN` with `>>> datasets-cli convert_to_parquet USERNAME/DATASET_NAME`.

I would not advertise the --token arg in the example as this shouldn't be the recommended way (best to login with env variable or huggingface-cli login)

Reply (Member Author): Thanks for your comment @Wauplin. I am fixing that.

4 changes: 3 additions & 1 deletion src/datasets/commands/convert_to_parquet.py
@@ -21,7 +21,9 @@ class ConvertToParquetCommand(BaseDatasetsCLICommand):
     @staticmethod
     def register_subcommand(parser):
         parser: ArgumentParser = parser.add_parser("convert_to_parquet", help="Convert dataset to Parquet")
-        parser.add_argument("dataset_id", help="source dataset ID")
+        parser.add_argument(
+            "dataset_id", help="source dataset ID, e.g. USERNAME/DATASET_NAME or ORGANIZATION/DATASET_NAME"
+        )
         parser.add_argument("--token", help="access token to the Hugging Face Hub")
         parser.add_argument("--revision", help="source revision")
         parser.add_argument(
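
To see how the arguments registered in this hunk behave, here is a minimal standalone sketch using only `argparse` from the standard library. It is not the project's actual CLI wiring: the `root` parser is a stand-in for the top-level `datasets-cli` parser, and `--trust_remote_code` is assumed to be a `store_true` flag because the hunk is cut off before that argument's definition.

```python
from argparse import ArgumentParser

# Stand-in for the top-level `datasets-cli` parser (an assumption for this sketch).
root = ArgumentParser("datasets-cli")
commands = root.add_subparsers(dest="command")

# Mirrors the registrations shown in the hunk.
parser = commands.add_parser("convert_to_parquet", help="Convert dataset to Parquet")
parser.add_argument(
    "dataset_id", help="source dataset ID, e.g. USERNAME/DATASET_NAME or ORGANIZATION/DATASET_NAME"
)
parser.add_argument("--token", help="access token to the Hugging Face Hub")
parser.add_argument("--revision", help="source revision")
# Assumed to be a boolean flag; the hunk is truncated before this definition.
parser.add_argument(
    "--trust_remote_code",
    action="store_true",
    help="whether to trust the code execution of the load script",
)

# Parse a sample invocation instead of reading sys.argv.
args = root.parse_args(["convert_to_parquet", "USERNAME/DATASET_NAME", "--revision", "script"])
print(args.dataset_id, args.revision, args.token, args.trust_remote_code)
```

Omitted options default to `None` (or `False` for the assumed flag), so the command can tell whether a token or revision was supplied.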