# Document usage of hfh cli instead of git #6648

Merged · 2 commits · Feb 8, 2024
document usage of hfh cli instead of git
lhoestq committed Feb 8, 2024
commit 0b4cfd693094a28dae3b08dc67fd7f2febde2a66
172 changes: 130 additions & 42 deletions docs/source/share.mdx
Dataset repositories offer features such as:
- Commit history and diffs
- Metadata for discoverability
- Dataset cards for documentation, licensing, limitations, etc.
- [Dataset Viewer](../hub/datasets-viewer)

This guide will show you how to share a dataset folder or repository that can be easily accessed by anyone.

<a id='upload_dataset_repo'></a>

You can share your dataset with the community with a dataset repository on the Hugging Face Hub.
It can also be a private dataset if you want to control who has access to it.

In a dataset repository, you can host all your data files and [configure your dataset](./repository_structure#define-your-splits-in-yaml) to define which file goes to which split.
The following formats are supported: CSV, TSV, JSON, JSON lines, text, Parquet, Arrow, SQLite, WebDataset.
Many kinds of compressed file types are also supported: GZ, BZ2, LZ4, LZMA or ZSTD.
For example, your dataset can be made of `.json.gz` files.
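
For instance, the YAML section of your dataset card can define which files go to which split (a minimal sketch, assuming CSV files under `data/`; see the linked guide for the exact syntax):

```yaml
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.csv
  - split: test
    path: data/test.csv
```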

On the other hand, if your dataset is not in a supported format or if you want more control over how your dataset is loaded, you can write your own dataset script.
Note that some features are not available for datasets defined using a loading script, such as the Dataset Viewer. Users also have to pass `trust_remote_code=True` to load the dataset. It is generally recommended not to rely on a loading script if possible, to benefit from all the Hub's features.
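
For example, loading a script-based dataset would look like this (a sketch; the dataset name is hypothetical):

```python
from datasets import load_dataset

# Loading a dataset that relies on a script requires opting in to run remote code
dataset = load_dataset("username/my-script-dataset", trust_remote_code=True)
```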

When loading a dataset from the Hub, all the files in the supported formats are loaded, following the [repository structure](./repository_structure).
However, if there's a dataset script, it is downloaded and executed to download and prepare the dataset instead.

1. Make sure you are in the virtual environment where you installed Datasets, and run the following command:

```
huggingface-cli login
```
2. Login using your Hugging Face Hub credentials, and create a new dataset repository:

```
huggingface-cli repo create my-cool-dataset --type dataset
```

Add the `--organization` flag to create a repository under a specific organization:

```
huggingface-cli repo create my-cool-dataset --type dataset --organization your-org-name
```

## Prepare your files

Check your directory to ensure the only files you're uploading are:

- The data files of the dataset

- The dataset card `README.md`

- (optional) `your_dataset_name.py`, your dataset loading script (not needed if your data files are already in a supported format such as CSV/JSONL/JSON/Parquet/TXT). To create a dataset script, see the [dataset script](dataset_script) page. Note that some features, such as the Dataset Viewer, are not available for datasets defined using a loading script, and users have to pass `trust_remote_code=True` to load them. It is generally recommended not to rely on a loading script if possible, to benefit from all the Hub's features.
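
For example, a dataset folder ready for upload might look like this (a hypothetical layout):

```
my-cool-dataset/
├── README.md
└── data/
    ├── train.csv
    └── test.csv
```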

## huggingface-cli upload

Use the `huggingface-cli upload` command to upload files to the Hub directly. Internally, it uses the same [`upload_file`] and [`upload_folder`] helpers described in the [Upload guide](../huggingface_hub/guides/upload). In the examples below, we will walk through the most common use cases. For a full list of available options, you can run:

```bash
>>> huggingface-cli upload --help
```
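
The same uploads can also be done from Python with the underlying helpers (a minimal sketch using `huggingface_hub`; the repo id is the example one used below):

```python
from huggingface_hub import HfApi

api = HfApi()
# Upload the contents of the current directory to the root of the dataset repo
api.upload_folder(
    folder_path=".",
    repo_id="Wauplin/my-cool-dataset",
    repo_type="dataset",
)
```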
For more general information about `huggingface-cli`, you can check the [CLI guide](../huggingface_hub/guides/cli).

### Upload an entire folder

The default usage for this command is:

```bash
# Usage: huggingface-cli upload [dataset_repo_id] --repo-type=dataset [local_path] [path_in_repo]
```

To upload the current directory at the root of the repo, use:

```bash
>>> huggingface-cli upload my-cool-dataset --repo-type=dataset . .
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main/
```

<Tip>

If the repo doesn't exist yet, it will be created automatically.

</Tip>

You can also upload a specific folder:

```bash
>>> huggingface-cli upload my-cool-dataset --repo-type=dataset ./data .
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main/
```

Finally, you can upload a folder to a specific destination on the repo:

```bash
>>> huggingface-cli upload my-cool-dataset --repo-type=dataset ./path/to/curated/data /data/train
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main/data/train
```

### Upload a single file

You can also upload a single file by setting `local_path` to point to a file on your machine. If that's the case, `path_in_repo` is optional and will default to the name of your local file:

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset --repo-type=dataset ./files/train.csv
https://huggingface.co/datasets/Wauplin/my-cool-dataset/blob/main/train.csv
```

If you want to upload a single file to a specific directory, set `path_in_repo` accordingly:

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset --repo-type=dataset ./files/train.csv /data/train.csv
https://huggingface.co/datasets/Wauplin/my-cool-dataset/blob/main/data/train.csv
```

### Upload multiple files

To upload multiple files from a folder at once without uploading the entire folder, use the `--include` and `--exclude` patterns. They can also be combined with the `--delete` option to delete files on the repo while uploading new ones. In the example below, we sync the local dataset folder with the repo by deleting remote files and uploading all CSV files:

```bash
# Sync local dataset with Hub (upload new CSV files, delete removed files)
>>> huggingface-cli upload Wauplin/my-cool-dataset --repo-type=dataset --include="/data/*.csv" --delete="*" --commit-message="Sync local dataset with Hub"
...
```

### Upload to an organization

To upload content to a repo owned by an organization instead of a personal repo, you must explicitly specify it in the `repo_id`:

```bash
>>> huggingface-cli upload MyCoolOrganization/my-cool-dataset --repo-type=dataset . .
https://huggingface.co/datasets/MyCoolOrganization/my-cool-dataset/tree/main/
```

### Upload to a specific revision

By default, files are uploaded to the `main` branch. If you want to upload files to another branch or reference, use the `--revision` option:

```bash
# Upload files to a PR
>>> huggingface-cli upload bigcode/the-stack --repo-type dataset --revision refs/pr/104 . .
...
```

**Note:** if `revision` does not exist and `--create-pr` is not set, a branch will be created automatically from the `main` branch.
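
For instance, the following would upload to a `my-new-branch` branch (a hypothetical name), creating it from `main` if it doesn't exist yet:

```bash
# Upload to a branch, creating it automatically if needed
>>> huggingface-cli upload Wauplin/my-cool-dataset --repo-type=dataset --revision my-new-branch . .
...
```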

### Upload and create a PR

If you don't have permission to push to a repo, you must open a PR and let the authors know about the changes you want to make. This can be done by setting the `--create-pr` option:

```bash
# Create a PR and upload the files to it
>>> huggingface-cli upload bigcode/the-stack --repo-type dataset --revision refs/pr/104 --create-pr . .
https://huggingface.co/datasets/bigcode/the-stack/blob/refs%2Fpr%2F104/
```

### Upload at regular intervals

In some cases, you might want to push regular updates to a repo. For example, this is useful if your dataset is growing over time and you want to upload the data folder every 10 minutes. You can do this using the `--every` option:

```bash
# Upload new logs every 10 minutes
huggingface-cli upload my-cool-dynamic-dataset data/ --repo-type=dataset --every=10
```

### Specify a commit message

Use the `--commit-message` and `--commit-description` options to set a custom message and description for your commit instead of the default ones:

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type dataset --commit-message="Version 2" --commit-description="Train size: 4321. Check Dataset Viewer for more details."
...
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main
```

### Specify a token

To upload files, you must use a token. By default, the token saved locally (using `huggingface-cli login`) will be used. If you want to authenticate explicitly, use the `--token` option:

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type=dataset --token=hf_****
...
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main
```

### Quiet mode

By default, the `huggingface-cli upload` command will be verbose. It will print details such as warning messages, information about the uploaded files, and progress bars. If you want to silence all of this, use the `--quiet` option. Only the last line (i.e. the URL to the uploaded files) is printed. This can prove useful if you want to pass the output to another command in a script.

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type=dataset --quiet
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main
```
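
Since only the URL is printed, you can, for example, capture it in a shell variable (a hypothetical sketch):

```bash
# Store the URL printed by --quiet for later use
URL=$(huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type=dataset --quiet)
echo "Dataset available at $URL"
```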

## Enjoy!

Congratulations, your dataset has now been uploaded to the Hugging Face Hub where anyone can load it in a single line of code! 🥳

```python
from datasets import load_dataset

dataset = load_dataset("namespace/your_dataset_name")
```

If your dataset is supported, it should also have a [Dataset Viewer](../hub/datasets-viewer) for everyone to explore the dataset content.

Finally, don't forget to enrich the dataset card to document your dataset and make it discoverable! Check out the [Create a dataset card](dataset_card) guide to learn more.

## Datasets on GitHub (legacy)

Datasets used to be hosted on our GitHub repository, but all datasets have now been migrated to the Hugging Face Hub.

The legacy GitHub datasets were originally added to our GitHub repository and therefore don't have a namespace on the Hub: "squad", "glue", etc., unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".

<Tip>

The distinction between a Hub dataset with or without a namespace only comes from the legacy sharing workflow. It does not involve any ranking, decision, or opinion regarding the contents of the dataset itself.

</Tip>

Those datasets are now maintained on the Hub: if you think a fix is needed, please use their "Community" tab to open a discussion or create a Pull Request.
The code of these datasets is reviewed by the Hugging Face team.