# Document usage of hfh cli instead of git #6648

Merged · 2 commits · Feb 8, 2024
document usage of hfh cli instead of git
lhoestq committed Feb 8, 2024
commit 0b4cfd693094a28dae3b08dc67fd7f2febde2a66
172 changes: 130 additions & 42 deletions docs/source/share.mdx
Dataset repositories offer features such as:
- Commit history and diffs
- Metadata for discoverability
- Dataset cards for documentation, licensing, limitations, etc.
- [Dataset Viewer](../hub/datasets-viewer)

This guide will show you how to share a dataset folder or repository that can be easily accessed by anyone.

<a id='upload_dataset_repo'></a>

You can share your dataset with the community with a dataset repository on the Hugging Face Hub.
It can also be a private dataset if you want to control who has access to it.

In a dataset repository, you can host all your data files and [configure your dataset](./repository_structure#define-your-splits-in-yaml) to define which file goes to which split.
The following formats are supported: CSV, TSV, JSON, JSON lines, text, Parquet, Arrow, SQLite, WebDataset.
Many kinds of compressed file types are also supported: GZ, BZ2, LZ4, LZMA or ZSTD.
For example, your dataset can be made of `.json.gz` files.
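
For instance, the YAML section of your dataset card can define which files go to which split (a minimal sketch, assuming CSV files under `data/`; see the linked guide for the exact syntax):

```yaml
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train.csv
  - split: test
    path: data/test.csv
```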

On the other hand, if your dataset is not in a supported format or if you want more control over how your dataset is loaded, you can write your own dataset script.
Note that some features are not available for datasets defined using a loading script, such as the Dataset Viewer. Users also have to pass `trust_remote_code=True` to load the dataset. It is generally recommended not to rely on a loading script if possible, to benefit from all the Hub's features.
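
For example, loading a script-based dataset would look like this (a sketch; the dataset name is hypothetical):

```python
from datasets import load_dataset

# Loading a dataset that relies on a script requires opting in to run remote code
dataset = load_dataset("username/my-script-dataset", trust_remote_code=True)
```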

When loading a dataset from the Hub, all the files in the supported formats are loaded, following the [repository structure](./repository_structure).
However, if there's a dataset script, it is downloaded and executed to download and prepare the dataset instead.

1. Make sure you are in the virtual environment where you installed Datasets, and run the following command:

```
huggingface-cli login
```
2. Login using your Hugging Face Hub credentials, and create a new dataset repository:

```
huggingface-cli repo create my-cool-dataset --type dataset
```

Add the `--organization` flag to create a repository under a specific organization:

```
huggingface-cli repo create my-cool-dataset --type dataset --organization your-org-name
```

## Prepare your files

Check your directory to ensure the only files you're uploading are:

- The data files of the dataset

- The dataset card `README.md`

- (optional) `your_dataset_name.py`, your dataset loading script (not needed if your data files are already in a supported format such as CSV/JSONL/JSON/Parquet/TXT). To create a dataset script, see the [dataset script](dataset_script) page. Note that some features, such as the Dataset Viewer, are not available for datasets defined using a loading script, and users have to pass `trust_remote_code=True` to load them. It is generally recommended not to rely on a loading script if possible, to benefit from all the Hub's features.
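
For example, a dataset folder ready for upload might look like this (a hypothetical layout):

```
my-cool-dataset/
├── README.md
└── data/
    ├── train.csv
    └── test.csv
```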

## huggingface-cli upload

Use the `huggingface-cli upload` command to upload files to the Hub directly. Internally, it uses the same [`upload_file`] and [`upload_folder`] helpers described in the [Upload guide](../huggingface_hub/guides/upload). In the examples below, we will walk through the most common use cases. For a full list of available options, you can run:

```bash
>>> huggingface-cli upload --help
```
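
The same uploads can also be done from Python with the underlying helpers (a minimal sketch using `huggingface_hub`; the repo id is the example one used below):

```python
from huggingface_hub import HfApi

api = HfApi()
# Upload the contents of the current directory to the root of the dataset repo
api.upload_folder(
    folder_path=".",
    repo_id="Wauplin/my-cool-dataset",
    repo_type="dataset",
)
```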
For more general information about `huggingface-cli`, you can check the [CLI guide](../huggingface_hub/guides/cli).

### Upload an entire folder

The default usage for this command is:

```bash
# Usage: huggingface-cli upload [dataset_repo_id] --repo-type=dataset [local_path] [path_in_repo]
```

To upload the current directory at the root of the repo, use:

```bash
>>> huggingface-cli upload my-cool-dataset --repo-type=dataset . .
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main/
```

<Tip>

If the repo doesn't exist yet, it will be created automatically.

</Tip>

You can also upload a specific folder:

```bash
>>> huggingface-cli upload my-cool-dataset --repo-type=dataset ./data .
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main/
```

Finally, you can upload a folder to a specific destination on the repo:

```bash
>>> huggingface-cli upload my-cool-dataset --repo-type=dataset ./path/to/curated/data /data/train
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main/data/train
```

### Upload a single file

You can also upload a single file by setting `local_path` to point to a file on your machine. If that's the case, `path_in_repo` is optional and will default to the name of your local file:

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset --repo-type=dataset ./files/train.csv
https://huggingface.co/datasets/Wauplin/my-cool-dataset/blob/main/train.csv
```

If you want to upload a single file to a specific directory, set `path_in_repo` accordingly:

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset --repo-type=dataset ./files/train.csv /data/train.csv
https://huggingface.co/datasets/Wauplin/my-cool-dataset/blob/main/data/train.csv
```

### Upload multiple files

To upload multiple files from a folder at once without uploading the entire folder, use the `--include` and `--exclude` patterns. They can also be combined with the `--delete` option to delete files on the repo while uploading new ones. In the example below, we sync the local dataset folder with the repo by deleting remote files and uploading all CSV files:

```bash
# Sync local dataset with Hub (upload new CSV files, delete removed files)
>>> huggingface-cli upload Wauplin/my-cool-dataset --repo-type=dataset --include="/data/*.csv" --delete="*" --commit-message="Sync local dataset with Hub"
...
```

### Upload to an organization

To upload content to a repo owned by an organization instead of a personal repo, you must explicitly specify it in the `repo_id`:

```bash
>>> huggingface-cli upload MyCoolOrganization/my-cool-dataset --repo-type=dataset . .
https://huggingface.co/datasets/MyCoolOrganization/my-cool-dataset/tree/main/
```

### Upload to a specific revision

By default, files are uploaded to the `main` branch. If you want to upload files to another branch or reference, use the `--revision` option:

```bash
# Upload files to a PR
>>> huggingface-cli upload bigcode/the-stack --repo-type dataset --revision refs/pr/104 . .
...
```

**Note:** if `revision` does not exist and `--create-pr` is not set, a branch will be created automatically from the `main` branch.
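
For instance, the following would upload to a `my-new-branch` branch (a hypothetical name), creating it from `main` if it doesn't exist yet:

```bash
# Upload to a branch, creating it automatically if needed
>>> huggingface-cli upload Wauplin/my-cool-dataset --repo-type=dataset --revision my-new-branch . .
...
```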

### Upload and create a PR

If you don't have permission to push to a repo, you must open a PR and let the authors know about the changes you want to make. This can be done by setting the `--create-pr` option:

```bash
# Create a PR and upload the files to it
>>> huggingface-cli upload bigcode/the-stack --repo-type dataset --revision refs/pr/104 --create-pr . .
https://huggingface.co/datasets/bigcode/the-stack/blob/refs%2Fpr%2F104/
```

### Upload at regular intervals

In some cases, you might want to push regular updates to a repo. For example, this is useful if your dataset is growing over time and you want to upload the data folder every 10 minutes. You can do this using the `--every` option:

```bash
# Upload new logs every 10 minutes
huggingface-cli upload my-cool-dynamic-dataset data/ --repo-type=dataset --every=10
```

### Specify a commit message

Use the `--commit-message` and `--commit-description` options to set a custom message and description for your commit instead of the default ones:

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type dataset --commit-message="Version 2" --commit-description="Train size: 4321. Check Dataset Viewer for more details."
...
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main
```

### Specify a token

To upload files, you must use a token. By default, the token saved locally (using `huggingface-cli login`) will be used. If you want to authenticate explicitly, use the `--token` option:

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type=dataset --token=hf_****
...
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main
```

### Quiet mode

By default, the `huggingface-cli upload` command will be verbose. It will print details such as warning messages, information about the uploaded files, and progress bars. If you want to silence all of this, use the `--quiet` option. Only the last line (i.e. the URL to the uploaded files) is printed. This can prove useful if you want to pass the output to another command in a script.

```bash
>>> huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type=dataset --quiet
https://huggingface.co/datasets/Wauplin/my-cool-dataset/tree/main
```
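
Since only the URL is printed, you can, for example, capture it in a shell variable (a hypothetical sketch):

```bash
# Store the URL printed by --quiet for later use
URL=$(huggingface-cli upload Wauplin/my-cool-dataset ./data . --repo-type=dataset --quiet)
echo "Dataset available at $URL"
```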

## Enjoy!

Congratulations, your dataset has now been uploaded to the Hugging Face Hub where anyone can load it in a single line of code! 🥳

```python
from datasets import load_dataset

dataset = load_dataset("namespace/your_dataset_name")
```

If your dataset is supported, it should also have a [Dataset Viewer](../hub/datasets-viewer) for everyone to explore the dataset content.

Finally, don't forget to enrich the dataset card to document your dataset and make it discoverable! Check out the [Create a dataset card](dataset_card) guide to learn more.

## Datasets on GitHub (legacy)

Datasets used to be hosted on our GitHub repository, but all datasets have now been migrated to the Hugging Face Hub.

The legacy GitHub datasets were originally added to our GitHub repository and therefore don't have a namespace on the Hub: "squad", "glue", etc., unlike the other datasets that are named "username/dataset_name" or "org/dataset_name".

<Tip>

The distinction between a Hub dataset with or without a namespace only comes from the legacy sharing workflow. It does not involve any ranking, decision, or opinion regarding the contents of the dataset itself.

</Tip>

Those datasets are now maintained on the Hub: if you think a fix is needed, please use their "Community" tab to open a discussion or create a Pull Request.
The code of these datasets is reviewed by the Hugging Face team.