Commit: less script docs (#6993)

lhoestq authored Jun 27, 2024
1 parent b275462 commit 6cf563f

Showing 8 changed files with 23 additions and 53 deletions.
4 changes: 1 addition & 3 deletions docs/source/audio_dataset.mdx
@@ -14,8 +14,6 @@ There are several methods for creating and sharing an audio dataset:

* Create an audio dataset repository with the `AudioFolder` builder. This is a no-code solution for quickly creating an audio dataset with several thousand audio files (see the sketch after this list).

- * Create an audio dataset by writing a loading script. This method is for advanced users and requires more effort and coding, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale audio datasets.
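
For illustration, a minimal sketch of the no-code `AudioFolder` path, assuming a local folder of audio files (the path is a placeholder):

```py
>>> from datasets import load_dataset
>>> # Each subfolder name under data_dir becomes a class label
>>> dataset = load_dataset("audiofolder", data_dir="/path/to/audio/folder")
```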


<Tip>

@@ -175,7 +173,7 @@ Some audio datasets, like those found in [Kaggle competitions](https://www.kaggl

</Tip>

- ## Loading script
+ ## (Legacy) Loading script

Write a dataset loading script to manually create a dataset.
It defines a dataset's splits and configurations, and handles downloading and generating the dataset examples.
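
Not part of the commit, but as a minimal sketch of what such a script involves (class, file, and label names are placeholders, not a real dataset):

```py
import datasets

class MyAudioDataset(datasets.GeneratorBasedBuilder):
    def _info(self):
        # Declare the schema of the generated examples
        return datasets.DatasetInfo(
            features=datasets.Features(
                {"audio": datasets.Audio(), "label": datasets.ClassLabel(names=["cat", "dog"])}
            )
        )

    def _split_generators(self, dl_manager):
        # Download the data and define one generator per split
        archive = dl_manager.download("data/train.tar.gz")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"files": dl_manager.iter_archive(archive)},
            )
        ]

    def _generate_examples(self, files):
        # Yield (key, example) pairs, one per audio file in the archive
        for idx, (path, f) in enumerate(files):
            yield idx, {"audio": {"path": path, "bytes": f.read()}, "label": "cat"}
```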
2 changes: 1 addition & 1 deletion docs/source/create_dataset.mdx
@@ -105,7 +105,7 @@ You can also create a dataset from local files by specifying the path to the dat

## Next steps

- We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, but it also gives you the most flexibility and control over how a dataset is generated. It lets you configure additional options such as creating multiple configurations within a dataset, or enabling your dataset to be streamed.
+ We didn't mention this in the tutorial, but you can also create a dataset with a loading script. A loading script is a more manual and code-intensive method for creating a dataset, and it is not well supported on Hugging Face, though in some rare cases it can still be helpful.

To learn more about how to write loading scripts, take a look at the <a href="https://huggingface.co/docs/datasets/main/en/image_dataset#loading-script"><span class="underline decoration-yellow-400 decoration-2 font-semibold">image loading script</span></a>, <a href="https://huggingface.co/docs/datasets/main/en/audio_dataset"><span class="underline decoration-pink-400 decoration-2 font-semibold">audio loading script</span></a>, and <a href="https://huggingface.co/docs/datasets/main/en/dataset_script"><span class="underline decoration-green-400 decoration-2 font-semibold">text loading script</span></a> guides.

8 changes: 0 additions & 8 deletions docs/source/filesystems.mdx
@@ -142,14 +142,6 @@ Load a dataset builder from the Hugging Face Hub (see [how to load from the Hugg
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```
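
For context, a sketch of the setup this snippet assumes (the bucket name and credentials are placeholders):

```py
>>> from datasets import load_dataset_builder
>>> storage_options = {"key": "aws_access_key_id", "secret": "aws_secret_access_key"}  # s3fs credentials
>>> output_dir = "s3://my-bucket/imdb"
>>> builder = load_dataset_builder("imdb")
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```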

- Load a dataset builder using a loading script (see [how to load a local loading script](./loading#local-loading-script)):
-
- ```py
- >>> output_dir = "s3://my-bucket/imdb"
- >>> builder = load_dataset_builder("path/to/local/loading_script/loading_script.py")
- >>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
- ```

Use your own data files (see [how to load local and remote files](./loading#local-and-remote-files)):

```py
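>>> # (hypothetical continuation; the diff view truncates the original example here)
>>> data_files = {"train": "path/to/train.csv"}
>>> builder = load_dataset_builder("csv", data_files=data_files)
>>> builder.download_and_prepare(output_dir, storage_options=storage_options, file_format="parquet")
```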
5 changes: 3 additions & 2 deletions docs/source/image_dataset.mdx
@@ -2,8 +2,9 @@

There are two methods for creating and sharing an image dataset. This guide will show you how to:

+ * Create an image dataset from local files in Python with [`Dataset.push_to_hub`]. This is an easy way that requires only a few steps (see the sketch after this list).

* Create an image dataset with `ImageFolder` and some metadata. This is a no-code solution for quickly creating an image dataset with several thousand images.
- * Create an image dataset by writing a loading script. This method is a bit more involved, but you have greater flexibility over how a dataset is defined, downloaded, and generated which can be useful for more complex or large scale image datasets.
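
As a minimal sketch of the `push_to_hub` path (the folder path and repository name are placeholders):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("imagefolder", data_dir="/path/to/images")
>>> dataset.push_to_hub("my-username/my-image-dataset")
```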

<Tip>

@@ -188,7 +189,7 @@ Load your WebDataset and it will create one column per file suffix (here "jpg" an
{"bbox": [[302.0, 109.0, 73.0, 52.0]], "categories": [0]}
```

- ## Loading script
+ ## (Legacy) Loading script

Write a dataset loading script to share a dataset. It defines a dataset's splits and configurations, and handles downloading and generating a dataset. The script is located in the same folder or repository as the dataset and should have the same name.
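
For instance, a script-based repository might be laid out like this (a sketch; the names are placeholders):

```
my_dataset/
├── my_dataset.py
└── data/
    └── train.tar.gz
```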

6 changes: 1 addition & 5 deletions docs/source/load_hub.mdx
@@ -102,11 +102,7 @@ Then load the configuration you want:

## Remote code

- Certain datasets repositories contain a loading script with the Python code used to generate the dataset.
- Those datasets are generally exported to Parquet by Hugging Face, so that 🤗 Datasets can load the dataset fast and without running a loading script.
-
- Even if a Parquet export is not available, you can still use any dataset with Python code in its repository with `load_dataset`.
- All files and code uploaded to the Hub are scanned for malware (refer to the Hub security documentation for more information), but you should still review the dataset loading scripts and authors to avoid executing malicious code on your machine. You should set `trust_remote_code=True` to use a dataset with a loading script, or you will get an error:
+ Certain dataset repositories contain a loading script with the Python code used to generate the dataset. All files and code uploaded to the Hub are scanned for malware (refer to the Hub security documentation for more information), but you should still review the dataset loading scripts and authors to avoid executing malicious code on your machine. You should set `trust_remote_code=True` to use a dataset with a loading script, or you will get an error:

```py
>>> from datasets import get_dataset_config_names, get_dataset_split_names, load_dataset
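>>> # (hypothetical continuation; the diff view truncates the original example here)
>>> dataset = load_dataset("allenai/c4", "en", split="train", trust_remote_code=True)
```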
47 changes: 16 additions & 31 deletions docs/source/loading.mdx
@@ -4,12 +4,12 @@ Your data can be stored in various places; they can be on your local machine's d

This guide will show you how to load a dataset from:

- - The Hub without a dataset loading script
- - Local loading script
+ - The Hugging Face Hub
- Local files
- In-memory data
- Offline
- A specific slice of a split
+ - Local loading script (legacy)

For more details specific to loading other dataset modalities, take a look at the <a class="underline decoration-pink-400 decoration-2 font-semibold" href="./audio_load">load audio dataset guide</a>, the <a class="underline decoration-yellow-400 decoration-2 font-semibold" href="./image_load">load image dataset guide</a>, or the <a class="underline decoration-green-400 decoration-2 font-semibold" href="./nlp_load">load text dataset guide</a>.

@@ -73,35 +73,6 @@ The `split` parameter can also map a data file to a specific split:
>>> c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation")
```
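
For context, the `data_files` mapping used above might look like this (a sketch; the original definition is hidden by the truncated diff):

```py
>>> data_files = {"validation": "en/c4-validation.*.json.gz"}
>>> c4_validation = load_dataset("allenai/c4", data_files=data_files, split="validation")
```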

- ## Local loading script
-
- You may have a 🤗 Datasets loading script locally on your computer. In this case, load the dataset by passing one of the following paths to [`load_dataset`]:
-
- - The local path to the loading script file.
- - The local path to the directory containing the loading script file (only if the script file has the same name as the directory).
-
- Pass `trust_remote_code=True` to allow 🤗 Datasets to execute the loading script:
-
- ```py
- >>> dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train", trust_remote_code=True)
- >>> dataset = load_dataset("path/to/local/loading_script", split="train", trust_remote_code=True) # equivalent because the file has the same name as the directory
- ```
-
- ### Edit loading script
-
- You can also edit a loading script from the Hub to add your own modifications. Download the dataset repository locally so any data files referenced by a relative path in the loading script can be loaded:
-
- ```bash
- git clone https://huggingface.co/datasets/eli5
- ```
-
- Make your edits to the loading script and then load it by passing its local path to [`~datasets.load_dataset`]:
-
- ```py
- >>> from datasets import load_dataset
- >>> eli5 = load_dataset("path/to/local/eli5")
- ```

## Local and remote files

Datasets can be loaded from local files stored on your computer and from remote files. The datasets are most likely stored as a `csv`, `json`, `txt` or `parquet` file. The [`load_dataset`] function can load each of these file types.
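
For example, a minimal sketch of loading a local CSV file (the filename is a placeholder):

```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("csv", data_files="my_file.csv")
```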
@@ -534,3 +505,17 @@ In some instances, you may be simultaneously running multiple independent distri
>>> from datasets import load_metric
>>> metric = load_metric('glue', 'mrpc', num_process=num_process, process_id=process_id, experiment_id="My_experiment_10")
```

+ ## (Legacy) Local loading script
+
+ You may have a 🤗 Datasets loading script locally on your computer. In this case, load the dataset by passing one of the following paths to [`load_dataset`]:
+
+ - The local path to the loading script file.
+ - The local path to the directory containing the loading script file (only if the script file has the same name as the directory).
+
+ Pass `trust_remote_code=True` to allow 🤗 Datasets to execute the loading script:
+
+ ```py
+ >>> dataset = load_dataset("path/to/local/loading_script/loading_script.py", split="train", trust_remote_code=True)
+ >>> dataset = load_dataset("path/to/local/loading_script", split="train", trust_remote_code=True) # equivalent because the file has the same name as the directory
+ ```
2 changes: 0 additions & 2 deletions docs/source/repository_structure.mdx
@@ -277,5 +277,3 @@ my_dataset_repository/
├── shard_0.csv
└── shard_1.csv
```
- For more flexibility over how to load and generate a dataset, you can also write a [dataset loading script](./dataset_script).
2 changes: 1 addition & 1 deletion docs/source/upload_dataset.mdx
@@ -129,6 +129,6 @@ From here, you can go on to:

- Learn more about how to use 🤗 Datasets other functions to [process your dataset](process).
- [Stream large datasets](stream) without downloading them locally.
- - [Define your dataset splits and configurations](repository_structure) or [loading script](dataset_script) and share your dataset with the community.
+ - [Define your dataset splits and configurations](repository_structure) and share your dataset with the community.

If you have any questions about 🤗 Datasets, feel free to join and ask the community on our [forum](https://discuss.huggingface.co/c/datasets/10).
