Revert "rename master to main"
This reverts commit effe74e.
lhoestq committed Jul 6, 2022
1 parent effe74e commit ede72d3
Showing 24 changed files with 76 additions and 289 deletions.
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/add-dataset.md
@@ -14,4 +14,4 @@ assignees: ''
- **Data:** *link to the Github repository or current dataset location*
- **Motivation:** *what are some good reasons to have this dataset*

Instructions to add a new dataset can be found [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
Instructions to add a new dataset can be found [here](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
4 changes: 2 additions & 2 deletions .github/workflows/benchmarks.yaml
@@ -22,7 +22,7 @@ jobs:
dvc repro --force
git fetch --prune
dvc metrics diff --show-json main > report.json
dvc metrics diff --show-json master > report.json
python ./benchmarks/format.py report.json report.md
@@ -35,7 +35,7 @@ jobs:
dvc repro --force
git fetch --prune
dvc metrics diff --show-json main > report.json
dvc metrics diff --show-json master > report.json
python ./benchmarks/format.py report.json report.md
2 changes: 1 addition & 1 deletion .github/workflows/build_documentation.yml
@@ -3,7 +3,7 @@ name: Build documentation
on:
push:
branches:
- main
- master
- doc-builder*
- v*-release

2 changes: 1 addition & 1 deletion .github/workflows/test-audio.yml
@@ -3,7 +3,7 @@ name: Test audio
on:
pull_request:
branches:
- main
- master

jobs:
test:
2 changes: 1 addition & 1 deletion .github/workflows/update-hub-repositories.yaml
@@ -3,7 +3,7 @@ name: Update Hub repositories
on:
push:
branches:
- main
- master

jobs:
update-hub-repositories:
48 changes: 24 additions & 24 deletions ADD_NEW_DATASET.md
@@ -70,11 +70,11 @@ You are now ready to start the process of adding the dataset. We will create the

```bash
git fetch upstream
git rebase upstream/main
git rebase upstream/master
git checkout -b a-descriptive-name-for-my-changes
```

**Do not** work on the `main` branch.
**Do not** work on the `master` branch.

3. Create your dataset folder under `datasets/<your_dataset_name>`:

@@ -96,9 +96,9 @@ You are now ready to start the process of adding the dataset. We will create the
- Download/open the data to see what it looks like
- While you explore and read about the dataset, you can complete some sections of the dataset card (the online form or the one you have just created at `./datasets/<your_dataset_name>/README.md`). You can just copy the information you come across in your reading into the relevant sections of the dataset card (typically in `Dataset Description`, `Dataset Structure` and `Dataset Creation`).

If you need more information on a section of the dataset card, a detailed guide is in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/main/templates/README_guide.md.
If you need more information on a section of the dataset card, a detailed guide is in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/master/templates/README_guide.md.

There is also a (very detailed) example here: https://github.com/huggingface/datasets/tree/main/datasets/eli5.
There is also a (very detailed) example here: https://github.com/huggingface/datasets/tree/master/datasets/eli5.

Don't spend too much time completing the dataset card; just copy what you find when exploring the dataset documentation. If you can't find all the information, that's ok: you can always spend more time completing the dataset card while we are reviewing your PR (see below), and the dataset card will remain open for everybody to complete afterwards. If you don't know what to write in a section, just leave the `[More Information Needed]` text.

@@ -109,31 +109,31 @@ Now let's get coding :-)

The dataset script is the main entry point to load and process the data. It is a python script under `datasets/<your_dataset_name>/<your_dataset_name>.py`.

There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/main/about_dataset_load.html).
There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/master/about_dataset_load.html).

Note on naming: the dataset class should be camel case, while the dataset short_name is its snake case equivalent (ex: `class BookCorpus` for the dataset `book_corpus`).

To add a new dataset, you can start from the empty template which is [in the `templates` folder](https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py):
To add a new dataset, you can start from the empty template which is [in the `templates` folder](https://github.com/huggingface/datasets/blob/master/templates/new_dataset_script.py):

```bash
cp ./templates/new_dataset_script.py ./datasets/<your_dataset_name>/<your_dataset_name>.py
```

And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/main/dataset_script.html).
And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/master/dataset_script.html).

You can also start (or copy any part) from one of the datasets of reference listed below. The main criterion for choosing among these reference datasets is the format of the data files (JSON/JSONL/CSV/TSV/text) and whether or not you need several configurations (see above explanations on configurations). Feel free to reuse any parts of the following examples and adapt them to your case:

- question-answering: [squad](https://github.com/huggingface/datasets/blob/main/datasets/squad/squad.py) (original data are in json)
- natural language inference: [snli](https://github.com/huggingface/datasets/blob/main/datasets/snli/snli.py) (original data are in text files with tab separated columns)
- POS/NER: [conll2003](https://github.com/huggingface/datasets/blob/main/datasets/conll2003/conll2003.py) (original data are in text files with one token per line)
- sentiment analysis: [allocine](https://github.com/huggingface/datasets/blob/main/datasets/allocine/allocine.py) (original data are in jsonl files)
- text classification: [ag_news](https://github.com/huggingface/datasets/blob/main/datasets/ag_news/ag_news.py) (original data are in csv files)
- translation: [flores](https://github.com/huggingface/datasets/blob/main/datasets/flores/flores.py) (original data come from text files - one per language)
- summarization: [billsum](https://github.com/huggingface/datasets/blob/main/datasets/billsum/billsum.py) (original data are in json files)
- benchmark: [glue](https://github.com/huggingface/datasets/blob/main/datasets/glue/glue.py) (original data are various formats)
- multilingual: [xquad](https://github.com/huggingface/datasets/blob/main/datasets/xquad/xquad.py) (original data are in json)
- multitask: [matinf](https://github.com/huggingface/datasets/blob/main/datasets/matinf/matinf.py) (original data need to be downloaded by the user because it requires authentication)
- speech recognition: [librispeech_asr](https://github.com/huggingface/datasets/blob/main/datasets/librispeech_asr/librispeech_asr.py) (original data is in .flac format)
- question-answering: [squad](https://github.com/huggingface/datasets/blob/master/datasets/squad/squad.py) (original data are in json)
- natural language inference: [snli](https://github.com/huggingface/datasets/blob/master/datasets/snli/snli.py) (original data are in text files with tab separated columns)
- POS/NER: [conll2003](https://github.com/huggingface/datasets/blob/master/datasets/conll2003/conll2003.py) (original data are in text files with one token per line)
- sentiment analysis: [allocine](https://github.com/huggingface/datasets/blob/master/datasets/allocine/allocine.py) (original data are in jsonl files)
- text classification: [ag_news](https://github.com/huggingface/datasets/blob/master/datasets/ag_news/ag_news.py) (original data are in csv files)
- translation: [flores](https://github.com/huggingface/datasets/blob/master/datasets/flores/flores.py) (original data come from text files - one per language)
- summarization: [billsum](https://github.com/huggingface/datasets/blob/master/datasets/billsum/billsum.py) (original data are in json files)
- benchmark: [glue](https://github.com/huggingface/datasets/blob/master/datasets/glue/glue.py) (original data are various formats)
- multilingual: [xquad](https://github.com/huggingface/datasets/blob/master/datasets/xquad/xquad.py) (original data are in json)
- multitask: [matinf](https://github.com/huggingface/datasets/blob/master/datasets/matinf/matinf.py) (original data need to be downloaded by the user because it requires authentication)
- speech recognition: [librispeech_asr](https://github.com/huggingface/datasets/blob/master/datasets/librispeech_asr/librispeech_asr.py) (original data is in .flac format)
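
For orientation, here is a minimal sketch of what such a dataset script can look like; the class name, URL and feature names below are placeholders rather than anything from this repository, so start from the template above rather than from this sketch:

```python
# Minimal, illustrative skeleton of a dataset loading script.
# The class name, data URL and features are placeholders.
import json

import datasets


class MyDataset(datasets.GeneratorBasedBuilder):
    """Hypothetical dataset `my_dataset` with one text field and one label."""

    def _info(self):
        return datasets.DatasetInfo(
            description="A placeholder description.",
            features=datasets.Features(
                {"text": datasets.Value("string"), "label": datasets.Value("int32")}
            ),
        )

    def _split_generators(self, dl_manager):
        # Placeholder URL; a real script points to the actual data files.
        data_path = dl_manager.download("https://example.com/train.jsonl")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"filepath": data_path}
            )
        ]

    def _generate_examples(self, filepath):
        # Yield (key, example) pairs, one per line of a JSON Lines file.
        with open(filepath, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                record = json.loads(line)
                yield idx, {"text": record["text"], "label": record["label"]}
```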

While you are developing the dataset script, you can test it locally by opening a Python interpreter and running the script (the script is dynamically updated each time you modify it):
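
For illustration only, a local test along those lines could look like the following sketch (the `my_dataset` name and path are placeholders):

```python
from datasets import load_dataset

# Load the work-in-progress script straight from its local folder (placeholder path).
dataset = load_dataset("./datasets/my_dataset/my_dataset.py", split="train")
print(dataset[0])  # inspect the first generated example
```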

@@ -286,18 +286,18 @@ Here are the steps to open the Pull Request on the main repo.
It is a good idea to sync your copy of the code with the original
repository regularly. This way you can quickly account for changes:

- If you haven't pushed your branch yet, you can rebase on upstream/main:
- If you haven't pushed your branch yet, you can rebase on upstream/master:

```bash
git fetch upstream
git rebase upstream/main
git rebase upstream/master
```
- If you have already pushed your branch, do not rebase but merge instead:

```bash
git fetch upstream
git merge upstream/main
git merge upstream/master
```

Push the changes to your account using:
@@ -334,7 +334,7 @@ Creating the dataset card goes in two steps:

- **Very important as well:** On the right side of the tagging app, you will also find an expandable section called **Show Markdown Data Fields**. This gives you a starting point for the description of the fields in your dataset: you should paste it into the **Data Fields** section of the [online form](https://huggingface.co/datasets/card-creator/) (or your local README.md), then modify the description as needed. Briefly describe each of the fields and indicate if they have a default value (e.g. when there is no label). If the data has span indices, describe their attributes (character level or word level, contiguous or not, etc). If the dataset contains example IDs, state whether they have an inherent meaning, such as a mapping to other datasets or pointing to relationships between data points.

Example from the [ELI5 card](https://github.com/huggingface/datasets/tree/main/datasets/eli5#data-fields):
Example from the [ELI5 card](https://github.com/huggingface/datasets/tree/master/datasets/eli5#data-fields):

Data Fields:
- q_id: a string question identifier for each example, corresponding to its ID in the Pushshift.io Reddit submission dumps.
@@ -343,9 +343,9 @@ Creating the dataset card goes in two steps:
- title_urls: list of the extracted URLs, the nth element of the list was replaced by URL_n


- **Very nice to have but optional for now:** Complete all you can find in the dataset card using the detailed instructions for completing it, which are in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/main/templates/README_guide.md.
- **Very nice to have but optional for now:** Complete all you can find in the dataset card using the detailed instructions for completing it, which are in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/master/templates/README_guide.md.

Here is a completed example: https://github.com/huggingface/datasets/tree/main/datasets/eli5 for inspiration
Here is a completed example: https://github.com/huggingface/datasets/tree/master/datasets/eli5 for inspiration

If you don't know what to write in a field and can't find it, write: `[More Information Needed]`

12 changes: 6 additions & 6 deletions CONTRIBUTING.md
@@ -41,7 +41,7 @@ If you would like to work on any of the open Issues:
git checkout -b a-descriptive-name-for-my-changes
```

**do not** work on the `main` branch.
**do not** work on the `master` branch.

4. Set up a development environment by running the following command in a virtual environment:

@@ -73,7 +73,7 @@ If you would like to work on any of the open Issues:

```bash
git fetch upstream
git rebase upstream/main
git rebase upstream/master
```

Push the changes to your account using:
@@ -97,15 +97,15 @@ Improving the documentation of datasets is an ever increasing effort and we invi

If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To do so, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:

* a [template](https://github.com/huggingface/datasets/blob/main/templates/README.md)
* a [guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) describing what information should go into each of the paragraphs
* and if you need inspiration, we recommend looking through a [completed example](https://github.com/huggingface/datasets/blob/main/datasets/eli5/README.md)
* a [template](https://github.com/huggingface/datasets/blob/master/templates/README.md)
* a [guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md) describing what information should go into each of the paragraphs
* and if you need inspiration, we recommend looking through a [completed example](https://github.com/huggingface/datasets/blob/master/datasets/eli5/README.md)

Note that datasets that are outside of a namespace (`squad`, `imagenet-1k`, etc.) are maintained on GitHub. In this case you have to open a Pull request on GitHub to edit the file at `datasets/<dataset-name>/README.md`.

If you are a **dataset author**... you know what to do, it is your dataset after all ;) ! We would especially appreciate if you could help us fill in information about the process of creating the dataset, and take a moment to reflect on its social impact and possible limitations if you haven't already done so in the dataset paper or in another data statement.

If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.
If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.

Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works [Datasheets for Datasets](https://arxiv.org/abs/1803.09010) and [Data Statements for NLP](https://www.aclweb.org/anthology/Q18-1041/).

16 changes: 8 additions & 8 deletions README.md
@@ -1,13 +1,13 @@
<p align="center">
<br>
<img src="https://raw.githubusercontent.com/huggingface/datasets/main/docs/source/imgs/datasets_logo_name.jpg" width="400"/>
<img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/datasets_logo_name.jpg" width="400"/>
<br>
<p>
<p align="center">
<a href="https://circleci.com/gh/huggingface/datasets">
<img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/datasets/main">
<img alt="Build" src="https://img.shields.io/circleci/build/github/huggingface/datasets/master">
</a>
<a href="/huggingface/datasets/blob/main/LICENSE">
<a href="/huggingface/datasets/blob/master/LICENSE">
<img alt="GitHub" src="https://img.shields.io/github/license/huggingface/datasets.svg?color=blue">
</a>
<a href="https://huggingface.co/docs/datasets/index.html">
@@ -30,12 +30,12 @@
- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (text datasets in 467 languages and dialects, image datasets, audio datasets, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX),
- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text/PNG/JPEG/etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.
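
As a small illustration of the two commands mentioned above (the processing function here is a made-up example, not part of the library):

```python
from datasets import load_dataset

# One-line dataloader: download and prepare the SQuAD dataset from the Hub.
squad_dataset = load_dataset("squad")

# Made-up processing function: lower-case every question.
def process_example(example):
    example["question"] = example["question"].lower()
    return example

# Apply the function to every example of every split.
processed_dataset = squad_dataset.map(process_example)
print(processed_dataset["train"][0]["question"])
```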

[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)
[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)

[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md)
[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md)

<h3 align="center">
<a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/main/docs/source/imgs/course_banner.png"></a>
<a href="https://hf.co/course"><img src="https://raw.githubusercontent.com/huggingface/datasets/master/docs/source/imgs/course_banner.png"></a>
</h3>

🤗 Datasets also provides access to +40 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
@@ -127,15 +127,15 @@ For more details on using the library, check the quick start page in the documen
- etc.

Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)

# Add a new dataset to the Hub

We have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets).

You will find [the step-by-step guide here](https://huggingface.co/docs/datasets/share.html) to add a dataset on the Hub.

However if you prefer to add your dataset in this repository, you can find the guide [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
However if you prefer to add your dataset in this repository, you can find the guide [here](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).

# Main differences between 🤗 Datasets and `tfds`


1 comment on commit ede72d3

@github-actions

Show benchmarks

PyArrow==6.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009278 / 0.011353 (-0.002075) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003874 / 0.011008 (-0.007135) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.029012 / 0.038508 (-0.009496) |
| read_batch_unformated after write_array2d | 0.039792 / 0.023109 (0.016682) |
| read_batch_unformated after write_flattened_sequence | 0.353517 / 0.275898 (0.077619) |
| read_batch_unformated after write_nested_sequence | 0.373614 / 0.323480 (0.050134) |
| read_col_formatted_as_numpy after write_array2d | 0.006818 / 0.007986 (-0.001168) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.004659 / 0.004328 (0.000331) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007565 / 0.004250 (0.003315) |
| read_col_unformated after write_array2d | 0.040843 / 0.037052 (0.003790) |
| read_col_unformated after write_flattened_sequence | 0.330388 / 0.258489 (0.071899) |
| read_col_unformated after write_nested_sequence | 0.386540 / 0.293841 (0.092699) |
| read_formatted_as_numpy after write_array2d | 0.036897 / 0.128546 (-0.091649) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011467 / 0.075646 (-0.064179) |
| read_formatted_as_numpy after write_nested_sequence | 0.286472 / 0.419271 (-0.132800) |
| read_unformated after write_array2d | 0.060831 / 0.043533 (0.017298) |
| read_unformated after write_flattened_sequence | 0.354210 / 0.255139 (0.099072) |
| read_unformated after write_nested_sequence | 0.357301 / 0.283200 (0.074101) |
| write_array2d | 0.102029 / 0.141683 (-0.039654) |
| write_flattened_sequence | 1.929737 / 1.452155 (0.477582) |
| write_nested_sequence | 2.086166 / 1.492716 (0.593449) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.258579 / 0.018006 (0.240573) |
| get_batch_of_1024_rows | 0.482867 / 0.000490 (0.482378) |
| get_first_row | 0.006658 / 0.000200 (0.006458) |
| get_last_row | 0.000409 / 0.000054 (0.000355) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.026763 / 0.037411 (-0.010648) |
| shard | 0.110967 / 0.014526 (0.096441) |
| shuffle | 0.137656 / 0.176557 (-0.038901) |
| sort | 0.170854 / 0.737135 (-0.566281) |
| train_test_split | 0.124305 / 0.296338 (-0.172033) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.495506 / 0.215209 (0.280297) |
| read 50000 | 5.119108 / 2.077655 (3.041453) |
| read_batch 50000 10 | 2.062583 / 1.504120 (0.558463) |
| read_batch 50000 100 | 1.862076 / 1.541195 (0.320882) |
| read_batch 50000 1000 | 1.983570 / 1.468490 (0.515080) |
| read_formatted numpy 5000 | 0.554383 / 4.584777 (-4.030394) |
| read_formatted pandas 5000 | 5.661170 / 3.745712 (1.915458) |
| read_formatted tensorflow 5000 | 4.319681 / 5.269862 (-0.950181) |
| read_formatted torch 5000 | 1.172900 / 4.565676 (-3.392776) |
| read_formatted_batch numpy 5000 10 | 0.056132 / 0.424275 (-0.368143) |
| read_formatted_batch numpy 5000 1000 | 0.013871 / 0.007607 (0.006264) |
| shuffled read 5000 | 0.615071 / 0.226044 (0.389027) |
| shuffled read 50000 | 6.286393 / 2.268929 (4.017464) |
| shuffled read_batch 50000 10 | 2.637594 / 55.444624 (-52.807030) |
| shuffled read_batch 50000 100 | 2.163547 / 6.876477 (-4.712930) |
| shuffled read_batch 50000 1000 | 2.255839 / 2.142072 (0.113767) |
| shuffled read_formatted numpy 5000 | 0.652435 / 4.805227 (-4.152792) |
| shuffled read_formatted_batch numpy 5000 10 | 0.143621 / 6.500664 (-6.357043) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.069522 / 0.075469 (-0.005947) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.959207 / 1.841788 (0.117419) |
| map fast-tokenizer batched | 16.459880 / 8.074308 (8.385572) |
| map identity | 31.751482 / 10.191392 (21.560090) |
| map identity batched | 1.036997 / 0.680424 (0.356573) |
| map no-op batched | 0.617796 / 0.534201 (0.083595) |
| map no-op batched numpy | 0.597665 / 0.579283 (0.018382) |
| map no-op batched pandas | 0.611550 / 0.434364 (0.177186) |
| map no-op batched pytorch | 0.382491 / 0.540337 (-0.157846) |
| map no-op batched tensorflow | 0.380467 / 1.386936 (-1.006469) |
PyArrow==latest
Show updated benchmarks!

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.009166 / 0.011353 (-0.002186) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.004588 / 0.011008 (-0.006420) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.033051 / 0.038508 (-0.005457) |
| read_batch_unformated after write_array2d | 0.039340 / 0.023109 (0.016230) |
| read_batch_unformated after write_flattened_sequence | 0.377456 / 0.275898 (0.101558) |
| read_batch_unformated after write_nested_sequence | 0.379128 / 0.323480 (0.055648) |
| read_col_formatted_as_numpy after write_array2d | 0.007236 / 0.007986 (-0.000749) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.006369 / 0.004328 (0.002040) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.007579 / 0.004250 (0.003329) |
| read_col_unformated after write_array2d | 0.041861 / 0.037052 (0.004808) |
| read_col_unformated after write_flattened_sequence | 0.342397 / 0.258489 (0.083908) |
| read_col_unformated after write_nested_sequence | 0.395079 / 0.293841 (0.101238) |
| read_formatted_as_numpy after write_array2d | 0.037363 / 0.128546 (-0.091183) |
| read_formatted_as_numpy after write_flattened_sequence | 0.011736 / 0.075646 (-0.063910) |
| read_formatted_as_numpy after write_nested_sequence | 0.294042 / 0.419271 (-0.125229) |
| read_unformated after write_array2d | 0.059536 / 0.043533 (0.016003) |
| read_unformated after write_flattened_sequence | 0.359440 / 0.255139 (0.104301) |
| read_unformated after write_nested_sequence | 0.374117 / 0.283200 (0.090918) |
| write_array2d | 0.103857 / 0.141683 (-0.037825) |
| write_flattened_sequence | 2.225532 / 1.452155 (0.773378) |
| write_nested_sequence | 2.218929 / 1.492716 (0.726213) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.278084 / 0.018006 (0.260078) |
| get_batch_of_1024_rows | 0.482631 / 0.000490 (0.482141) |
| get_first_row | 0.004787 / 0.000200 (0.004587) |
| get_last_row | 0.000117 / 0.000054 (0.000063) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.030540 / 0.037411 (-0.006871) |
| shard | 0.115398 / 0.014526 (0.100872) |
| shuffle | 0.136990 / 0.176557 (-0.039567) |
| sort | 0.182177 / 0.737135 (-0.554958) |
| train_test_split | 0.130014 / 0.296338 (-0.166324) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.504206 / 0.215209 (0.288997) |
| read 50000 | 4.970518 / 2.077655 (2.892863) |
| read_batch 50000 10 | 2.238757 / 1.504120 (0.734637) |
| read_batch 50000 100 | 1.970100 / 1.541195 (0.428905) |
| read_batch 50000 1000 | 1.972331 / 1.468490 (0.503841) |
| read_formatted numpy 5000 | 0.511871 / 4.584777 (-4.072905) |
| read_formatted pandas 5000 | 5.629850 / 3.745712 (1.884138) |
| read_formatted tensorflow 5000 | 2.502521 / 5.269862 (-2.767341) |
| read_formatted torch 5000 | 1.091287 / 4.565676 (-3.474390) |
| read_formatted_batch numpy 5000 10 | 0.066987 / 0.424275 (-0.357288) |
| read_formatted_batch numpy 5000 1000 | 0.013360 / 0.007607 (0.005753) |
| shuffled read 5000 | 0.636841 / 0.226044 (0.410797) |
| shuffled read 50000 | 6.462021 / 2.268929 (4.193093) |
| shuffled read_batch 50000 10 | 2.789005 / 55.444624 (-52.655620) |
| shuffled read_batch 50000 100 | 2.327184 / 6.876477 (-4.549293) |
| shuffled read_batch 50000 1000 | 2.453714 / 2.142072 (0.311642) |
| shuffled read_formatted numpy 5000 | 0.680986 / 4.805227 (-4.124241) |
| shuffled read_formatted_batch numpy 5000 10 | 0.151534 / 6.500664 (-6.349130) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.072953 / 0.075469 (-0.002516) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.947621 / 1.841788 (0.105833) |
| map fast-tokenizer batched | 16.365816 / 8.074308 (8.291508) |
| map identity | 30.416824 / 10.191392 (20.225432) |
| map identity batched | 0.987724 / 0.680424 (0.307300) |
| map no-op batched | 0.655695 / 0.534201 (0.121494) |
| map no-op batched numpy | 0.581627 / 0.579283 (0.002344) |
| map no-op batched pandas | 0.626496 / 0.434364 (0.192132) |
| map no-op batched pytorch | 0.397519 / 0.540337 (-0.142818) |
| map no-op batched tensorflow | 0.396124 / 1.386936 (-0.990812) |

