diff --git a/.github/ISSUE_TEMPLATE/add-dataset.md b/.github/ISSUE_TEMPLATE/add-dataset.md
index dd5038bd456..23505dc3359 100644
--- a/.github/ISSUE_TEMPLATE/add-dataset.md
+++ b/.github/ISSUE_TEMPLATE/add-dataset.md
@@ -14,4 +14,4 @@ assignees: ''
 - **Data:** *link to the Github repository or current dataset location*
 - **Motivation:** *what are some good reasons to have this dataset*
 
-Instructions to add a new dataset can be found [here](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
+Instructions to add a new dataset can be found [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
diff --git a/.github/workflows/benchmarks.yaml b/.github/workflows/benchmarks.yaml
index 81c524ff8c5..8403f4ffadb 100644
--- a/.github/workflows/benchmarks.yaml
+++ b/.github/workflows/benchmarks.yaml
@@ -22,7 +22,7 @@ jobs:
           dvc repro --force
           git fetch --prune
-          dvc metrics diff --show-json master > report.json
+          dvc metrics diff --show-json main > report.json
 
           python ./benchmarks/format.py report.json report.md
@@ -35,7 +35,7 @@ jobs:
           dvc repro --force
           git fetch --prune
-          dvc metrics diff --show-json master > report.json
+          dvc metrics diff --show-json main > report.json
 
           python ./benchmarks/format.py report.json report.md
diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml
index 37ac93e3730..8dac4dc0d51 100644
--- a/.github/workflows/build_documentation.yml
+++ b/.github/workflows/build_documentation.yml
@@ -3,7 +3,7 @@ name: Build documentation
 on:
   push:
     branches:
-      - master
+      - main
       - doc-builder*
       - v*-release
diff --git a/.github/workflows/test-audio.yml b/.github/workflows/test-audio.yml
index 68e0b8f0b3b..05d543b634b 100644
--- a/.github/workflows/test-audio.yml
+++ b/.github/workflows/test-audio.yml
@@ -3,7 +3,7 @@ name: Test audio
 on:
   pull_request:
     branches:
-      - master
+      - main
 
 jobs:
   test:
diff --git a/.github/workflows/update-hub-repositories.yaml b/.github/workflows/update-hub-repositories.yaml
index a3f912f96f2..3132d16c9f1 100644
--- a/.github/workflows/update-hub-repositories.yaml
+++ b/.github/workflows/update-hub-repositories.yaml
@@ -3,7 +3,7 @@ name: Update Hub repositories
 on:
   push:
     branches:
-      - master
+      - main
 
 jobs:
   update-hub-repositories:
diff --git a/ADD_NEW_DATASET.md b/ADD_NEW_DATASET.md
index 45e9c808bb9..589787e98db 100644
--- a/ADD_NEW_DATASET.md
+++ b/ADD_NEW_DATASET.md
@@ -70,11 +70,11 @@ You are now ready to start the process of adding the dataset. We will create the
 
    ```bash
    git fetch upstream
-   git rebase upstream/master
+   git rebase upstream/main
    git checkout -b a-descriptive-name-for-my-changes
    ```
 
-   **Do not** work on the `master` branch.
+   **Do not** work on the `main` branch.
 
 3. Create your dataset folder under `datasets/<your_dataset_name>`:
@@ -96,9 +96,9 @@ You are now ready to start the process of adding the dataset. We will create the
   - Download/open the data to see how it looks like
   - While you explore and read about the dataset, you can complete some sections of the dataset card (the online form or the one you have just created at `./datasets/<your_dataset_name>/README.md`). You can just copy the information you meet in your readings in the relevant sections of the dataset card (typically in `Dataset Description`, `Dataset Structure` and `Dataset Creation`).
 
-   If you need more information on a section of the dataset card, a detailed guide is in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/master/templates/README_guide.md.
+   If you need more information on a section of the dataset card, a detailed guide is in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/main/templates/README_guide.md.
 
-   There is a also a (very detailed) example here: https://github.com/huggingface/datasets/tree/master/datasets/eli5.
+   There is also a (very detailed) example here: https://github.com/huggingface/datasets/tree/main/datasets/eli5.
 
   Don't spend too much time completing the dataset card, just copy what you find when exploring the dataset documentation. If you can't find all the information it's ok. You can always spend more time completing the dataset card while we are reviewing your PR (see below) and the dataset card will be open for everybody to complete them afterwards. If you don't know what to write in a section, just leave the `[More Information Needed]` text.
@@ -109,31 +109,31 @@ Now let's get coding :-)
 
 The dataset script is the main entry point to load and process the data. It is a python script under `datasets/<your_dataset_name>/<your_dataset_name>.py`.
 
-There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/master/about_dataset_load.html).
+There is a detailed explanation on how the library and scripts are organized [here](https://huggingface.co/docs/datasets/main/about_dataset_load.html).
 
 Note on naming: the dataset class should be camel case, while the dataset short_name is its snake case equivalent (ex: `class BookCorpus` for the dataset `book_corpus`).
 
-To add a new dataset, you can start from the empty template which is [in the `templates` folder](https://github.com/huggingface/datasets/blob/master/templates/new_dataset_script.py):
+To add a new dataset, you can start from the empty template which is [in the `templates` folder](https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py):
 
 ```bash
 cp ./templates/new_dataset_script.py ./datasets/<your_dataset_name>/<your_dataset_name>.py
 ```
 
-And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/master/dataset_script.html).
+And then go progressively through all the `TODO` in the template 🙂. If it's your first dataset addition and you are a bit lost among the information to fill in, you can take some time to read the [detailed explanation here](https://huggingface.co/docs/datasets/main/dataset_script.html).
 
 You can also start (or copy any part) from one of the datasets of reference listed below. The main criteria for choosing among these reference dataset is the format of the data files (JSON/JSONL/CSV/TSV/text) and whether you need or don't need several configurations (see above explanations on configurations).
 
 Feel free to reuse any parts of the following examples and adapt them to your case:
-- question-answering: [squad](https://github.com/huggingface/datasets/blob/master/datasets/squad/squad.py) (original data are in json)
-- natural language inference: [snli](https://github.com/huggingface/datasets/blob/master/datasets/snli/snli.py) (original data are in text files with tab separated columns)
-- POS/NER: [conll2003](https://github.com/huggingface/datasets/blob/master/datasets/conll2003/conll2003.py) (original data are in text files with one token per line)
-- sentiment analysis: [allocine](https://github.com/huggingface/datasets/blob/master/datasets/allocine/allocine.py) (original data are in jsonl files)
-- text classification: [ag_news](https://github.com/huggingface/datasets/blob/master/datasets/ag_news/ag_news.py) (original data are in csv files)
-- translation: [flores](https://github.com/huggingface/datasets/blob/master/datasets/flores/flores.py) (original data come from text files - one per language)
-- summarization: [billsum](https://github.com/huggingface/datasets/blob/master/datasets/billsum/billsum.py) (original data are in json files)
-- benchmark: [glue](https://github.com/huggingface/datasets/blob/master/datasets/glue/glue.py) (original data are various formats)
-- multilingual: [xquad](https://github.com/huggingface/datasets/blob/master/datasets/xquad/xquad.py) (original data are in json)
-- multitask: [matinf](https://github.com/huggingface/datasets/blob/master/datasets/matinf/matinf.py) (original data need to be downloaded by the user because it requires authentication)
-- speech recognition: [librispeech_asr](https://github.com/huggingface/datasets/blob/master/datasets/librispeech_asr/librispeech_asr.py) (original data is in .flac format)
+- question-answering: [squad](https://github.com/huggingface/datasets/blob/main/datasets/squad/squad.py) (original data are in json)
+- natural language inference: [snli](https://github.com/huggingface/datasets/blob/main/datasets/snli/snli.py) (original data are in text files with tab separated columns)
+- POS/NER: [conll2003](https://github.com/huggingface/datasets/blob/main/datasets/conll2003/conll2003.py) (original data are in text files with one token per line)
+- sentiment analysis: [allocine](https://github.com/huggingface/datasets/blob/main/datasets/allocine/allocine.py) (original data are in jsonl files)
+- text classification: [ag_news](https://github.com/huggingface/datasets/blob/main/datasets/ag_news/ag_news.py) (original data are in csv files)
+- translation: [flores](https://github.com/huggingface/datasets/blob/main/datasets/flores/flores.py) (original data come from text files - one per language)
+- summarization: [billsum](https://github.com/huggingface/datasets/blob/main/datasets/billsum/billsum.py) (original data are in json files)
+- benchmark: [glue](https://github.com/huggingface/datasets/blob/main/datasets/glue/glue.py) (original data are various formats)
+- multilingual: [xquad](https://github.com/huggingface/datasets/blob/main/datasets/xquad/xquad.py) (original data are in json)
+- multitask: [matinf](https://github.com/huggingface/datasets/blob/main/datasets/matinf/matinf.py) (original data need to be downloaded by the user because it requires authentication)
+- speech recognition: [librispeech_asr](https://github.com/huggingface/datasets/blob/main/datasets/librispeech_asr/librispeech_asr.py) (original data is in .flac format)
 
 While you are developing the dataset script you can list test it by opening a python interpreter and running the script (the script is dynamically updated each time you modify it):
@@ -286,18 +286,18 @@ Here are the step to open the Pull-Request on the main repo.
 
    It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
 
-   - If you haven't pushed your branch yet, you can rebase on upstream/master:
+   - If you haven't pushed your branch yet, you can rebase on upstream/main:
 
      ```bash
      git fetch upstream
-     git rebase upstream/master
+     git rebase upstream/main
      ```
 
   - If you have already pushed your branch, do not rebase but merge instead:
 
     ```bash
    git fetch upstream
-    git merge upstream/master
+    git merge upstream/main
    ```
 
   Push the changes to your account using:
@@ -334,7 +334,7 @@ Creating the dataset card goes in two steps:
 
   - **Very important as well:** On the right side of the tagging app, you will also find an expandable section called **Show Markdown Data Fields**. This gives you a starting point for the description of the fields in your dataset: you should paste it into the **Data Fields** section of the [online form](https://huggingface.co/datasets/card-creator/) (or your local README.md), then modify the description as needed. Briefly describe each of the fields and indicate if they have a default value (e.g. when there is no label). If the data has span indices, describe their attributes (character level or word level, contiguous or not, etc). If the datasets contains example IDs, state whether they have an inherent meaning, such as a mapping to other datasets or pointing to relationships between data points.
 
-   Example from the [ELI5 card](https://github.com/huggingface/datasets/tree/master/datasets/eli5#data-fields):
+   Example from the [ELI5 card](https://github.com/huggingface/datasets/tree/main/datasets/eli5#data-fields):
 
   Data Fields:
   - q_id: a string question identifier for each example, corresponding to its ID in the Pushshift.io Reddit submission dumps.
   - title_urls: list of the extracted URLs, the nth element of the list was replaced by URL_n
 
@@ -343,9 +343,9 @@ Creating the dataset card goes in two steps:
 
-   - **Very nice to have but optional for now:** Complete all you can find in the dataset card using the detailed instructions for completed it which are in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/master/templates/README_guide.md.
+   - **Very nice to have but optional for now:** Complete all you can find in the dataset card using the detailed instructions for completing it which are in the `README_guide.md` here: https://github.com/huggingface/datasets/blob/main/templates/README_guide.md.
 
-   Here is a completed example: https://github.com/huggingface/datasets/tree/master/datasets/eli5 for inspiration
+   Here is a completed example: https://github.com/huggingface/datasets/tree/main/datasets/eli5 for inspiration
 
   If you don't know what to write in a field and can find it, write: `[More Information Needed]`
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 66b4ddea61e..89e787256bf 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -41,7 +41,7 @@ If you would like to work on any of the open Issues:
    git checkout -b a-descriptive-name-for-my-changes
    ```
 
-   **do not** work on the `master` branch.
+   **do not** work on the `main` branch.
 
 4. Set up a development environment by running the following command in a virtual environment:
@@ -73,7 +73,7 @@ If you would like to work on any of the open Issues:
 
    ```bash
    git fetch upstream
-   git rebase upstream/master
+   git rebase upstream/main
   ```
 
   Push the changes to your account using:
@@ -97,15 +97,15 @@ Improving the documentation of datasets is an ever increasing effort and we invi
 
 If you see that a dataset card is missing information that you are in a position to provide (as an author of the dataset or as an experienced user), the best thing you can do is to open a Pull Request on the Hugging Face Hub. To to do, go to the "Files and versions" tab of the dataset page and edit the `README.md` file. We provide:
-* a [template](https://github.com/huggingface/datasets/blob/master/templates/README.md)
-* a [guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md) describing what information should go into each of the paragraphs
-* and if you need inspiration, we recommend looking through a [completed example](https://github.com/huggingface/datasets/blob/master/datasets/eli5/README.md)
+* a [template](https://github.com/huggingface/datasets/blob/main/templates/README.md)
+* a [guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) describing what information should go into each of the paragraphs
+* and if you need inspiration, we recommend looking through a [completed example](https://github.com/huggingface/datasets/blob/main/datasets/eli5/README.md)
 
 Note that datasets that are outside of a namespace (`squad`, `imagenet-1k`, etc.) are maintained on GitHub. In this case you have to open a Pull request on GitHub to edit the file at `datasets/<dataset_name>/README.md`.
 
 If you are a **dataset author**... you know what to do, it is your dataset after all ;) ! We would especially appreciate if you could help us fill in information about the process of creating the dataset, and take a moment to reflect on its social impact and possible limitations if you haven't already done so in the dataset paper or in another data statement.
 
-If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.
+If you are a **user of a dataset**, the main source of information should be the dataset paper if it is available: we recommend pulling information from there into the relevant paragraphs of the template. We also eagerly welcome discussions on the [Considerations for Using the Data](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md#considerations-for-using-the-data) based on existing scholarship or personal experience that would benefit the whole community.
 
 Finally, if you want more information on the how and why of dataset cards, we strongly recommend reading the foundational works [Datasheets for Datasets](https://arxiv.org/abs/1803.09010) and [Data Statements for NLP](https://www.aclweb.org/anthology/Q18-1041/).
diff --git a/README.md b/README.md
index b7f7299fe98..f652bbc727d 100644
--- a/README.md
+++ b/README.md
@@ -1,13 +1,13 @@


- +

- Build + Build - + GitHub @@ -30,12 +30,12 @@ - **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https://img.shields.io/endpoint?url=https://huggingface.co/api/shields/datasets&color=brightgreen) major public datasets (text datasets in 467 languages and dialects, image datasets, audio datasets, etc.) provided on the [HuggingFace Datasets Hub](https://huggingface.co/datasets). With a simple command like `squad_dataset = load_dataset("squad")`, get any of these datasets ready to use in a dataloader for training/evaluating a ML model (Numpy/Pandas/PyTorch/TensorFlow/JAX), - **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the above public datasets as well as your own local datasets in CSV/JSON/text/PNG/JPEG/etc. With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training. -[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb) +[🎓 **Documentation**](https://huggingface.co/docs/datasets/) [🕹 **Colab tutorial**](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb) -[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md) +[🔎 **Find a dataset in the Hub**](https://huggingface.co/datasets) [🌟 **Add a new dataset to the Hub**](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md)

- +

 🤗 Datasets also provides access to +40 evaluation metrics and is designed to let the community easily add and share new datasets and evaluation metrics.
@@ -127,7 +127,7 @@ For more details on using the library, check the quick start page in the documen
 - etc.
 
 Another introduction to 🤗 Datasets is the tutorial on Google Colab here:
-[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/master/notebooks/Overview.ipynb)
+[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/datasets/blob/main/notebooks/Overview.ipynb)
 
 # Add a new dataset to the Hub
 
@@ -135,7 +135,7 @@ We have a very detailed step-by-step guide to add a new dataset to the ![number
 
 You will find [the step-by-step guide here](https://huggingface.co/docs/datasets/share.html) to add a dataset on the Hub.
 
-However if you prefer to add your dataset in this repository, you can find the guide [here](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md).
+However if you prefer to add your dataset in this repository, you can find the guide [here](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md).
 
 # Main differences between 🤗 Datasets and `tfds`
diff --git a/c.py b/c.py
new file mode 100644
index 00000000000..9ce195cb096
--- /dev/null
+++ b/c.py
@@ -0,0 +1,213 @@
+# coding=utf-8
+# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Special thanks to @lvwerra -- we reference his repository here: https://huggingface.co/datasets/lvwerra/github-code/
+"""Code Clippy Github Code dataset."""
+
+import os
+
+
+import datasets
+from huggingface_hub import HfApi, HfFolder
+from datasets.data_files import DataFilesDict
+
+import gzip
+import json
+
+_REPO_NAME = "CodedotAI/code_clippy_github"
+
+_LANG_TO_EXTENSION = {
+    "C": [".c"],
+    "C#": [".cs"],
+    "C++": [".cpp"],
+    "CSS": [".css"],
+    "Dart" : [".dart"],
+    "GO": [".go"],
+    "HTML":[".html"],
+    "Java": [".java"],
+    "JavaScript": [".js"],
+    "Jupyter Notebooks (Python)": [".ipynb"],
+    "Kotlin" : [".kt"],
+    "Lisp" : [".lisp"],
+    "Matlab" : [".m"],
+    "PHP": [".php"],
+    "Perl": [".pl"],
+    "Python": [".py"],
+    "R" : [".r"],
+    "Ruby": [".rb"],
+    "Rust": [".rs"],
+    "SQL": [".sql"],
+    "Shell": [".sh"],
+    "Swift" : [".swift"],
+    "TypeScript": [".ts"],
+}
+
+_LICENSES = [
+    'mit',
+    'apache-2.0',
+    'gpl-2.0',
+    'gpl-3.0',
+    'bsd-3-clause',
+    'bsd-2-clause',
+    'unlicense',
+    'agpl-3.0',
+    'lgpl-3.0',
+    'cc0-1.0',
+    'epl-1.0',
+    'lgpl-2.1',
+    'mpl-2.0',
+    'isc',
+    'artistic-2.0'
+ ]
+
+_DESCRIPTION = """\
+The Code Clippy dataset consists of various public codebases from GitHub in 22 programming languages with 23 extensions \
+totalling about 16 TB of data when uncompressed. The dataset was created from the public GitHub dataset on Google BigQuery.
+"""
+
+_HOMEPAGE = "https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code/"
+
+
+_EXTENSION_TO_LANG = {}
+for lang in _LANG_TO_EXTENSION:
+    for extension in _LANG_TO_EXTENSION[lang]:
+        _EXTENSION_TO_LANG[extension] = lang
+
+
+
+_LANG_CONFIGS = ["all"] + list(_LANG_TO_EXTENSION.keys())
+_LICENSE_CONFIGS = ["all"] + _LICENSES
+
+class CodeClippyGithubConfig(datasets.BuilderConfig):
+    """BuilderConfig for the Code Clippy Github dataset."""
+
+    def __init__(self, *args, languages=["all"], licenses=["all"], **kwargs):
+        """BuilderConfig for the Code Clippy Github dataset.
+        Args:
+            languages (:obj:`List[str]`): List of languages to load.
+            licenses (:obj:`List[str]`): List of licenses to load.
+            **kwargs: keyword arguments forwarded to super.
+        """
+        super().__init__(
+            *args,
+            name="+".join(languages)+"-"+"+".join(licenses),
+            **kwargs,
+        )
+
+        languages = set(languages)
+        licenses = set(licenses)
+
+        assert all([language in _LANG_CONFIGS for language in languages]), f"Language not in {_LANG_CONFIGS}."
+        assert all([license in _LICENSE_CONFIGS for license in licenses]), f"License not in {_LICENSE_CONFIGS}."
+
+        if "all" in languages:
+            assert len(languages)==1, "Passed 'all' together with other languages."
+            self.filter_languages = False
+        else:
+            self.filter_languages = True
+
+        if "all" in licenses:
+            assert len(licenses)==1, "Passed 'all' together with other licenses."
+            self.filter_licenses = False
+        else:
+            self.filter_licenses = True
+
+        self.languages = set(languages)
+        self.licenses = set(licenses)
+
+
+
+class CodeClippyGithub(datasets.GeneratorBasedBuilder):
+    """Code Clippy Github dataset."""
+
+    VERSION = datasets.Version("1.0.0")
+
+    BUILDER_CONFIG_CLASS = CodeClippyGithubConfig
+    BUILDER_CONFIGS = [CodeClippyGithubConfig(languages=[lang], licenses=[license]) for lang in _LANG_CONFIGS
+                       for license in _LICENSE_CONFIGS]
+    DEFAULT_CONFIG_NAME = "all-all"
+
+
+    def _info(self):
+        return datasets.DatasetInfo(
+            description=_DESCRIPTION,
+            features=datasets.Features({"code_text": datasets.Value("string"),
+                                        "repo_name": datasets.Value("string"),
+                                        "file_path": datasets.Value("string"),
+                                        "language": datasets.Value("string"),
+                                        "license": datasets.Value("string"),
+                                        "size": datasets.Value("int32")}),
+            supervised_keys=None,
+            homepage=_HOMEPAGE,
+            license="Multiple: see the 'license' field of each sample.",
+
+        )
+
+    def _split_generators(self, dl_manager):
+
+        hfh_dataset_info = HfApi(datasets.config.HF_ENDPOINT).dataset_info(
+            _REPO_NAME,
+            timeout=100.0,
+        )
+
+        patterns = datasets.data_files.get_patterns_in_dataset_repository(hfh_dataset_info)
+        data_files = datasets.data_files.DataFilesDict.from_hf_repo(
+            patterns,
+            dataset_info=hfh_dataset_info,
+        )
+
+        files = dl_manager.download_and_extract(data_files["train"])
+        return [
+            datasets.SplitGenerator(
+                name=datasets.Split.TRAIN,
+                gen_kwargs={
+                    "files": files,
+                },
+            ),
+        ]
+
+    def _generate_examples(self, files):
+        key = 0
+        for file_idx, file in enumerate(files):
+            with gzip.open(file, "rb") as f:
+
+                uncompressed_data = f.readlines()
+
+                for batch_idx, code_base in enumerate(uncompressed_data):
+                    j_dict = json.loads(code_base.decode('utf-8'))
+
+
+
+                    lang = lang_from_name(j_dict['path'])
+                    license = j_dict["license"]
+
+                    if self.config.filter_languages and lang not in self.config.languages:
+                        continue
+                    if self.config.filter_licenses and license not in self.config.licenses:
+                        continue
+                    # TODO: Add more features like header comments, filename, and other features useful in a prompt.
+                    yield key, {"code_text": j_dict['content'],
+                                "repo_name": j_dict['repo_name'],
+                                "file_path": j_dict['path'],
+                                "license": license,
+                                "language": lang,
+                                "size": int(j_dict['f0_'])}
+                    key += 1
+
+
+def lang_from_name(name):
+    for extension in _EXTENSION_TO_LANG:
+        if name.endswith(extension):
+            return _EXTENSION_TO_LANG[extension]
\ No newline at end of file
diff --git a/datasets/arabic_speech_corpus/README.md b/datasets/arabic_speech_corpus/README.md
index 7e93a15f8fc..14e14262549 100644
--- a/datasets/arabic_speech_corpus/README.md
+++ b/datasets/arabic_speech_corpus/README.md
@@ -26,7 +26,7 @@ train-eval-index:
   train_split: train
   eval_split: test
   col_mapping:
-    file: path
+    audio: audio
     text: text
   metrics:
   - type: wer
diff --git a/docs/source/_config.py b/docs/source/_config.py
index 05d2fc8898f..01c7debd75a 100644
--- a/docs/source/_config.py
+++ b/docs/source/_config.py
@@ -1,2 +1,2 @@
-default_branch_name = "master"
+default_branch_name = "main"
 version_prefix = ""
diff --git a/docs/source/dataset_card.mdx b/docs/source/dataset_card.mdx
index a7529627ae0..c2b4b7ad4b6 100644
--- a/docs/source/dataset_card.mdx
+++ b/docs/source/dataset_card.mdx
@@ -5,7 +5,7 @@ This idea is inspired by the Model Cards proposed by [Mitchell, 2018](https://ar
 
 Dataset cards help users understand the contents of the dataset, context for how the dataset should be used, how it was created, and considerations for using the dataset.
 
 This guide shows you how to create your own Dataset card.
 
-1. Create a new Dataset card by opening the [online card creator](https://huggingface.co/datasets/card-creator/), or manually copying the template from [here](https://raw.githubusercontent.com/huggingface/datasets/master/templates/README.md).
+1. Create a new Dataset card by opening the [online card creator](https://huggingface.co/datasets/card-creator/), or manually copying the template from [here](https://raw.githubusercontent.com/huggingface/datasets/main/templates/README.md).
 
 2. Next, you need to generate structured tags. The tags help users discover your dataset on the Hub. Create the tags with the [online Datasets Tagging app](https://huggingface.co/spaces/huggingface/datasets-tagging).
 
@@ -15,7 +15,7 @@ This guide shows you how to create your own Dataset card.
 
 5. Expand the **Show Markdown Data Fields** section, paste it into the **Data Fields** section under **Data Structure** on the online form (or your local `README.md`). Modify the descriptions as needed, and briefly describe each of the fields.
 
-6. Fill out the Dataset card to the best of your ability. Refer to the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/master/templates/README_guide.md) for more detailed information about each section of the card. For fields you are unable to complete, you can write **[More Information Needed]**.
+6. Fill out the Dataset card to the best of your ability. Refer to the [Dataset Card Creation Guide](https://github.com/huggingface/datasets/blob/main/templates/README_guide.md) for more detailed information about each section of the card. For fields you are unable to complete, you can write **[More Information Needed]**.
 
 7. Once you are done filling out the card with the online form, click the **Export** button to download the Dataset card. Place it in the same folder as your dataset.
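For orientation (not part of the diff itself): a minimal sketch of how the new loading script added above could be smoke-tested locally, assuming it stays at the repository root under the name `c.py`. The config name `"Python-mit"` is one of those the script derives from its language and license lists; note that `_split_generators` downloads every train shard listed in the `CodedotAI/code_clippy_github` repo before filtering by language and license.

```python
from datasets import load_dataset

# "Python-mit" follows the config naming scheme in CodeClippyGithubConfig:
# "+".join(languages) + "-" + "+".join(licenses).
# Beware: the script fetches all gzipped JSON-lines shards from the Hub repo,
# regardless of the chosen config, and filters only at generation time.
ds = load_dataset("./c.py", "Python-mit", split="train")

# Each example carries the fields declared in _info().
example = ds[0]
print(example["repo_name"], example["file_path"], example["language"], example["license"], example["size"])
```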
diff --git a/docs/source/dataset_script.mdx b/docs/source/dataset_script.mdx index dbc81fd706e..5750c027bd7 100644 --- a/docs/source/dataset_script.mdx +++ b/docs/source/dataset_script.mdx @@ -30,7 +30,7 @@ Open the [SQuAD dataset loading script](https://huggingface.co/datasets/squad/bl -To help you get started, try beginning with the dataset loading script [template](https://github.com/huggingface/datasets/blob/master/templates/new_dataset_script.py)! +To help you get started, try beginning with the dataset loading script [template](https://github.com/huggingface/datasets/blob/main/templates/new_dataset_script.py)! @@ -92,7 +92,7 @@ def _info(self): In some cases, your dataset may have multiple configurations. For example, the [SuperGLUE](https://huggingface.co/datasets/super_glue) dataset is a collection of 5 datasets designed to evaluate language understanding tasks. 🤗 Datasets provides [`BuilderConfig`] which allows you to create different configurations for the user to select from. -Let's study the [SuperGLUE loading script](https://github.com/huggingface/datasets/blob/master/datasets/super_glue/super_glue.py) to see how you can define several configurations. +Let's study the [SuperGLUE loading script](https://github.com/huggingface/datasets/blob/main/datasets/super_glue/super_glue.py) to see how you can define several configurations. 1. Create a [`BuilderConfig`] subclass with attributes about your dataset. These attributes can be the features of your dataset, label classes, and a URL to the data files. diff --git a/docs/source/how_to_metrics.mdx b/docs/source/how_to_metrics.mdx index 25b05c73443..fa2fa6a49ea 100644 --- a/docs/source/how_to_metrics.mdx +++ b/docs/source/how_to_metrics.mdx @@ -76,11 +76,11 @@ Args: Write a metric loading script to use your own custom metric (or one that is not on the Hub). Then you can load it as usual with [`load_metric`]. -To help you get started, open the [SQuAD metric loading script](https://github.com/huggingface/datasets/blob/master/metrics/squad/squad.py) and follow along. +To help you get started, open the [SQuAD metric loading script](https://github.com/huggingface/datasets/blob/main/metrics/squad/squad.py) and follow along. -Get jump started with our metric loading script [template](https://github.com/huggingface/datasets/blob/master/templates/new_metric_script.py)! +Get jump started with our metric loading script [template](https://github.com/huggingface/datasets/blob/main/templates/new_metric_script.py)! @@ -126,7 +126,7 @@ class Squad(datasets.Metric): ### Download metric files -If your metric needs to download, or retrieve local files, you will need to use the [`Metric._download_and_prepare`] method. For this example, let's examine the [BLEURT metric loading script](https://github.com/huggingface/datasets/blob/master/metrics/bleurt/bleurt.py). +If your metric needs to download, or retrieve local files, you will need to use the [`Metric._download_and_prepare`] method. For this example, let's examine the [BLEURT metric loading script](https://github.com/huggingface/datasets/blob/main/metrics/bleurt/bleurt.py). 1. Provide a dictionary of URLs that point to the metric files: @@ -171,7 +171,7 @@ def _download_and_prepare(self, dl_manager): ### Compute score -[`DatasetBuilder._compute`] provides the actual instructions for how to compute a metric given the predictions and references. Now let's take a look at the [GLUE metric loading script](https://github.com/huggingface/datasets/blob/master/metrics/glue/glue.py). 
+[`DatasetBuilder._compute`] provides the actual instructions for how to compute a metric given the predictions and references. Now let's take a look at the [GLUE metric loading script](https://github.com/huggingface/datasets/blob/main/metrics/glue/glue.py). 1. Provide the functions for [`DatasetBuilder._compute`] to calculate your metric: diff --git a/docs/source/share.mdx b/docs/source/share.mdx index cf0eb88926b..f020fcea829 100644 --- a/docs/source/share.mdx +++ b/docs/source/share.mdx @@ -160,4 +160,4 @@ The code of these datasets are reviewed by the Hugging Face team, and they requi In some rare cases it makes more sense to open a PR on GitHub. For example when you are not the author of the dataset and there is no clear organization / namespace that you can put the dataset under. -For more info, please take a look at the documentation on [How to add a new dataset in the huggingface/datasets repository](https://github.com/huggingface/datasets/blob/master/ADD_NEW_DATASET.md). +For more info, please take a look at the documentation on [How to add a new dataset in the huggingface/datasets repository](https://github.com/huggingface/datasets/blob/main/ADD_NEW_DATASET.md). diff --git a/docs/source/upload_dataset.mdx b/docs/source/upload_dataset.mdx index 6500ffffb38..db5b221ea9d 100644 --- a/docs/source/upload_dataset.mdx +++ b/docs/source/upload_dataset.mdx @@ -49,7 +49,7 @@ Adding a Dataset card is super valuable for helping users find your dataset and -2. Feel free to copy this Dataset card [template](https://raw.githubusercontent.com/huggingface/datasets/master/templates/README.md) to help you fill out all the relevant fields. +2. Feel free to copy this Dataset card [template](https://raw.githubusercontent.com/huggingface/datasets/main/templates/README.md) to help you fill out all the relevant fields. 3. The Dataset card uses structured tags to help users discover your dataset on the Hub. Use the [Dataset Tagger](https://huggingface.co/spaces/huggingface/datasets-tagging) to help you generate the appropriate tags. @@ -87,7 +87,7 @@ pip install huggingface_hub huggingface-cli login ``` -3. Use the [`push_to_hub()`](https://huggingface.co/docs/datasets/master/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub) function to help you add, commit, and push a file to your repository: +3. Use the [`push_to_hub()`](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.DatasetDict.push_to_hub) function to help you add, commit, and push a file to your repository: ```py >>> from datasets import load_dataset diff --git a/metrics/exact_match/README.md b/metrics/exact_match/README.md index 38c9eebfbc3..47df0ba8943 100644 --- a/metrics/exact_match/README.md +++ b/metrics/exact_match/README.md @@ -100,4 +100,4 @@ This metric is limited in that it outputs the same score for something that is c ## Citation ## Further References -- Also used in the [SQuAD metric](https://github.com/huggingface/datasets/tree/master/metrics/squad) +- Also used in the [SQuAD metric](https://github.com/huggingface/datasets/tree/main/metrics/squad) diff --git a/notebooks/Overview.ipynb b/notebooks/Overview.ipynb index f12d0a51c21..135a851148a 100644 --- a/notebooks/Overview.ipynb +++ b/notebooks/Overview.ipynb @@ -10,7 +10,7 @@ } }, "source": [ - "\"Open" + "\"Open" ] }, { diff --git a/setup.py b/setup.py index 03110370c6f..ecc20a4a66b 100644 --- a/setup.py +++ b/setup.py @@ -24,7 +24,7 @@ 2. Commit these changes: "git commit -m 'Release: VERSION'" 3. 
Add a tag in git to mark the release: "git tag VERSION -m 'Add tag VERSION for pypi'" - Push the tag to remote: git push --tags origin master + Push the tag to remote: git push --tags origin main 4. Build both the sources and the wheel. Do not change anything in setup.py between creating the wheel and the source distribution (obviously). diff --git a/src/datasets/__init__.py b/src/datasets/__init__.py index 6ac94b2ec64..bddbfde6aad 100644 --- a/src/datasets/__init__.py +++ b/src/datasets/__init__.py @@ -29,7 +29,7 @@ "If you are running this in a Google Colab, you should probably just restart the runtime to use the right version of `pyarrow`." ) -SCRIPTS_VERSION = "master" if version.parse(__version__).is_devrelease else __version__ +SCRIPTS_VERSION = "main" if version.parse(__version__).is_devrelease else __version__ del pyarrow del version diff --git a/src/datasets/inspect.py b/src/datasets/inspect.py index c3363ab15ed..2e7e702e766 100644 --- a/src/datasets/inspect.py +++ b/src/datasets/inspect.py @@ -195,7 +195,7 @@ def get_dataset_infos( If specified, the dataset module will be loaded from the datasets repository at this version. By default: - it is set to the local version of the lib. - - it will also try to load it from the master branch if it's not available at the local version of the lib. + - it will also try to load it from the main branch if it's not available at the local version of the lib. Specifying a version that is different from your local version of the lib might cause compatibility issues. download_config (:class:`DownloadConfig`, optional): Specific download configuration parameters. download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode. @@ -256,7 +256,7 @@ def get_dataset_config_names( If specified, the dataset module will be loaded from the datasets repository at this version. By default: - it is set to the local version of the lib. - - it will also try to load it from the master branch if it's not available at the local version of the lib. + - it will also try to load it from the main branch if it's not available at the local version of the lib. Specifying a version that is different from your local version of the lib might cause compatibility issues. download_config (:class:`DownloadConfig`, optional): Specific download configuration parameters. download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode. @@ -325,7 +325,7 @@ def get_dataset_config_info( revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load: - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib. - You can specify a different version from your local version of the lib (e.g. "master" or "1.2.0") but it might cause compatibility issues. + You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues. - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch. You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository. use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. 
@@ -386,7 +386,7 @@ def get_dataset_split_names( revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load: - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib. - You can specify a different version from your local version of the lib (e.g. "master" or "1.2.0") but it might cause compatibility issues. + You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues. - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch. You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository. use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. diff --git a/src/datasets/load.py b/src/datasets/load.py index 2c0cc7bfdf8..cb56f8b8167 100644 --- a/src/datasets/load.py +++ b/src/datasets/load.py @@ -476,11 +476,11 @@ def get_module(self) -> DatasetModule: if revision is not None or os.getenv("HF_SCRIPTS_VERSION", None) is not None: raise else: - revision = "master" + revision = "main" local_path = self.download_loading_script(revision) logger.warning( f"Couldn't find a directory or a dataset named '{self.name}' in this version. " - f"It was picked from the master branch on github instead." + f"It was picked from the main branch on github instead." ) dataset_infos_path = self.download_dataset_infos_file(revision) imports = get_imports(local_path) @@ -546,11 +546,11 @@ def get_module(self) -> MetricModule: if revision is not None or os.getenv("HF_SCRIPTS_VERSION", None) is not None: raise else: - revision = "master" + revision = "main" local_path = self.download_loading_script(revision) logger.warning( f"Couldn't find a directory or a metric named '{self.name}' in this version. " - f"It was picked from the master branch on github instead." + f"It was picked from the main branch on github instead." ) imports = get_imports(local_path) local_imports = _download_additional_modules( @@ -1095,7 +1095,7 @@ def dataset_module_factory( revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load: - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib. - You can specify a different version from your local version of the lib (e.g. "master" or "1.2.0") but it might cause compatibility issues. + You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues. - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch. You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository. download_config (:class:`DownloadConfig`, optional): Specific download configuration parameters. @@ -1278,7 +1278,7 @@ def metric_module_factory( If specified, the module will be loaded from the datasets repository at this version. By default: - it is set to the local version of the lib. - - it will also try to load it from the master branch if it's not available at the local version of the lib. 
+ - it will also try to load it from the main branch if it's not available at the local version of the lib. Specifying a version that is different from your local version of the lib might cause compatibility issues. download_config (:class:`DownloadConfig`, optional): Specific download configuration parameters. download_mode (:class:`DownloadMode`, default ``REUSE_DATASET_IF_EXISTS``): Download/generate mode. @@ -1464,7 +1464,7 @@ def load_dataset_builder( revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load: - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib. - You can specify a different version from your local version of the lib (e.g. "master" or "1.2.0") but it might cause compatibility issues. + You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues. - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch. You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository. use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. @@ -1576,7 +1576,7 @@ def load_dataset( Dataset scripts are small python scripts that define dataset builders. They define the citation, info and format of the dataset, contain the path or URL to the original data files and the code to load examples from the original data files. - You can find some of the scripts here: https://github.com/huggingface/datasets/tree/master/datasets + You can find some of the scripts here: https://github.com/huggingface/datasets/tree/main/datasets You can find the complete list of datasets in the Datasets Hub at https://huggingface.co/datasets 2. Run the dataset script which will: @@ -1635,7 +1635,7 @@ def load_dataset( revision (:class:`~utils.Version` or :obj:`str`, optional): Version of the dataset script to load: - For datasets in the `huggingface/datasets` library on GitHub like "squad", the default version of the module is the local version of the lib. - You can specify a different version from your local version of the lib (e.g. "master" or "1.2.0") but it might cause compatibility issues. + You can specify a different version from your local version of the lib (e.g. "main" or "1.2.0") but it might cause compatibility issues. - For community datasets like "lhoestq/squad" that have their own git repository on the Datasets Hub, the default version "main" corresponds to the "main" branch. You can specify a different version that the default "main" by using a commit sha or a git tag of the dataset repository. use_auth_token (``str`` or :obj:`bool`, optional): Optional string or boolean to use as Bearer token for remote files on the Datasets Hub. 
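The `revision` docstrings modified above all describe the same behavior; as a quick illustration (the dataset name and pinned tag are just the examples the docstrings themselves use, not something introduced by this diff):

```python
from datasets import load_dataset

# Default: the loading script matching the installed library version is used;
# if it is missing at that version, the library falls back to the `main` branch.
squad = load_dataset("squad", split="validation")

# Pinning an explicit script version (e.g. a git tag of huggingface/datasets) is still possible,
# but mixing it with a different local library version may cause compatibility issues.
squad_pinned = load_dataset("squad", split="validation", revision="1.2.0")
```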
diff --git a/tests/test_dataset_cards.py b/tests/test_dataset_cards.py index dab7ce2b7e7..cd56c9f32c9 100644 --- a/tests/test_dataset_cards.py +++ b/tests/test_dataset_cards.py @@ -32,7 +32,7 @@ def get_changed_datasets(repo_path: Path) -> List[Path]: - diff_output = check_output(["git", "diff", "--name-only", "origin/master...HEAD"], cwd=repo_path) + diff_output = check_output(["git", "diff", "--name-only", "origin/main...HEAD"], cwd=repo_path) changed_files = [Path(repo_path, f) for f in diff_output.decode().splitlines()] datasets_dir_path = repo_path / "datasets" diff --git a/tests/test_load.py b/tests/test_load.py index 30d5ba5b691..712f9b55e93 100644 --- a/tests/test_load.py +++ b/tests/test_load.py @@ -576,7 +576,7 @@ def test_load_dataset_from_github(self): with self.assertRaises(FileNotFoundError) as context: datasets.load_dataset("_dummy") self.assertIn( - "https://raw.githubusercontent.com/huggingface/datasets/master/datasets/_dummy/_dummy.py", + "https://raw.githubusercontent.com/huggingface/datasets/main/datasets/_dummy/_dummy.py", str(context.exception), ) with self.assertRaises(FileNotFoundError) as context:
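Relatedly, contributors with an existing clone will want to retarget it once the default branch is renamed; a possible sequence, assuming the `upstream` remote used in the contributing guide above, is:

```bash
# Rename the local branch and track the renamed default branch on the upstream remote.
git branch -m master main
git fetch upstream
git branch -u upstream/main main
git remote set-head upstream -a
```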