Follow this guide to add a dataset to TFDS.
Check our list of datasets first to make sure the dataset you want hasn't already been added.
- Overview
- Writing my_dataset.py
- Specifying DatasetInfo
- Downloading and extracting source data
- Specifying dataset splits
- Writing an example generator
- Dataset configuration
- Create your own FeatureConnector
- Adding the dataset to tensorflow/datasets
- Define the dataset outside TFDS
- Large datasets and distributed generation
- Testing MyDataset
Datasets are distributed in all kinds of formats and in all kinds of places, and they're not always stored in a format that's ready to feed into a machine learning pipeline. Enter TFDS.
TFDS provides a way to transform all those datasets into a standard format, do the preprocessing necessary to make them ready for a machine learning pipeline, and build a standard input pipeline using tf.data.
To enable this, each dataset implements a subclass of DatasetBuilder, which specifies:
- Where the data is coming from (i.e. its URL);
- What the dataset looks like (i.e. its features);
- How the data should be split (e.g. TRAIN and TEST);
- The individual records in the dataset.
The first time a dataset is used, the dataset is downloaded, prepared, and written to disk in a standard format. Subsequent access will read from those pre-processed files directly.
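For instance, loading a finished dataset from user code is a one-liner. The sketch below assumes the dataset built in this guide is registered under the name "my_dataset":

import tensorflow_datasets as tfds

# The first call downloads and prepares "my_dataset"; subsequent calls
# read the pre-processed files directly from disk.
ds = tfds.load("my_dataset", split="train")
for example in ds.take(1):
  print(example)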
Note: Currently we do not support datasets that take longer than 1 day to generate on a single machine. See the section below on large datasets.
If you want to contribute to our repo and add a new dataset, the following script will help you get started by generating the required python files,...
To use it, clone the tfds repository and run the following command:
python tensorflow_datasets/scripts/create_new_dataset.py \
--dataset my_dataset \
--type image # text, audio, translation,...
Then search for TODO(my_dataset) in the generated files and make the required modifications.
Each dataset is defined as a subclass of tfds.core.DatasetBuilder implementing the following methods:
- _info: builds the DatasetInfo object describing the dataset
- _download_and_prepare: to download and serialize the source data to disk
- _as_dataset: to produce a tf.data.Dataset from the serialized data
Most datasets subclass tfds.core.GeneratorBasedBuilder, which is a subclass of tfds.core.DatasetBuilder that simplifies defining a dataset. It works well for datasets that can be generated on a single machine. Its subclasses implement:
- _info: builds the DatasetInfo object describing the dataset
- _split_generators: downloads the source data and defines the dataset splits
- _generate_examples: yields (key, example) tuples in the dataset from the source data
This guide will use GeneratorBasedBuilder. my_dataset.py first looks like this:
import tensorflow_datasets.public_api as tfds
class MyDataset(tfds.core.GeneratorBasedBuilder):
"""Short description of my dataset."""
VERSION = tfds.core.Version('0.1.0')
def _info(self):
# Specifies the tfds.core.DatasetInfo object
pass # TODO
def _split_generators(self, dl_manager):
# Downloads the data and defines the splits
# dl_manager is a tfds.download.DownloadManager that can be used to
# download and extract URLs
pass # TODO
def _generate_examples(self):
# Yields examples from the dataset
yield 'key', {}
If you'd like to follow a test-driven development workflow, which can help you iterate faster, jump to the testing instructions below, add the test, and then return here.
For an explanation of what the version is, please read datasets versioning.
DatasetInfo describes the dataset.
class MyDataset(tfds.core.GeneratorBasedBuilder):
def _info(self):
return tfds.core.DatasetInfo(
builder=self,
# This is the description that will appear on the datasets page.
description=("This is the dataset for xxx. It contains yyy. The "
"images are kept at their original dimensions."),
# tfds.features.FeatureConnectors
features=tfds.features.FeaturesDict({
"image_description": tfds.features.Text(),
"image": tfds.features.Image(),
# Here, labels can be of 5 distinct values.
"label": tfds.features.ClassLabel(num_classes=5),
}),
# If there's a common (input, target) tuple from the features,
# specify them here. They'll be used if as_supervised=True in
# builder.as_dataset.
supervised_keys=("image", "label"),
# Homepage of the dataset for documentation
homepage="https://dataset-homepage.org",
# Bibtex citation for the dataset
citation=r"""@article{my-awesome-dataset-2020,
author = {Smith, John}}""",
)
Each feature is specified in DatasetInfo as a tfds.features.FeatureConnector. FeatureConnectors document each feature, provide shape and type checks, and abstract away serialization to and from disk. There are many feature types already defined and you can also add a new one.

If you've implemented the test harness, test_info should now pass.
Most datasets need to download data from the web. All downloads and extractions must go through the tfds.download.DownloadManager. DownloadManager currently supports extracting .zip, .gz, and .tar files.

For example, one can both download and extract URLs with download_and_extract:
def _split_generators(self, dl_manager):
# Equivalent to dl_manager.extract(dl_manager.download(urls))
dl_paths = dl_manager.download_and_extract({
'foo': 'https://example.com/foo.zip',
'bar': 'https://example.com/bar.zip',
})
  # dl_paths['foo'] and dl_paths['bar'] now point to the extracted archives.
For source data that cannot be automatically downloaded (for example, it may require a login), the user will manually download the source data and place it in manual_dir, which you can access with dl_manager.manual_dir (defaults to ~/tensorflow_datasets/manual/my_dataset).
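As a minimal sketch (assuming os and tfds are imported as in the skeleton above; the archive name is hypothetical), a manual-download dataset might extract the archive from manual_dir itself:

def _split_generators(self, dl_manager):
  # "my_dataset_data.zip" is a hypothetical archive that the user is asked
  # to download manually and place in dl_manager.manual_dir.
  archive_path = os.path.join(dl_manager.manual_dir, "my_dataset_data.zip")
  extracted_path = dl_manager.extract(archive_path)
  return [
      tfds.core.SplitGenerator(
          name=tfds.Split.TRAIN,
          gen_kwargs={"images_dir_path": os.path.join(extracted_path, "train")},
      ),
  ]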
If the dataset comes with pre-defined splits (for example, MNIST has train and test splits), keep those splits in the DatasetBuilder. If this is your own data and you can decide your own splits, we suggest using a split of (TRAIN: 80%, VALIDATION: 10%, TEST: 10%). Users can always get subsplits through tfds.Split.subsplit.
def _split_generators(self, dl_manager):
# Download source data
extracted_path = dl_manager.download_and_extract(...)
# Specify the splits
return [
tfds.core.SplitGenerator(
name=tfds.Split.TRAIN,
gen_kwargs={
"images_dir_path": os.path.join(extracted_path, "train"),
"labels": os.path.join(extracted_path, "train_labels.csv"),
},
),
tfds.core.SplitGenerator(
name=tfds.Split.TEST,
gen_kwargs={
"images_dir_path": os.path.join(extracted_path, "test"),
"labels": os.path.join(extracted_path, "test_labels.csv"),
},
),
]
SplitGenerator describes how a split should be generated. gen_kwargs will be passed as keyword arguments to _generate_examples, which we'll define next.
_generate_examples generates the examples for each split from the source data. For the TRAIN split with the gen_kwargs defined above, _generate_examples will be called as:
builder._generate_examples(
images_dir_path="{extracted_path}/train",
labels="{extracted_path}/train_labels.csv",
)
This method will typically read source dataset artifacts (e.g. a CSV file) and yield (key, feature dictionary) tuples that correspond to the features specified in DatasetInfo.
def _generate_examples(self, images_dir_path, labels):
# Read the input data out of the source files
for image_file in tf.io.gfile.listdir(images_dir_path):
...
with tf.io.gfile.GFile(labels) as f:
...
# And yield examples as feature dictionaries
for image_id, description, label in data:
yield image_id, {
"image_description": description,
"image": "%s/%s.jpeg" % (images_dir_path, image_id),
"label": label,
}
DatasetInfo.features.encode_example will encode these dictionaries into a format suitable for writing to disk (currently we use tf.train.Example protocol buffers). For example, tfds.features.Image will copy out the JPEG content of the passed image files automatically.

The key (here: image_id) should uniquely identify the record. It is used to shuffle the dataset globally. If two records are yielded using the same key, an exception will be raised during preparation of the dataset.
If you've implemented the test harness, your builder test should now pass.
In order to support Cloud storage systems, use tf.io.gfile or other TensorFlow file APIs (for example, tf.python_io) for all filesystem access. Avoid using Python built-ins for file operations (e.g. open, os.rename, gzip, etc.).
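For instance, a helper for reading a labels CSV can go through tf.io.gfile.GFile rather than the built-in open (the column names below are assumptions for illustration):

import csv

import tensorflow as tf

def _read_labels(labels_path):
  # tf.io.gfile works transparently for local paths and Cloud storage
  # (e.g. gs://...) paths, unlike the built-in open().
  with tf.io.gfile.GFile(labels_path) as f:
    return {row["image_id"]: row["label"] for row in csv.DictReader(f)}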
Some datasets require additional Python dependencies during data generation. For example, the SVHN dataset uses scipy to load some data. In order to keep the tensorflow-datasets package small and allow users to install additional dependencies only as needed, use tfds.core.lazy_imports.

To use lazy_imports:
- Add an entry for your dataset into DATASET_EXTRAS in setup.py. This lets users run, for example, pip install 'tensorflow-datasets[svhn]' to install the extra dependencies.
- Add an entry for your import to LazyImporter and to the LazyImportsTest.
- Use tfds.core.lazy_imports to access the dependency (for example, tfds.core.lazy_imports.scipy) in your DatasetBuilder, as in the sketch after this list.
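A minimal sketch of the last step, inside the DatasetBuilder and assuming a hypothetical .mat source file with image and label arrays:

def _generate_examples(self, data_path):
  # scipy is imported lazily: it is only needed while generating the data,
  # so regular users of the prepared dataset never have to install it.
  scipy = tfds.core.lazy_imports.scipy
  data = scipy.io.loadmat(data_path)
  for index, (image, label) in enumerate(zip(data["images"], data["labels"])):
    yield index, {"image": image, "label": int(label)}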
Some datasets are not perfectly clean and contain some corrupt data (for example, the images are in JPEG files but some are invalid JPEG). These examples should be skipped, but leave a note in the dataset description about how many examples were dropped and why.
Some datasets provide a set of URLs for individual records or features (for example, URLs to various images around the web) that may or may not exist anymore. These datasets are difficult to version properly because the source data is unstable (URLs come and go).
If the dataset is inherently unstable (that is, if multiple runs over time may not yield the same data), mark the dataset as unstable by adding a class constant to the DatasetBuilder: UNSTABLE = "<why this dataset is unstable>". For example, UNSTABLE = "Downloads URLs from the web."
Some datasets may have variants that should be exposed, or options for how the data is preprocessed. These configurations can be separated into 2 categories:
- "Heavy": configuration that affects how the data is written to disk.
- "Light": configuration that affects runtime preprocessing (i.e. configuration that can be done in a tf.data input pipeline).
Heavy configuration affects how the data is written to disk. For example, for text datasets, different TextEncoders and vocabularies affect the token ids that are written to disk.

Heavy configuration is done through tfds.core.BuilderConfigs:
- Define your own configuration object as a subclass of tfds.core.BuilderConfig. For example, MyDatasetConfig.
- Define the BUILDER_CONFIGS class member in MyDataset that lists the MyDatasetConfigs that the dataset exposes (see the sketch after this list).
- Use self.builder_config in MyDataset to configure data generation. This may include setting different values in _info() or changing download data access.
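A minimal sketch of the first two steps, using a hypothetical resolution option:

import tensorflow_datasets.public_api as tfds

class MyDatasetConfig(tfds.core.BuilderConfig):
  """Hypothetical BuilderConfig carrying an image resolution option."""

  def __init__(self, resolution=(28, 28), **kwargs):
    super(MyDatasetConfig, self).__init__(**kwargs)
    self.resolution = resolution


class MyDataset(tfds.core.GeneratorBasedBuilder):

  BUILDER_CONFIGS = [
      MyDatasetConfig(
          name="small",
          version=tfds.core.Version("0.1.0"),
          description="Images downsampled to 28x28.",
          resolution=(28, 28),
      ),
      MyDatasetConfig(
          name="large",
          version=tfds.core.Version("0.1.0"),
          description="Images kept at 128x128.",
          resolution=(128, 128),
      ),
  ]

  # _info, _split_generators, and _generate_examples as before, reading
  # self.builder_config.resolution where needed.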
Datasets with BuilderConfigs have a name and version per config, so the fully qualified name of a particular variant would be dataset_name/config_name (for example, "lm1b/bytes"). The config defaults to the first one in BUILDER_CONFIGS (for example, "lm1b" defaults to "lm1b/plain_text").

See Lm1b for an example of a dataset that uses BuilderConfigs.
For situations where alterations could be made on-the-fly in the tf.data input pipeline, add keyword arguments to the MyDataset constructor, store the values in member variables, and then use them later. For example, override _as_dataset(), call super() to get the base tf.data.Dataset, and then do additional transformations based on the member variables.
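A minimal sketch of that pattern, using a hypothetical lowercase_descriptions option:

import tensorflow as tf
import tensorflow_datasets.public_api as tfds

class MyDataset(tfds.core.GeneratorBasedBuilder):

  def __init__(self, lowercase_descriptions=False, **kwargs):
    super(MyDataset, self).__init__(**kwargs)
    self._lowercase_descriptions = lowercase_descriptions

  def _as_dataset(self, **kwargs):
    ds = super(MyDataset, self)._as_dataset(**kwargs)
    if self._lowercase_descriptions:
      # Light configuration: a pure tf.data transformation, so nothing
      # changes in the files written to disk.
      ds = ds.map(lambda ex: dict(
          ex, image_description=tf.strings.lower(ex["image_description"])))
    return ds

  # _info, _split_generators, and _generate_examples as before.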
Note that most datasets will find the current set of tfds.features.FeatureConnectors sufficient, but sometimes a new one may need to be defined.

Note: If you need a new FeatureConnector not present in the default set and are planning to submit it to tensorflow/datasets, please open a new issue on GitHub with your proposal.
tfds.features.FeatureConnectors in DatasetInfo correspond to the elements returned in the tf.data.Dataset object. For instance, with:
tfds.core.DatasetInfo(features=tfds.features.FeaturesDict({
'input': tfds.features.Image(),
'output': tfds.features.Text(encoder=tfds.text.ByteEncoder()),
'metadata': {
'description': tfds.features.Text(),
'img_id': tf.int32,
},
}))
The items in the tf.data.Dataset object would look like:
{
'input': tf.Tensor(shape=(None, None, 3), dtype=tf.uint8),
'output': tf.Tensor(shape=(None,), dtype=tf.int32), # Sequence of token ids
'metadata': {
'description': tf.Tensor(shape=(), dtype=tf.string),
'img_id': tf.Tensor(shape=(), dtype=tf.int32),
},
}
The tfds.features.FeatureConnector object abstracts away how the feature is encoded on disk from how it is presented to the user. Below is a diagram showing the abstraction layers of the dataset and the transformation from the raw dataset files to the tf.data.Dataset object.
To create your own feature connector, subclass tfds.features.FeatureConnector and implement the abstract methods (a minimal sketch follows these lists):
- get_tensor_info(): indicates the shape/dtype of the tensor(s) returned by tf.data.Dataset
- encode_example(input_data): defines how to encode the data given in the generator _generate_examples() into tf.train.Example-compatible data
- decode_example: defines how to decode the data from the tensor read from tf.train.Example into the user tensor returned by tf.data.Dataset.
- (optionally) get_serialized_info(): if the info returned by get_tensor_info() is different from how the data are actually written on disk, then you need to override get_serialized_info() to match the specs of the tf.train.Example

- If your connector only contains one value, then the get_tensor_info, encode_example, and decode_example methods can directly return a single value (without wrapping it in a dict).
- If your connector is a container of multiple sub-features, the easiest way is to inherit from tfds.features.FeaturesDict and use the super() methods to automatically encode/decode the sub-connectors.
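As a minimal sketch of a single-value connector (purely illustrative; TFDS already provides feature types covering common cases like this):

import numpy as np
import tensorflow as tf
import tensorflow_datasets.public_api as tfds

class NormalizedBox(tfds.features.FeatureConnector):
  """Hypothetical connector storing a box as 4 normalized floats."""

  def get_tensor_info(self):
    # Presented to the user as a float32 tensor of shape (4,).
    return tfds.features.TensorInfo(shape=(4,), dtype=tf.float32)

  def encode_example(self, coords):
    # coords: (ymin, xmin, ymax, xmax) as Python floats in [0, 1].
    return np.asarray(coords, dtype=np.float32)

  def decode_example(self, tfexample_data):
    # Stored and presented identically, so no transformation is needed.
    return tfexample_data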
Have a look at tfds.features.FeatureConnector for more details and the features package for more examples.
If you'd like to share your work with the community, you can check in your dataset implementation to tensorflow/datasets. Thanks for thinking of contributing!
Before you send your pull request, follow these last few steps:
All subclasses of tfds.core.DatasetBuilder are automatically registered when their module is imported such that they can be accessed through tfds.builder and tfds.load.
If you're contributing the dataset to tensorflow/datasets, add the module import to its subdirectory's __init__.py (e.g. image/__init__.py).
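For example, assuming the dataset lives in image/my_dataset.py, the import line might look like:

from tensorflow_datasets.image.my_dataset import MyDataset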
If you're contributing the dataset to tensorflow/datasets, add a checksums file for the dataset. On first download, the DownloadManager will automatically add the sizes and checksums for all downloaded URLs to that file. This ensures that on subsequent data generation, the downloaded files are as expected.
touch tensorflow_datasets/url_checksums/my_new_dataset.txt
Run download_and_prepare locally to ensure that data generation works:
# default data_dir is ~/tensorflow_datasets
python -m tensorflow_datasets.scripts.download_and_prepare \
--register_checksums \
--datasets=my_new_dataset
Note that the --register_checksums flag should only be used during development.
Copy the contents of the dataset_info.json file(s) into a GitHub gist and link to it in your pull request.
It's important that DatasetInfo.citation includes a good citation for the dataset. Contributing a dataset to the community is hard and important work, and we want to make it easy for dataset users to cite it.
If the dataset's website has a specifically requested citation, use that (in BibTeX format).

If the paper is on arXiv, find it there and click the bibtex link on the right-hand side.

If the paper is not on arXiv, find the paper on Google Scholar, click the double-quotation mark underneath the title, and, on the popup, click BibTeX.

If there is no associated paper (for example, there's just a website), you can use the BibTeX Online Editor to create a custom BibTeX entry (the drop-down menu has an Online entry type).
Most datasets in TFDS should have a unit test and your reviewer may ask you to add one if you haven't already. See the testing section below.
Follow the PEP 8 Python style guide, except TensorFlow uses 2 spaces instead of 4. Please conform to the Google Python Style Guide.
Most importantly, use tensorflow_datasets/oss_scripts/lint.sh to ensure your code is properly formatted. For example, to lint the image directory:
./oss_scripts/lint.sh tensorflow_datasets/image
See TensorFlow code style guide for more information.
Send the pull request for review.
When creating the pull request, fill in the areas for the name, issue reference, and GitHub Gist link. When using the checklist, replace each [ ] with [x] to mark it off.
You can use the tfds API to define your own custom datasets outside of the tfds repository. The instructions are mainly the same as above, with some minor adjustments, documented below.
For security and reproducibility when redistributing a dataset, tfds contains URL checksums for all dataset downloads in tensorflow_datasets/url_checksums.

You can register an external checksums directory by calling tfds.download.add_checksums_dir('/path/to/checksums_dir') in your code, so that users of your dataset automatically use your checksums.
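A minimal sketch of such a registration at the top of your dataset module (the relative checksums/ layout is an assumption for illustration):

import os

import tensorflow_datasets.public_api as tfds

# Register the directory holding this dataset's checksums file so that
# anyone importing this module picks the checksums up automatically.
_CHECKSUMS_DIR = os.path.join(os.path.dirname(__file__), "checksums")
tfds.download.add_checksums_dir(_CHECKSUMS_DIR)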
To create this checksum file the first time, you can use the tensorflow_datasets.scripts.download_and_prepare script and pass the flags --register_checksums --checksums_dir=/path/to/checksums_dir.
For testing, instead of using the default fake example directory, you can define your own by setting the EXAMPLE_DIR property of tfds.testing.DatasetBuilderTestCase:
class MyDatasetTest(tfds.testing.DatasetBuilderTestCase):
EXAMPLE_DIR = 'path/to/fakedata'
Some datasets are so large as to require multiple machines to download and generate. We support this use case using Apache Beam. Please read the Beam Dataset Guide to get started.
tfds.testing.DatasetBuilderTestCase is a base TestCase to fully exercise a dataset. It uses "fake examples" as test data that mimic the structure of the source dataset.
The test data should be put in testing/test_data/fake_examples/ under the my_dataset directory and should mimic the source dataset artifacts as downloaded and extracted. It can be created manually or automatically with a script (example script).

If you're using automation to generate the test data, please include that script in testing.
Make sure to use different data in your test data splits, as the test will fail if your dataset splits overlap.
The test data should not contain any copyrighted material. If in doubt, do not create the data using material from the original dataset.
import tensorflow as tf
from tensorflow_datasets import my_dataset
import tensorflow_datasets.testing as tfds_test
class MyDatasetTest(tfds_test.DatasetBuilderTestCase):
DATASET_CLASS = my_dataset.MyDataset
SPLITS = { # Expected number of examples on each split from fake example.
"train": 12,
"test": 12,
}
# If dataset `download_and_extract`s more than one resource:
DL_EXTRACT_RESULT = {
"name1": "path/to/file1", # Relative to fake_examples/my_dataset dir.
"name2": "file2",
}
if __name__ == "__main__":
tfds_test.test_main()
You can run the test as you proceed to implement MyDataset. If you go through all the steps above, it should pass.