All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- The base `Logger` class expects more information to be passed to the `as_dataset` method. This should only be relevant to people who have implemented and registered custom `Logger` class(es).
- Support for community datasets on GCS.
- [API] `tfds.builder_from_directory` and `tfds.builder_from_directories`, see https://www.tensorflow.org/datasets/external_tfrecord#directly_from_folder.
- [API] Dash ("-") support in split names.
- [API] `file_format` argument to `download_and_prepare` method, allowing user to specify an alternative file format to store prepared data (e.g. "riegeli").
- [API] `file_format` to `DatasetInfo` string representation.
- [API] Expose the return value of Beam pipelines. This allows users to read the Beam metrics.
- [API] Expose Feature `tf_example_spec` to public.
- [API] `doc` kwarg on `Feature`s, to describe a feature.
- [Documentation] Features description is shown on TFDS Catalog.
- [Documentation] More metadata about HuggingFace datasets in TFDS catalog.
- [Performance] Parallel load of metadata files.
- [Testing] TFDS tests are now run using GitHub actions - misc improvements such as caching and sharding.
- [Testing] Improvements to MockFs.
- New datasets.
- [API] `num_shards` is now optional in the shard name.
- TFDS pathlib API, migrated to a self-contained `etils.epath` (see https://github.com/google/etils).
- Various datasets.
- Dataset builders that are defined adhoc (e.g. in Colab).
- Better `DatasetNotFoundError` messages.
- Don't set `deterministic` on a global level but locally in interleave, so it only applies to interleave and not all transformations.
- Google drive downloader.
- [API] `split=tfds.split_for_jax_process('train')` (alias of `tfds.even_splits('train', n=jax.process_count())[jax.process_index()]`).
- [Documentation] update.
- Import bug on Windows (#3709).
- [API] Better split API:
  - Splits can be selected using shards: `split='train[3shard]'`.
  - Underscore supported in numbers for better readability: `split='train[:500_000]'`.
  - Select the union of all splits with `split='all'`.
  - `tfds.even_splits` is more precise and flexible:
    - Return splits exactly of the same size when passed `tfds.even_splits('train', n=3, drop_remainder=True)`.
    - Works on subsplits `tfds.even_splits('train[:75%]', n=3)` or even nested.
    - Can be composed with other splits: `tfds.even_splits('train', n=3)[0] + 'test'`.
- [API] `serialize_example` / `deserialize_example` methods on features to encode/decode example to proto: `example_bytes = features.serialize_example(example_data)`.
- [API] `Audio` feature now supports `encoding='zlib'` for better compression.
- [API] Features specs are exposed in proto for better compatibility with other languages.
- [API] Create beam pipeline using TFDS as input with tfds.beam.ReadFromTFDS.
- [API] Support setting the file formats in `tfds build --file_format=tfrecord`.
- [API] Typing annotations exposed in `tfds.typing`.
- [API] `tfds.ReadConfig` has a new `assert_cardinality=False` argument to disable cardinality.
- [API] `tfds.display_progress_bar(True)` for functional control.
- [API] DatasetInfo exposes `.release_notes`.
. - Support for huge number of shards (>99999).
- [Performance] Faster dataset generation (using tfrecords).
- [Testing] Mock dataset now supports nested datasets
- [Testing] Customize the number of sub examples
- [Documentation] Community datasets: https://www.tensorflow.org/datasets/community_catalog/overview.
- [Documentation] Guide on TFDS and determinism.
- [RLDS] Support for nested datasets features.
- [RLDS] New datasets: Robomimic, D4RL Ant Maze, RLU Real World RL, and RLU Atari with ordered episodes.
- New datasets.
- Python 3.6 support: this is the last version of TFDS supporting Python 3.6. Future versions will use Python 3.7.
- Misc bugs.
- [API] `PartialDecoding` support, to decode only a subset of the features (for performance).
- [API] `tfds.features.LabeledImage` for semantic segmentation (like image but with additional `info.features['image_label'].name` label metadata).
- [API] float32 support for `tfds.features.Image` (e.g. for depth map).
- [API] Loading datasets from files now supports custom `tfds.features.FeatureConnector`.
- [API] All FeatureConnector can now have a `None` dimension anywhere (previously restricted to the first position).
- [API] `tfds.features.Tensor()` can have an arbitrary number of dynamic dimensions (`Tensor(..., shape=(None, None, 3, None))`).
- [API] `tfds.features.Tensor` can now be serialised as bytes, instead of float/int values (to allow better compression): `Tensor(..., encoding='zlib')`.
- [API] Support for datasets with `None` in `tfds.as_numpy`.
- Script to add TFDS metadata files to existing TF-record (see doc).
- [TESTING] `tfds.testing.mock_data` now supports:
  - non-scalar tensors with dtype `tf.string`;
  - `builder_from_files` and path-based community datasets.
- [Documentation] Catalog now exposes links to KnowYourData visualisations.
- [Documentation] Guide on common implementation gotchas.
- Many new reinforcement learning datasets.
- [API] Datasets generated with `disable_shuffling=True` are now read in generation order.
- File format automatically restored (for datasets generated with `tfds.builder(..., file_format=)`).
- Dynamically set number of worker threads during extraction.
- Update progression bar during download even if downloads are cached.
- Misc bug fixes.
- [API] `dataset.info.splits['train'].num_shards` to expose the number of shards to the user.
- [API] `tfds.features.Dataset` to have a field containing sub-datasets (e.g. used in RL datasets).
- [API] dtype and `tf.uint16` support in `tfds.features.Video`.
- [API] `DatasetInfo.license` field to add redistributing information.
- [API] `.copy`, `.format` methods to GPath objects.
- [Performances] `tfds.benchmark(ds)` (compatible with any iterator, not just `tf.data`, better colab representation).
- [Performances] Faster `tfds.as_numpy()` (avoid extra `tf.Tensor` <> `np.array` copy).
- [Testing] Support for custom `BuilderConfig` in `DatasetBuilderTest`.
- [Testing] `DatasetBuilderTest` now has a `dummy_data` class property which can be used in `setUpClass`.
- [Testing] `add_tfds_id` and cardinality support to `tfds.testing.mock_data`.
- [Documentation] Better `tfds.as_dataframe` visualisation (Sequence, ragged tensor, semantic masks with `use_colormap`).
- [Experimental] Community datasets support, to allow dynamically importing datasets defined outside the TFDS repository.
- [Experimental] Hugging-face compatibility wrapper to use Hugging-face datasets directly in TFDS.
- [Experimental] Riegeli format support.
- [Experimental] `DatasetInfo.disable_shuffling` to force examples to be read in generation order.
- New datasets.
- Many bugs.
- [CLI] `tfds build` to the CLI. See documentation.
- [API] `tfds.features.Dataset` to represent nested datasets.
- [API] `tfds.ReadConfig(add_tfds_id=True)` to add a unique id to the example `ex['tfds_id']` (e.g. `b'train.tfrecord-00012-of-01024__123'`).
- [API] `num_parallel_calls` option to `tfds.ReadConfig` to overwrite the default `AUTOTUNE` option.
- [API] `tfds.ImageFolder` support for `tfds.decode.SkipDecoder`.
- [API] Multichannel audio support to `tfds.features.Audio`.
- [API] `try_gcs` to `tfds.builder(..., try_gcs=True)`
- Better `tfds.as_dataframe` visualization (ffmpeg video if installed, bounding boxes,...).
- [TESTING] Allow `max_examples_per_splits=0` in `tfds build --max_examples_per_splits=0` to test `_split_generators` only (without `_generate_examples`).
- New datasets.
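
The `tfds_id` example above (`b'train.tfrecord-00012-of-01024__123'`) appears to combine a shard file name with an index inside that shard. The following is a minimal sketch of how such an id could be split apart, assuming that layout holds; `TfdsId` and `parse_tfds_id` are hypothetical helpers written for illustration and are not part of TFDS.

```python
from typing import NamedTuple


class TfdsId(NamedTuple):
    """Pieces of an id such as b'train.tfrecord-00012-of-01024__123'."""
    filename: str
    shard_index: int
    num_shards: int
    example_index: int


def parse_tfds_id(tfds_id: bytes) -> TfdsId:
    # Assumption: the id is '<shard file name>__<index within the shard>'.
    filename, example_index = tfds_id.decode("utf-8").rsplit("__", 1)
    # Assumption: the file name ends with '-<shard>-of-<num_shards>'.
    _prefix, shard_index, of, num_shards = filename.rsplit("-", 3)
    if of != "of":
        raise ValueError(f"Unexpected shard file name: {filename!r}")
    return TfdsId(filename, int(shard_index), int(num_shards), int(example_index))


parsed = parse_tfds_id(b"train.tfrecord-00012-of-01024__123")
print(parsed)
```

Treat the parsed fields as a debugging aid only: the id format is taken from the single example in this changelog and may differ between file formats or versions.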
- [API] DownloadManager now returns Pathlib-like objects.
- [API] Simpler `BuilderConfig` definition: class `VERSION` and `RELEASE_NOTES` are applied to all `BuilderConfig`. Config description is now optional.
- [API] To guarantee better determinism, new validations are performed on the keys when creating a dataset (to avoid filenames as keys (non-deterministic) and restrict key to `str`, `bytes` and `int`). New errors likely indicate an issue in the dataset implementation.
- [API] `tfds.core.benchmark` now returns a `pd.DataFrame` (instead of a `dict`).
- [API] `tfds.units` is not visible anymore from the public API.
- Datasets updates.
- Configs for all text datasets. Only plain text version is kept. For example: `multi_nli/plain_text` -> `multi_nli`.
- [API] Datasets returned by `tfds.as_numpy` are compatible with `len(ds)`.
. - Support 0-len sequence with images of dynamic shape (Fix #2616).
- Progression bar correctly updated when copying files.
- Better debugging and error message (e.g. human readable size,...).
- Many bug fixes (GPath consistency with pathlib, s3 compatibility, TQDM visual artifacts, GCS crash on windows, re-download when checksums updated, ...).
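
The key validation mentioned above restricts example keys to `str`, `bytes` and `int`. As a rough illustration of that type rule only, here is a small sketch; `validate_key` is a hypothetical function, not the TFDS implementation, and it does not attempt the separate check against filenames used as keys.

```python
def validate_key(key):
    """Return the key unchanged if its type is allowed, else raise TypeError."""
    # Assumption of this sketch: bool is rejected even though it subclasses int.
    if isinstance(key, bool) or not isinstance(key, (str, bytes, int)):
        raise TypeError(
            f"Unsupported key type {type(key).__name__}; expected str, bytes or int."
        )
    return key


for candidate in ("img_0001", b"img_0001", 42, 3.14):
    try:
        validate_key(candidate)
        print(f"{candidate!r}: accepted")
    except TypeError as err:
        print(f"{candidate!r}: rejected ({err})")
```

If a dataset starts failing with a key error after upgrading, converting the key to one of the three allowed types is usually the first thing to try.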
- It is now easier to create datasets outside TFDS repository (see our updated dataset creation guide).
- When generating a dataset, if download fails for any reason, it is now possible to manually download the data. See doc.
- `tfds.core.as_path` to create pathlib.Path-like objects compatible with GCS (e.g. `tfds.core.as_path('gs://my-bucket/labels.csv').read_text()`).
- `verify_ssl=` option to `tfds.download.DownloadConfig` to disable SSL certificate verification during download.
- New datasets.
- All datasets inherit from `tfds.core.GeneratorBasedBuilder`. Converting a dataset to beam now only requires changing `_generate_examples` (see example and doc).
- `_split_generators` should now return `{'split_name': self._generate_examples(), ...}` (but current datasets are backward compatible).
- Better `pathlib.Path`, `os.PathLike` compatibility:
  - `dl_manager.manual_dir` now returns a pathlib-like object. Example: `text = (dl_manager.manual_dir / 'downloaded-text.txt').read_text()`
  - Note: Other `dl_manager.download`, `.extract`,... will return pathlib-like objects in future versions.
  - `FeatureConnector`,... and most functions should accept `PathLike` objects. Let us know if some functions you need are missing.
- `--record_checksums` now assumes the new dataset-as-folder model.
- `tfds.core.SplitGenerator`, `tfds.core.BeamBasedBuilder` are deprecated and will be removed in a future version.
- `BuilderConfig` are now compatible with Beam datasets #2348
- `tfds.features.Images` can accept encoded `bytes` images directly (useful when used with `img_name, img_bytes = dl_manager.iter_archive('images.zip')`).
- Doc API now shows deprecated methods, abstract methods to overwrite are now documented.
- You can generate `imagenet2012` with only a single split (e.g. only the validation data). Other splits will be skipped if not present.
- `tfds.load` when generation code isn't present.
- GCS compatibility.
- Dataset-as-folder: Dataset can now be self-contained module in a folder with checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
- `tfds.load` can now load dataset without using the generation class. So `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (See #2493).
- TFDS CLI (see https://www.tensorflow.org/datasets/cli for detail).
- `tfds.testing.mock_data` does not require metadata files anymore!
- `tfds.as_dataframe(ds, ds_info)` with custom visualisation (example).
- `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`).
- `DatasetBuilder.RELEASE_NOTES` property.
- `tfds.features.Image` now supports PNG with 4-channels.
- `tfds.ImageFolder` now supports custom shape, dtype.
- Downloaded URLs are available through `MyDataset.url_infos`.
- `skip_prefetch` option to `tfds.ReadConfig`.
- `as_supervised=True` support for `tfds.show_examples`, `tfds.as_dataframe`.
- tfds.features can now be saved/loaded, you may have to overwrite `FeatureConnector.from_json_content` and `FeatureConnector.to_json_content` to support this feature.
- Script to detect dead-urls.
- New datasets.
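
The `tfds.even_splits` entry above shows percent-based sub-split strings such as `'train[0%:33%]'`. The sketch below rebuilds strings of that shape in plain Python to make the pattern concrete; `even_split_specs` is a hypothetical helper, and rounding the boundaries to the nearest percent is an assumption chosen to match the two values quoted in the entry, not a description of how TFDS computes them.

```python
from typing import List


def even_split_specs(split: str, n: int) -> List[str]:
    """Build n contiguous percent slices that together cover `split`."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumption: boundaries are rounded to the nearest whole percent.
    bounds = [round(100 * i / n) for i in range(n + 1)]
    return [f"{split}[{lo}%:{hi}%]" for lo, hi in zip(bounds, bounds[1:])]


specs = even_split_specs("train", 3)
print(specs)
```

Because consecutive slices share a boundary, the slices do not overlap and leave no gap, which is the property that makes them suitable for distributing a split across workers.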
- `tfds.as_numpy()` now returns an iterable which can be iterated multiple times. To migrate: `next(ds)` -> `next(iter(ds))`.
- Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`.
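
The `next(ds)` -> `next(iter(ds))` migration above follows from the general difference between a Python iterator and a re-iterable object. The sketch below shows that difference with a stand-in class; `ReusableIterable` is hypothetical and only mimics the behaviour described in the entry, it is not what `tfds.as_numpy()` returns.

```python
class ReusableIterable:
    """Stand-in for an object that can be iterated several times."""

    def __init__(self, items):
        self._items = list(items)

    def __iter__(self):
        # A fresh iterator is created on every call, so iteration can restart.
        return iter(self._items)


ds = ReusableIterable([{"label": 0}, {"label": 1}])

# An iterable is not itself an iterator, so next(ds) is a TypeError.
try:
    next(ds)
    next_worked = True
except TypeError:
    next_worked = False

first = next(iter(ds))        # the migrated form
first_pass = list(ds)
second_pass = list(ds)        # a second pass sees the same examples again
print(next_worked, first, first_pass == second_pass)
```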
- `DatasetBuilder.IN_DEVELOPMENT` property.
- `tfds.core.disallow_positional_args` (should use Py3 `*,` instead).
- Testing against TF 1.15. Requires Python 3.6.8+.
- Better archive extension detection for `dl_manager.download_and_extract`.
- Fix `tfds.__version__` in TFDS nightly to be PEP440 compliant
- Fix crash when GCS not available.
- Improved open-source workflow, contributor guide, documentation.
- Many other internal cleanups, bugs, dead code removal, py2->py3 cleanup, pytype annotations,...
- Datasets updates.
- Issue with GCS on Windows.
- [API] `tfds.ImageFolder` and `tfds.TranslateFolder` to easily create custom datasets with your custom data.
- [API] `tfds.ReadConfig(input_context=)` to shard dataset, for better multi-worker compatibility (#1426).
- [API] The default `data_dir` can be controlled by the `TFDS_DATA_DIR` environment variable.
- [API] Better usability when developing datasets outside TFDS: downloads are always cached, checksums are optional.
- Scripts to help deployment/documentation (Generate catalog documentation, export all metadata files, ...).
- [Documentation] Catalog display images (example).
- [Documentation] Catalog shows which datasets have been recently added and are only available in `tfds-nightly`.
- [API] `tfds.show_statistics(ds_info)` to display FACETS OVERVIEW. Note: This requires the dataset to have been generated with the statistics.
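
The `TFDS_DATA_DIR` entry above says the default `data_dir` can be controlled through an environment variable. The sketch below illustrates one plausible resolution order; `resolve_data_dir` is a hypothetical helper, and both the precedence (explicit argument, then environment variable, then a fallback) and the `~/tensorflow_datasets` fallback are assumptions of this example rather than statements from this changelog.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_data_dir(
    explicit: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
) -> Path:
    """Pick a data directory: explicit argument, then TFDS_DATA_DIR, then fallback."""
    environ = os.environ if environ is None else environ
    if explicit:
        return Path(explicit)
    env_value = environ.get("TFDS_DATA_DIR")
    if env_value:
        return Path(env_value)
    # Assumed fallback location for this sketch.
    return Path.home() / "tensorflow_datasets"


from_env = resolve_data_dir(environ={"TFDS_DATA_DIR": "/mnt/datasets"})
from_arg = resolve_data_dir("/scratch/tfds", environ={"TFDS_DATA_DIR": "/mnt/datasets"})
fallback = resolve_data_dir(environ={})
print(from_env, from_arg, fallback)
```

Passing `environ` explicitly keeps the helper easy to test without mutating the real process environment.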
- `tfds.features.text` encoding API. Please use tensorflow_text instead.
- `tfds.load('image_label_folder')` in favor of the more user-friendly `tfds.ImageFolder`.
- Fix deterministic example order on Windows when path was used as key (this only impacts a few datasets). Now example order should be the same on all platforms.
- Misc performance improvements for both generation and reading (e.g. use `__slot__`, fix parallelisation bug in `tf.data.TFRecordReader`, ...).
- Misc fixes (typo, types annotations, better error messages, fixing dead links, better windows compatibility, ...).
- [API] `tfds.builder_cls(name)` to access a DatasetBuilder class by name
- [API] `info.split['train'].filenames` for access to the tf-record files.
- [API] `tfds.core.add_data_dir` to register an additional data dir.
- [Testing] Support for custom decoders in `tfds.testing.mock_data`.
- [Documentation] Shows which datasets are only present in `tfds-nightly`.
- [Documentation] Display images for supported datasets.
- Rename `tfds.core.NamedSplit`, `tfds.core.SplitBase` -> `tfds.Split`. Now `tfds.Split.TRAIN`,... are instances of `tfds.Split`.
- Rename `interleave_parallel_reads` -> `interleave_cycle_length` for `tfds.ReadConfig`.
- Invert ds, ds_info argument orders for `tfds.show_examples`.
- `tfds.features.text` encoding API. Please use `tensorflow_text` instead.
- `num_shards` argument from `tfds.core.SplitGenerator`. This argument was ignored as shards are automatically computed.
- Most `ds.with_options` which were applied by TFDS. Now use `tf.data` default.
- Better error messages.
- Windows compatibility.
- `DownloadManager` is now picklable (can be used inside Beam pipelines).
- `tfds.features.Audio`:
  - Support float as returned value.
  - Expose sample_rate through `info.features['audio'].sample_rate`.
  - Support for encoding audio features from file objects.
- More datasets.
- New `image_classification` section. Some datasets have been moved there from `images`.
- `DownloadConfig` does not append the dataset name anymore (manual data should be in `<manual_dir>/` instead of `<manual_dir>/<dataset_name>/`).
- Tests now check that all `dl_manager.download` urls have registered checksums. To opt-out, add `SKIP_CHECKSUMS = True` to your `DatasetBuilderTestCase`.
- `tfds.load` now always returns `tf.compat.v2.Dataset`. If you're still using `tf.compat.v1`:
  - Use `tf.compat.v1.data.make_one_shot_iterator(ds)` rather than `ds.make_one_shot_iterator()`.
  - Use `isinstance(ds, tf.compat.v2.Dataset)` instead of `isinstance(ds, tf.data.Dataset)`.
- The `tfds.features.text` encoding API is deprecated. Please use tensorflow_text instead.
- `num_shards` argument of `tfds.core.SplitGenerator` is currently ignored and will be removed in the next version.
- Legacy mode `tfds.experiment.S3` has been removed
- `in_memory` argument has been removed from `as_dataset`/`tfds.load` (small datasets are now auto-cached).
- `tfds.Split.ALL`.
- Various bugs, better error messages, documentation improvements.
- Datasets expose `info.dataset_size` and `info.download_size`.
- Auto-caching small datasets.
- Datasets expose their cardinality `num_examples = tf.data.experimental.cardinality(ds)` (Requires tf-nightly or TF >= 2.2.0)
- Get the number of examples in a sub-split with: `info.splits['train[70%:]'].num_examples`
- All datasets generated with 2.1.0 cannot be loaded with previous versions (previous datasets can be read with `2.1.0` however).
- `in_memory` argument is deprecated and will be removed in a future version.
- Several new datasets. Thanks to all the contributors!
- Support for nested `tfds.features.Sequence` and `tf.RaggedTensor`
- Custom `FeatureConnector`s can override the `decode_batch_example` method for efficient decoding when wrapped inside a `tfds.features.Sequence(my_connector)`.
- Beam datasets can use a `tfds.core.BeamMetadataDict` to store additional metadata computed as part of the Beam pipeline.
- Beam datasets' `_split_generators` accepts an additional `pipeline` kwargs to define a pipeline shared between all splits.
- The default versions of all datasets are now using the S3 slicing API. See the guide for details.
- `shuffle_files` defaults to False so that dataset iteration is deterministic by default. You can customize the reading pipeline, including shuffling and interleaving, through the new `read_config` parameter in `tfds.load`.
- `urls` kwargs renamed `homepage` in `DatasetInfo`
- Python2 support: this is the last version of TFDS that will support Python 2. Going forward, we'll only support and test against Python 3.
- The previous split API is still available, but is deprecated. If you wrote `DatasetBuilder`s outside the TFDS repository, please make sure they do not use `experiments={tfds.core.Experiment.S3: False}`. This will be removed in the next version, as well as the `num_shards` kwargs from `SplitGenerator`.
- Various other bug fixes and performance improvements. Thank you for all the reports and fixes!
- Misc bugs and performance improvements.
- Add `shuffle_files` argument to `tfds.load` function. The semantic is the same as in `builder.as_dataset` function, which for now means that by default, files will be shuffled for `TRAIN` split, and not for other splits. Default behaviour will change to always be False at next major release.
- Most datasets now support the new S3 API (documentation).
- Support for uint16 PNG images.
- AFLW2000-3D
- Amazon_US_Reviews
- binarized_mnist
- BinaryAlphaDigits
- Caltech Birds 2010
- Coil100
- DeepWeeds
- Food101
- MIT Scene Parse 150
- RockYou leaked password
- Stanford Dogs
- Stanford Online Products
- Visual Domain Decathlon
- Crash while shuffling on Windows
- Various documentation improvements
- `in_memory` option to cache small dataset in RAM.
- Better sharding, shuffling and sub-split.
- It is now possible to add arbitrary metadata to `tfds.core.DatasetInfo` which will be stored/restored with the dataset. See `tfds.core.Metadata`.
- Better proxy support, possibility to add certificate.
- `decoders` kwargs to override the default feature decoding (guide).
- downsampled_imagenet.
- patch_camelyon.
- coco 2017 (with and without panoptic annotations).
- uc_merced.
- trivia_qa.
- super_glue.
- so2sat.
- snli.
- resisc45.
- pet_finder.
- mnist_corrupted.
- kitti.
- eurosat.
- definite_pronoun_resolution.
- curated_breast_imaging_ddsm.
- clevr.
- bigearthnet.
- Apache Beam support.
- Direct GCS access for MNIST (with `tfds.load('mnist', try_gcs=True)`).
- More datasets.
- Option to turn off tqdm bar (`tfds.disable_progress_bar()`).
- Subsplits do not depend on the number of shards anymore (tensorflow#292).
- Various bugs.
- Dataset `celeb_a_hq`.
- Bug #52 that was putting the process in Eager mode by default.
- 25 datasets.
- Ready to be used `tensorflow-datasets`.