
Tags: wayneshix/datasets


v4.3.0

Automated documentation update.

PiperOrigin-RevId: 372407159

v4.2.0

Update TFDS to 4.2.0

API:

 * Add `tfds build` to the CLI. See [documentation](https://www.tensorflow.org/datasets/cli#tfds_build_download_and_prepare_a_dataset).
 * DownloadManager now returns [Pathlib-like](https://docs.python.org/3/library/pathlib.html#basic-use) objects
 * Datasets returned by `tfds.as_numpy` are compatible with `len(ds)`
 * New `tfds.features.Dataset` to represent nested datasets
 * Add `tfds.ReadConfig(add_tfds_id=True)` to add a unique identifier to each example as `ex['tfds_id']` (e.g. `b'train.tfrecord-00012-of-01024__123'`)
 * Add `num_parallel_calls` option to `tfds.ReadConfig` to override the default `AUTOTUNE` option
 * `tfds.ImageFolder` now supports `tfds.decode.SkipDecoder`
 * Add multichannel audio support to `tfds.features.Audio`
 * Better `tfds.as_dataframe` visualization (ffmpeg video if installed, bounding boxes,...)
 * Add `try_gcs` to `tfds.builder(..., try_gcs=True)`
 * Simpler `BuilderConfig` definition: global `VERSION` and `RELEASE_NOTES` are applied to all `BuilderConfig`. Config description is now optional.
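
As an illustration of the `tfds_id` format, here is a minimal sketch that splits such an id into its parts. The pattern is inferred only from the single example in the list above, and `TfdsId` and `parse_tfds_id` are hypothetical names, not part of the TFDS API.

```python
import re
from typing import NamedTuple


class TfdsId(NamedTuple):
    """Fields recovered from an id such as b'train.tfrecord-00012-of-01024__123'."""

    split: str
    shard: int
    num_shards: int
    index: int


# Assumption: the layout is '<split>.tfrecord-<shard>-of-<num_shards>__<index>',
# as suggested by the example in the release notes.
_ID_PATTERN = re.compile(
    r"^(?P<split>.+)\.tfrecord-(?P<shard>\d+)-of-(?P<num_shards>\d+)__(?P<index>\d+)$"
)


def parse_tfds_id(raw: bytes) -> TfdsId:
    """Parse a raw `tfds_id` value (bytes) into named fields."""
    match = _ID_PATTERN.match(raw.decode("utf-8"))
    if match is None:
        raise ValueError(f"Unrecognized tfds_id: {raw!r}")
    return TfdsId(
        split=match["split"],
        shard=int(match["shard"]),
        num_shards=int(match["num_shards"]),
        index=int(match["index"]),
    )


parsed = parse_tfds_id(b"train.tfrecord-00012-of-01024__123")
print(parsed)
```

Ids in this form make it possible to trace an example back to the shard file and record position it came from, which can help when debugging a single bad example.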

Breaking compatibility changes:

* Removed the non-plain-text configs of text datasets, along with the config name itself: `multi_nli/plain_text` -> `multi_nli`
* To guarantee better determinism, new validations are performed on the keys when creating a dataset (to avoid filenames as keys, which are non-deterministic, and to restrict keys to `str`, `bytes` and `int`). New errors likely indicate an issue in the dataset implementation.
* `tfds.core.benchmark` now returns a `pd.DataFrame` (instead of a `dict`)
* `tfds.units` is not visible anymore from the public API
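
The key restriction described above can be pictured with a small validator. This is only a sketch of the rule as stated in these notes; `check_example_key` is a hypothetical helper, and the actual checks inside TFDS may differ.

```python
import os
import pathlib

# Key types named in the release notes.
_ALLOWED_KEY_TYPES = (str, bytes, int)


def check_example_key(key):
    """Return `key` unchanged if it is an allowed type, else raise TypeError."""
    # Path objects are called out separately so the error message can explain
    # why they are a problem: their string form depends on the platform.
    if isinstance(key, os.PathLike):
        raise TypeError(
            f"Path-like key {key!r} is not allowed; use a stable str or int instead."
        )
    if not isinstance(key, _ALLOWED_KEY_TYPES):
        raise TypeError(f"Key must be str, bytes or int, got {type(key).__name__}.")
    return key


print(check_example_key("img_0001"))  # A plain string key is accepted.
```

In a dataset's `_generate_examples`, this suggests yielding something like a record index or a stable string id as the key instead of a file path.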

Bug fixes:

* Support 0-len sequence with images of dynamic shape (Fix tensorflow#2616)
* Progress bar correctly updated when copying files.
* Many bug fixes (GPath consistency with pathlib, S3 compatibility, TQDM visual artifacts, GCS crash on Windows, re-download when checksums updated,...)
* Better debugging and error messages (e.g. human readable size,...)
* Allow `max_examples_per_splits=0` in `tfds build --max_examples_per_splits=0` to test `_split_generators` only (without `_generate_examples`).

And of course, new datasets and many datasets updates.

Thank you to the community for their many valuable contributions and for supporting us in this project!!!

PiperOrigin-RevId: 350344016

v4.1.0

Update TFDS to v4.1.0

* It is now possible to manually download the data for all datasets (if the automated download fails for any reason). See [doc](https://www.tensorflow.org/datasets/overview#load_a_dataset).
* Simplification of the dataset creation API.
  * We've made it easier to create datasets outside the TFDS repository (see our updated [dataset creation guide](https://www.tensorflow.org/datasets/add_dataset)).
  * `_split_generators` should now return `{'split_name': self._generate_examples(), ...}` (but current datasets are backward compatible).
  * All datasets inherit from `tfds.core.GeneratorBasedBuilder`. Converting a dataset to Beam now only requires changing `_generate_examples` (see [example and doc](https://www.tensorflow.org/datasets/beam_datasets#instructions)).
  * `tfds.core.SplitGenerator` and `tfds.core.BeamBasedBuilder` are deprecated and will be removed in a future version.

* Better `pathlib.Path`, `os.PathLike` compatibility:
  * `dl_manager.manual_dir` now returns a pathlib-like object. Example:

  ```python
  text = (dl_manager.manual_dir / 'downloaded-text.txt').read_text()
  ```

  * Note: Other `dl_manager.download`, `.extract`,... will return pathlib-like objects in future versions
  * `FeatureConnector`,... and most functions should accept `PathLike` objects. Let us know if some functions you need are missing.
  * Add a `tfds.core.as_path` to create pathlib.Path-like objects compatible with GCS (e.g. `tfds.core.as_path('gs://my-bucket/labels.csv').read_text()`).

* Other bug fixes and improvements, e.g.:
  * Add `verify_ssl=` option to `tfds.download.DownloadConfig` to disable SSL certificate verification during download.
  * `BuilderConfig` are now compatible with Beam datasets tensorflow#2348
  * `--record_checksums` now assumes the new dataset-as-folder model
  * `tfds.features.Image` can accept encoded `bytes` images directly (useful when used with `img_name, img_bytes = dl_manager.iter_archive('images.zip')`).
  * API docs now show deprecated methods, and abstract methods to override are now documented.
  * You can generate `imagenet2012` with only a single split (e.g. only the validation data). Other splits will be skipped if not present.
* And of course, new datasets...
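
To make the new `_split_generators` return shape concrete, here is a rough sketch that runs without TFDS installed. `MyDatasetSketch` and `FakeDownloadManager` are hypothetical stand-ins: a real dataset would subclass `tfds.core.GeneratorBasedBuilder` and receive a real download manager, and the file names are invented for the example.

```python
import pathlib
import tempfile


class MyDatasetSketch:
    """Stand-in for a builder; a real one would subclass tfds.core.GeneratorBasedBuilder."""

    def _split_generators(self, dl_manager):
        # `manual_dir` is treated as a pathlib-like object, so `/` joins paths.
        data_dir = dl_manager.manual_dir
        # New style: map each split name directly to an example generator.
        return {
            "train": self._generate_examples(data_dir / "train.txt"),
            "test": self._generate_examples(data_dir / "test.txt"),
        }

    def _generate_examples(self, path):
        # Yield (key, example) pairs; the line number is a deterministic int key.
        for line_no, line in enumerate(path.read_text().splitlines()):
            yield line_no, {"text": line}


class FakeDownloadManager:
    """Hypothetical test double exposing only `manual_dir`."""

    def __init__(self, manual_dir):
        self.manual_dir = pathlib.Path(manual_dir)


with tempfile.TemporaryDirectory() as tmp:
    root = pathlib.Path(tmp)
    (root / "train.txt").write_text("hello\nworld\n")
    (root / "test.txt").write_text("bye\n")
    splits = MyDatasetSketch()._split_generators(FakeDownloadManager(root))
    # Consume the generators while the temporary files still exist.
    examples = {name: list(gen) for name, gen in splits.items()}

print(examples)
```

Because the split dictionary holds generators, nothing is read from disk until each split is iterated.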

Thank you to all our contributors for improving TFDS!

PiperOrigin-RevId: 340614460

v4.0.1

Update TFDS to 4.0.1

Fix `tfds.load` when generation code isn't present and improve GCS compatibility.

Thanks @carlthome for reporting and fixing the issue.

PiperOrigin-RevId: 336306487

v4.0.0

Update TFDS version to 4.0.0

API changes, new features:

* Dataset-as-folder: Datasets can now be self-contained modules in a folder with checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
* `tfds.load` can now load a dataset without using the generation class. So `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (See tensorflow#2493).
* Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details)
* `tfds.testing.mock_data` does not require metadata files anymore!
* Add `tfds.as_dataframe(ds, ds_info)` with custom visualisation ([example](https://www.tensorflow.org/datasets/overview#tfdsas_dataframe))
* Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`)
* Add new `DatasetBuilder.RELEASE_NOTES` property
* `tfds.features.Image` now supports PNG with 4-channels
* `tfds.ImageFolder` now supports custom shape, dtype
* Downloaded URLs are available through `MyDataset.url_infos`
* Add `skip_prefetch` option to `tfds.ReadConfig`
* `as_supervised=True` support for `tfds.show_examples`, `tfds.as_dataframe`
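
The percent boundaries in the `tfds.even_splits` example can be reproduced with a few lines of plain Python. This is a sketch of the arithmetic only, assuming boundaries are rounded to whole percents. `even_splits_sketch` is a hypothetical name, and the real function may compute or represent subsplits differently.

```python
from typing import List


def even_splits_sketch(split: str, n: int) -> List[str]:
    """Return `n` percent-slice strings that together cover `split`."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Boundaries at 0, 100/n, 2*100/n, ..., 100, each rounded to a whole percent.
    bounds = [round(i * 100 / n) for i in range(n + 1)]
    # Pair consecutive boundaries so the slices are contiguous and non-overlapping.
    return [f"{split}[{lo}%:{hi}%]" for lo, hi in zip(bounds, bounds[1:])]


print(even_splits_sketch("train", n=3))
# ['train[0%:33%]', 'train[33%:67%]', 'train[67%:100%]']
```

Each returned string is in the slicing syntax shown in the list above, so one could be passed as the `split` argument when loading a dataset.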

Breaking compatibility changes:

* `tfds.as_numpy()` now returns an iterable which can be iterated multiple times. To migrate `next(ds)` -> `next(iter(ds))`
* Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`
* Remove `DatasetBuilder.IN_DEVELOPMENT` property
* Remove `tfds.core.disallow_positional_args` (should use Py3 `*, ` instead)
* `tfds.features` can now be saved/loaded. You may have to override [FeatureConnector.from_json_content](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeatureConnector?version=nightly#from_json_content) and `FeatureConnector.to_json_content` to support this feature.
* Stop testing against TF 1.15. Requires Python 3.6.8+.
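
The `next(ds)` -> `next(iter(ds))` migration follows from the difference between an iterable and an iterator. The toy class below is a stand-in for the object returned by `tfds.as_numpy()`, used here so the sketch runs without TFDS; `ReiterableExamples` is a hypothetical name.

```python
class ReiterableExamples:
    """Toy iterable that can be traversed several times."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __iter__(self):
        # A fresh iterator on every call is what allows multiple passes.
        return iter(self._examples)


ds = ReiterableExamples([{"label": 0}, {"label": 1}])

try:
    next(ds)  # Old pattern: fails because `ds` is an iterable, not an iterator.
except TypeError:
    old_pattern_works = False
else:
    old_pattern_works = True

first = next(iter(ds))   # Migrated pattern.
first_pass = list(ds)
second_pass = list(ds)   # A second pass sees the same examples again.

print(old_pattern_works, first, first_pass == second_pass)
```

Note that each `iter(ds)` call starts again from the first example, so code that relied on `next(ds)` to advance through the data should keep a single iterator: `it = iter(ds)` followed by repeated `next(it)`.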

Other bug fixes:

* Better archive extension detection for `dl_manager.download_and_extract`
* Fix `tfds.__version__` in TFDS nightly to be PEP440 compliant
* Fix crash when GCS not available
* Script to detect dead-urls
* Improved open-source workflow, contributor guide, documentation
* Many other internal cleanups, bugs, dead code removal, py2->py3 cleanup, pytype annotations,...

And of course, new datasets, datasets updates.

A gigantic thanks to our community, which has helped us debug issues and implement many features, especially vijayphoenix@, who has been one of our main contributors for this release.

PiperOrigin-RevId: 335667395

v3.2.1

Update TFDS to 3.2.1

v3.2.0

Update TFDS version to 3.2.0

API:

 * Add a `tfds.ImageFolder` and `tfds.TranslateFolder` to easily create custom datasets with your custom data.
 * Add a `tfds.ReadConfig(input_context=)` to shard dataset, for better multi-worker compatibility (tensorflow#1426).
 * The default `data_dir` can be controlled by the `TFDS_DATA_DIR` environment variable.
 * Better usability when developing datasets outside TFDS
   * Downloads are always cached
   * Checksums are optional
 * Added a `tfds.show_statistics(ds_info)` to display [FACETS OVERVIEW](https://pair-code.github.io/facets/). Note: This requires the dataset to have been generated with the statistics.
 * Open source various scripts to help deployment/documentation (Generate catalog documentation, export all metadata files,...)
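
The `TFDS_DATA_DIR` behaviour can be pictured as a simple lookup with a fallback. The sketch below assumes that an explicit `data_dir` argument wins over the environment variable and that `~/tensorflow_datasets` is the fallback; `resolve_data_dir` is a hypothetical helper, not the TFDS implementation.

```python
import os
from typing import Mapping, Optional

# Assumed fallback location when nothing else is configured.
_DEFAULT_DATA_DIR = os.path.join("~", "tensorflow_datasets")


def resolve_data_dir(
    explicit: Optional[str] = None,
    environ: Mapping[str, str] = os.environ,
) -> str:
    """Pick the data directory: explicit argument, then env var, then default."""
    if explicit:
        return os.path.expanduser(explicit)
    return os.path.expanduser(environ.get("TFDS_DATA_DIR", _DEFAULT_DATA_DIR))


# A plain dict stands in for the process environment to keep the demo hermetic.
print(resolve_data_dir(environ={"TFDS_DATA_DIR": "/data/tfds"}))
print(resolve_data_dir("/scratch/tfds", environ={"TFDS_DATA_DIR": "/data/tfds"}))
```

Setting the variable once (for instance in a shell profile or job configuration) avoids passing `data_dir=` to every call.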

Documentation:

 * Catalog displays images ([example](https://www.tensorflow.org/datasets/catalog/sun397#sun397standard-part2-120k))
 * Catalog shows which datasets have been recently added and are only available in `tfds-nightly`

Breaking compatibility change:

 * Fix deterministic example order on Windows when a path was used as key (this only impacts a few datasets). Example order should now be the same on all platforms.
 * Remove `tfds.load('image_label_folder')` in favor of the more user-friendly `tfds.ImageFolder`

Other:

 * Various performance improvements for both generation and reading (e.g. use `__slots__`, fix parallelisation bug in `tf.data.TFRecordReader`,...)
 * Various fixes (typos, type annotations, better error messages, fixing dead links, better Windows compatibility,...)

PiperOrigin-RevId: 320672697

v3.1.0

Update version to `3.1.0`

PiperOrigin-RevId: 309069766

v3.0.0

Update TFDS version

Breaking changes:
* Legacy mode `tfds.experiment.S3` has been removed
* New `tfds.image_classification` section; some datasets were moved there from `tfds.images`.
* `in_memory` argument removed from `as_dataset`/`tfds.load` (small datasets are auto-cached).
* `DownloadConfig` does not append the dataset name anymore (manual data should be in `<manual_dir>/` instead of `<manual_dir>/<dataset_name>/`)
* Tests now check that all `dl_manager.download` URLs have registered checksums. To opt out, add `SKIP_CHECKSUMS = True` to your `DatasetBuilderTestCase`.
* `tfds.load` now always returns `tf.compat.v2.data.Dataset`. If you're still using `tf.compat.v1`:
   * Use `tf.compat.v1.data.make_one_shot_iterator(ds)` rather than `ds.make_one_shot_iterator()`
   * Use `isinstance(ds, tf.compat.v2.data.Dataset)` instead of `isinstance(ds, tf.data.Dataset)`
* `tfds.Split.ALL` has been removed from the API.

Future breaking change:
* The `tfds.features.text` encoding API is deprecated. Please use [tensorflow_text](https://www.tensorflow.org/tutorials/tensorflow_text/intro) instead.
* `num_shards` argument of `tfds.core.SplitGenerator` is currently ignored and will be removed in the next version.

Features:
* `DownloadManager` is now picklable (can be used inside Beam pipelines)
* `tfds.features.Audio`:
  * Support float as returned value
  * Expose sample_rate through `info.features['audio'].sample_rate`
  * Support for encoding audio features from file objects
* Various bug fixes, better error messages, documentation improvements
* More datasets

Thank you to all our contributors for helping us make TFDS better for everyone!

PiperOrigin-RevId: 306768189

v2.1.0

Update TFDS to 2.1.0

PiperOrigin-RevId: 297186194