Update tfds.load to load datasets from files without using original class. #2493

Merged
merged 1 commit into from
Oct 2, 2020

tfds-copybara
Collaborator

@tfds-copybara tfds-copybara commented Sep 25, 2020

Update tfds.load to load datasets from files without using original class.

Before: In tfds.load('my_dataset'), the DatasetBuilder was always created from the original dataset class (MyDataset).

After: If the version is specified, tfds.load/tfds.builder first check whether an existing version is found on disk. If found, only the files are used to restore the dataset (without ever instantiating the original generation class MyDataset).

  • tfds.load('my_dataset:2.*.*') can load .../my_dataset/2.0.3/ files even if MyDataset.VERSION == '3.0.0'. This improves backward compatibility.

  • Datasets can be read even if the generation code is no longer reachable. In this case, tfds.load('my_dataset') loads the most recent version found on disk, so you can load a dataset generated by someone else without importing the original dataset code. (Caveat: in this case, the config name must be explicit, as TFDS can't currently infer the default config name.)

Note: This requires datasets generated with TFDS 4.0.0+.
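
A minimal sketch of the version lookup described above, in plain Python. It is not the TFDS implementation: `resolve_version_on_disk` and `_version_key` are hypothetical names, and the list of on-disk versions is passed in directly instead of being read from a data directory. It only illustrates how a request such as `2.*.*` could select the most recent matching version folder.

```python
import re
from typing import List, Optional, Tuple


def _version_key(version: str) -> Tuple[int, ...]:
    """Turn '2.0.3' into (2, 0, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def resolve_version_on_disk(
    requested: Optional[str], versions_on_disk: List[str]
) -> Optional[str]:
    """Return the most recent on-disk version matching `requested`.

    `requested` may contain `*` wildcards (e.g. '2.*.*'); `None` means
    "any version". Returns `None` when nothing on disk matches.
    (Hypothetical helper, for illustration only.)
    """
    if requested is None:
        pattern = re.compile(r"\d+\.\d+\.\d+")
    else:
        # Escape the dots, then let each `*` stand for one numeric component.
        pattern = re.compile(re.escape(requested).replace(r"\*", r"\d+"))
    matches = [v for v in versions_on_disk if pattern.fullmatch(v)]
    return max(matches, key=_version_key) if matches else None


# Assumed folder names found under .../my_dataset/
on_disk = ["1.0.0", "2.0.3", "2.0.10", "3.0.0"]
print(resolve_version_on_disk("2.*.*", on_disk))  # 2.0.10
print(resolve_version_on_disk(None, on_disk))     # 3.0.0
```

Comparing versions as integer tuples rather than strings is what makes `2.0.10` rank above `2.0.3` in this sketch.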

@googlebot googlebot added the cla: yes Author has signed CLA label Sep 25, 2020
@tfds-copybara tfds-copybara force-pushed the cl_331954495 branch 6 times, most recently from d5a7e8d to e7ff4d1 Compare October 2, 2020 18:14
@googlebot

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added cla: no Author has not signed CLA and removed cla: yes Author has signed CLA labels Oct 2, 2020
@tfds-copybara tfds-copybara merged commit 358b069 into master Oct 2, 2020
@tfds-copybara tfds-copybara deleted the cl_331954495 branch October 2, 2020 19:33
tfds-copybara pushed a commit that referenced this pull request Oct 6, 2020
API changes, new features:

* Dataset-as-folder: A dataset can now be a self-contained module in a folder with checksums, dummy data, etc. This simplifies implementing datasets outside the TFDS repository.
* `tfds.load` can now load a dataset without using the generation class, so `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (see #2493).
* Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details)
* `tfds.testing.mock_data` does not require metadata files anymore!
* Add `tfds.as_dataframe(ds, ds_info)` with custom visualisation ([example](https://www.tensorflow.org/datasets/overview#tfdsas_dataframe))
* Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`)
* Add new `DatasetBuilder.RELEASE_NOTES` property
* `tfds.features.Image` now supports PNG with 4 channels
* `tfds.ImageFolder` now supports custom shape, dtype
* Downloaded URLs are available through `MyDataset.url_infos`
* Add `skip_prefetch` option to `tfds.ReadConfig`
* `as_supervised=True` support for `tfds.show_examples`, `tfds.as_dataframe`
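
To illustrate the `tfds.even_splits` entry above, here is a rough pure-Python sketch of how evenly spaced percent sub-splits could be produced. `even_splits_sketch` is a hypothetical stand-in, not the library function, and the rounding rule is an assumption chosen to reproduce the `n=3` example from the notes.

```python
from typing import List


def even_splits_sketch(split: str, n: int) -> List[str]:
    """Build `n` percent-based sub-split strings for `split`.

    Hypothetical illustration only; the boundary rounding is an assumption.
    """
    if n <= 0 or n > 100:
        # Percent boundaries are whole numbers, so more than 100 parts
        # cannot all be distinct.
        raise ValueError(f"n should be > 0 and <= 100. Got {n}")
    boundaries = [round(i * 100 / n) for i in range(n + 1)]
    return [
        f"{split}[{start}%:{end}%]"
        for start, end in zip(boundaries[:-1], boundaries[1:])
    ]


print(even_splits_sketch("train", n=3))
# ['train[0%:33%]', 'train[33%:67%]', 'train[67%:100%]']
```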

Breaking compatibility changes:

* `tfds.as_numpy()` now returns an iterable which can be iterated multiple times. To migrate: `next(ds)` -> `next(iter(ds))`
* Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`
* Remove `DatasetBuilder.IN_DEVELOPMENT` property
* Remove `tfds.core.disallow_positional_args` (should use Py3 `*, ` instead)
* `tfds.features` can now be saved/loaded. You may have to override [FeatureConnector.from_json_content](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeatureConnector?version=nightly#from_json_content) and `FeatureConnector.to_json_content` to support this feature.
* Stop testing against TF 1.15. Requires Python 3.6.8+.
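
The `tfds.as_numpy()` migration above comes down to the difference between an iterable and an iterator in Python. The sketch below uses a hypothetical stand-in class, `ReiterableExamples`, rather than TFDS itself, to show why `next(ds)` stops working on a re-iterable object and why `next(iter(ds))` is the replacement.

```python
class ReiterableExamples:
    """Hypothetical stand-in for an iterable that can be looped over many times."""

    def __init__(self, examples):
        self._examples = list(examples)

    def __iter__(self):
        # A fresh iterator on every call, so the data can be re-read.
        return iter(self._examples)


ds = ReiterableExamples([{"label": 0}, {"label": 1}])

# Old style: `next(ds)` fails because `ds` is iterable but not an iterator.
try:
    next(ds)
except TypeError as err:
    print("next(ds) failed:", err)

# New style: create an iterator explicitly.
first = next(iter(ds))

# The same object can be consumed more than once.
first_pass = list(ds)
second_pass = list(ds)
```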

Other bug fixes:

* Better archive extension detection for `dl_manager.download_and_extract`
* Fix `tfds.__version__` in TFDS nightly to be PEP440 compliant
* Fix crash when GCS not available
* Add a script to detect dead URLs
* Improved open-source workflow, contributor guide, documentation
* Many other internal cleanups, bug fixes, dead code removal, py2->py3 cleanup, pytype annotations, etc.

PiperOrigin-RevId: 335617834
tfds-copybara pushed a commit that referenced this pull request Oct 6, 2020
API changes, new features:

* Dataset-as-folder: A dataset can now be a self-contained module in a folder with checksums, dummy data, etc. This simplifies implementing datasets outside the TFDS repository.
* `tfds.load` can now load a dataset without using the generation class, so `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (see #2493).
* Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details)
* `tfds.testing.mock_data` does not require metadata files anymore!
* Add `tfds.as_dataframe(ds, ds_info)` with custom visualisation ([example](https://www.tensorflow.org/datasets/overview#tfdsas_dataframe))
* Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`)
* Add new `DatasetBuilder.RELEASE_NOTES` property
* `tfds.features.Image` now supports PNG with 4 channels
* `tfds.ImageFolder` now supports custom shape, dtype
* Downloaded URLs are available through `MyDataset.url_infos`
* Add `skip_prefetch` option to `tfds.ReadConfig`
* `as_supervised=True` support for `tfds.show_examples`, `tfds.as_dataframe`

Breaking compatibility changes:

* `tfds.as_numpy()` now returns an iterable which can be iterated multiple times. To migrate: `next(ds)` -> `next(iter(ds))`
* Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`
* Remove `DatasetBuilder.IN_DEVELOPMENT` property
* Remove `tfds.core.disallow_positional_args` (should use Py3 `*, ` instead)
* `tfds.features` can now be saved/loaded. You may have to override [FeatureConnector.from_json_content](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeatureConnector?version=nightly#from_json_content) and `FeatureConnector.to_json_content` to support this feature.
* Stop testing against TF 1.15. Requires Python 3.6.8+.

Other bug fixes:

* Better archive extension detection for `dl_manager.download_and_extract`
* Fix `tfds.__version__` in TFDS nightly to be PEP440 compliant
* Fix crash when GCS not available
* Add a script to detect dead URLs
* Improved open-source workflow, contributor guide, documentation
* Many other internal cleanups, bug fixes, dead code removal, py2->py3 cleanup, pytype annotations, etc.

And of course, new datasets and dataset updates.

A gigantic thanks to our community, which has helped us debug issues and implement many features, especially vijayphoenix@, who has been one of our main contributors for this release.

PiperOrigin-RevId: 335667395
axd465 added a commit to axd465/datasets that referenced this pull request Oct 7, 2020
* Add mocking policies

* Mock dataset_info file

* Minor Changes

* Fix imagenet_v2 dataset

* CleanUP

* clean oxford_flowers102

* Fix `tfds.__version__` for nightly release

PiperOrigin-RevId: 333023695

* Disable compute statistics by default

* Creates conflicts with some versions of TFDV/apache_beam
* Slows down generation, while few users require statistics
* Instead, statistics will be computed separately from generation, with something like `builder.compute_statistics()`

PiperOrigin-RevId: 333026241

* clean spoken_digit

* clean oxford_iiit_pet

* Update dtd.py

Dataset generated successfully

* Cleanup code for arc and cbis_ddsm

* Clean up code

* Raise error if input to even_splits is larger than 100

Currently if n > 100, then there will be duplicate items in partitions.
Also updated the test for n=0 and n=101 which should raise ValueError.

PiperOrigin-RevId: 333084035

* Cleanup code for Visual Domain Decathlon dataset

* Update Config Version

* Cleanup code for Geirhos Conflict Stimuli dataset

* Fix error message format

PiperOrigin-RevId: 333239992

* Add `as_supervised` support for `tfds.as_dataframe`

Fix tensorflow#2476

PiperOrigin-RevId: 333241467

* Cleanup IN_DEVELOPMENT property.

PiperOrigin-RevId: 333343892

* Switch to standalone tifffile in tensorflow_datasets.

PiperOrigin-RevId: 333382851

* Update tfds.find_builder_from_dir to support multiple data_dir

PiperOrigin-RevId: 333472732

* Fix nightly `__version__` to be PEP440 compliant

PiperOrigin-RevId: 333482028

* Automated documentation update.

PiperOrigin-RevId: 333574599

* Fix or ignore some pytype errors.

PiperOrigin-RevId: 333581335

* Fix PyPI nightly name

PiperOrigin-RevId: 333691260

* Pass kwargs to Image feature connector

* Adding landing page calling for partners with external companies.

PiperOrigin-RevId: 333758717

* Update Release-Notes

* Update comments

* Parameterize test with number of channels

* Add 4 to ACCEPTABLE_CHANNELS for png encoding

* Add deprecation message when using `create_new_dataset`

PiperOrigin-RevId: 334839865

* Added release notes to Groove, VCTK and ImageNet

* Add release notes for abstract reasoning and binarized mnist

* Add release notes to CelebA, CelebAGQ, Clevr

* Add release notes for downsampled imagenet and dsprites

* Add release notes for lsun and shape3d

* Release notes for Bigearthnet,caltech & catsvsdogs

* Add release notes for cbis_ddsm

* Add release notes for cifar10_corrupted

* Add release notes for colorectal histology

* Add release notes for cyclegan

* Add release notes for Deep Weeds

* Format Fixes

* Update ucf101.py

* Update bair_robot_pushing.py

* Add release notes for diabetic retinopathy detection

* Add release notes for horses_or_humans

* Add release notes for imagenet2012_corrupted

* Add release notes for imagenet2012_real

* Add release notes for imagenet2012_subset

* Add release note for imagenet

* Add release notes for mnist_corrupted

* Add release notes for mnist

* Add release notes for patch_camelyon

* Add release notes for quickdraw

* Add release notes for rock_paper_scissors

* Add release notes for smallnorb

* Add release notes for so2sat

* Add release notes for svhn

* Add release notes for uc_merced

* Add release notes for kitti dataset

* Add release notes for higgs dataset

* Add release notes for iris dataset

* Add release notes for titanic dataset

* Add release notes for big patent

* Add release notes for cnn_dailymail

* Add release notes for xsum

* Add release notes for c4 dataset

* Add release notes for glue dataset

* Add release notes for imdb dataset

* Add release notes for lm1b dataset

* Add release notes for multi_nli

* Add release notes for wikipedia

* Add release notes for wikipedia_toxicity_subtypes

* Add release notes for ted_hrlr

* Add release notes for starcraft

* Update `tfds.load` to load datasets from files without using original class.

**Before:** In `tfds.load('my_dataset')`, the `DatasetBuilder` was always created from the original dataset class (`MyDataset`).

**After:** If the version is specified, `tfds.load`/`tfds.builder` first check whether an existing version is found on disk. If found, only the files are used to restore the dataset (without ever instantiating the original generation class `MyDataset`).

* `tfds.load('my_dataset:2.*.*')` can load `.../my_dataset/2.0.3/` files even if `MyDataset.VERSION == '3.0.0'`. This improves backward compatibility.

* Datasets can be read even if the generation code is no longer reachable. In this case, `tfds.load('my_dataset')` loads the most recent version found on disk, so you can load a dataset generated by someone else without importing the original dataset code. (Caveat: in this case, the config name must be explicit, as TFDS can't currently infer the default config name.)

PiperOrigin-RevId: 335076918

* Update deprecated argument for map_fn

PiperOrigin-RevId: 335080599

* Split test_utils into feature_test_cases.py

To avoid circular deps `test_utils.mock_tf` -> `test_case.TestCase` -> `test_utils.FeatureExpectationTestCase`, `FeatureExpectationTestCase` is moved to a new `feature_test_case.py`

PiperOrigin-RevId: 335081584

* Format fixes

* Format fixes

* Imagenet release notes

* Revert "Imagenet release notes"

This reverts commit 53da06d.

* Format fixes

* Format fixes

* Add setup_teardown.py to share setup between pytest/unittest

PiperOrigin-RevId: 335168637

* Fix comment in ANLI dataset.

PiperOrigin-RevId: 335169563

* Additional mock gfile API fixes

PiperOrigin-RevId: 335173931

* Add release notes for nsynth

* Add release notes for omniglot dataset

* Add release notes for open-images dataset

* Add release notes for moving_mnist

* Automated documentation update.

PiperOrigin-RevId: 335215096

* Update DatasetBuilder to use version.list_all_versions

PiperOrigin-RevId: 335220465

* Automated documentation update.

PiperOrigin-RevId: 335476272

* Improve debug message for audio file encoding (Fix tensorflow#2513)

PiperOrigin-RevId: 335587214

* Update TFDS version to 4.0.0

API changes, new features:

* Dataset-as-folder: A dataset can now be a self-contained module in a folder with checksums, dummy data, etc. This simplifies implementing datasets outside the TFDS repository.
* `tfds.load` can now load a dataset without using the generation class, so `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (see tensorflow#2493).
* Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details)
* `tfds.testing.mock_data` does not require metadata files anymore!
* Add `tfds.as_dataframe(ds, ds_info)` with custom visualisation ([example](https://www.tensorflow.org/datasets/overview#tfdsas_dataframe))
* Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`)
* Add new `DatasetBuilder.RELEASE_NOTES` property
* `tfds.features.Image` now supports PNG with 4 channels
* `tfds.ImageFolder` now supports custom shape, dtype
* Downloaded URLs are available through `MyDataset.url_infos`
* Add `skip_prefetch` option to `tfds.ReadConfig`
* `as_supervised=True` support for `tfds.show_examples`, `tfds.as_dataframe`

Breaking compatibility changes:

* `tfds.as_numpy()` now returns an iterable which can be iterated multiple times. To migrate: `next(ds)` -> `next(iter(ds))`
* Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`
* Remove `DatasetBuilder.IN_DEVELOPMENT` property
* Remove `tfds.core.disallow_positional_args` (should use Py3 `*, ` instead)
* `tfds.features` can now be saved/loaded. You may have to override [FeatureConnector.from_json_content](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeatureConnector?version=nightly#from_json_content) and `FeatureConnector.to_json_content` to support this feature.
* Stop testing against TF 1.15. Requires Python 3.6.8+.

Other bug fixes:

* Better archive extension detection for `dl_manager.download_and_extract`
* Fix `tfds.__version__` in TFDS nightly to be PEP440 compliant
* Fix crash when GCS not available
* Add a script to detect dead URLs
* Improved open-source workflow, contributor guide, documentation
* Many other internal cleanups, bug fixes, dead code removal, py2->py3 cleanup, pytype annotations, etc.

And of course, new datasets and dataset updates.

A gigantic thanks to our community, which has helped us debug issues and implement many features, especially vijayphoenix@, who has been one of our main contributors for this release.

PiperOrigin-RevId: 335667395

* Add Hellaswag dataset to TFDS.

PiperOrigin-RevId: 335692901

Co-authored-by: vijayphoenix <cs17btech11040@iith.ac.in>
Co-authored-by: Ak_0330 <ak03032000@gmail.com>
Co-authored-by: SuryashankarDas <suryashankardas.2002@gmail.com>
Co-authored-by: Etienne Pot <epot@google.com>
Co-authored-by: Keshav.J.Bajaj <46819436+Keshav15@users.noreply.github.com>
Co-authored-by: NikhilBartwal <nikhilbartwal1234@gmail.com>
Co-authored-by: TensorFlow Datasets Team <no-reply@google.com>
Co-authored-by: Copybara-Service <copybara-worker@google.com>
Co-authored-by: Tobias Maier <tobias.maier@humai.tech>
Co-authored-by: Keshav.J.Bajaj <keshavbajaj4444@gmail.com>
Co-authored-by: Sharan Narang <sharannarang@google.com>