Update `tfds.load` to load datasets from files without using original class. #2493
… class.

**Before:** In `tfds.load('my_dataset')`, the `DatasetBuilder` was always created from the original dataset class (`MyDataset`).

**After:** If the version is specified, `tfds.load`/`tfds.builder` first check whether an existing version is found on disk. If found, only the files are used to restore the dataset (without ever instantiating the original generation class `MyDataset`).

* `tfds.load('my_dataset:2.*.*')` can load `.../my_dataset/2.0.3/` files even if `MyDataset.VERSION == '3.0.0'`. This improves backward compatibility.
* The dataset can be read even if the generation code isn't reachable anymore. In this case, `tfds.load('my_dataset')` will load the most recent version found on disk. So you can load a dataset generated by someone else without having to import the original dataset code. (Caveat: in this case, the config name must be explicit, as TFDS can't currently infer the default config name.)

PiperOrigin-RevId: 335076918
API changes, new features:

* Dataset-as-folder: A dataset can now be a self-contained module in a folder with checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
* `tfds.load` can now load a dataset without using the generation class. So `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (see #2493).
* Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details)
* `tfds.testing.mock_data` does not require metadata files anymore!
* Add `tfds.as_dataframe(ds, ds_info)` with custom visualisation ([example](https://www.tensorflow.org/datasets/overview#tfdsas_dataframe))
* Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`)
* Add new `DatasetBuilder.RELEASE_NOTES` property
* `tfds.features.Image` now supports PNG with 4 channels
* `tfds.ImageFolder` now supports custom shape, dtype
* Downloaded URLs are available through `MyDataset.url_infos`
* Add `skip_prefetch` option to `tfds.ReadConfig`
* `as_supervised=True` support for `tfds.show_examples`, `tfds.as_dataframe`

Breaking compatibility changes:

* `tfds.as_numpy()` now returns an iterable which can be iterated multiple times. To migrate: `next(ds)` -> `next(iter(ds))`
* Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`
* Remove `DatasetBuilder.IN_DEVELOPMENT` property
* Remove `tfds.core.disallow_positional_args` (should use Py3 `*, ` instead)
* `tfds.features` can now be saved/loaded; you may have to overwrite [FeatureConnector.from_json_content](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeatureConnector?version=nightly#from_json_content) and `FeatureConnector.to_json_content` to support this feature.
* Stop testing against TF 1.15. Requires Python 3.6.8+.

Other bug fixes:

* Better archive extension detection for `dl_manager.download_and_extract`
* Fix `tfds.__version__` in TFDS nightly to be PEP440 compliant
* Fix crash when GCS not available
* Script to detect dead-urls
* Improved open-source workflow, contributor guide, documentation
* Many other internal cleanups, bugs, dead code removal, py2->py3 cleanup, pytype annotations,...

PiperOrigin-RevId: 335617834
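The `next(ds)` -> `next(iter(ds))` migration in the release notes follows from the difference between an iterable and an iterator. Below is a minimal pure-Python sketch of that difference. `ReiterableSketch` is a hypothetical stand-in for the object returned by `tfds.as_numpy()`, not the TFDS implementation.

```python
class ReiterableSketch:
    """Hypothetical stand-in: an iterable that hands out a fresh iterator each time."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        # A new iterator per call, so the data can be traversed repeatedly.
        return iter(self._values)


ds = ReiterableSketch([10, 20, 30])

first_pass = list(ds)
second_pass = list(ds)  # Works again: the iterable is not exhausted.

first_example = next(iter(ds))  # Migrated form of `next(ds)`.

try:
    next(ds)  # An object without `__next__` is not an iterator.
    next_on_iterable_failed = False
except TypeError:
    next_on_iterable_failed = True
```

Code that previously called `next(ds)` directly would hit the `TypeError` branch with an object of this kind, which is why the notes suggest wrapping it in `iter(...)` first.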
* Add mocking policies
* Mock dataset_info file
* Minor Changes
* Fix imagenet_v2 dataset
* CleanUP
* clean oxford_flowers102
* Fix `tfds.__version__` for nightly release
  PiperOrigin-RevId: 333023695
* Disable compute statistics by default
  * Creates conflicts with some versions of TFDV/apache_beam
  * Slows down the generation speed, while few users require statistics
  * Instead, statistics will be computed separately from the generation, with some `builder.compute_statistics()`

  PiperOrigin-RevId: 333026241
* clean spoken_digit
* clean oxford_iiit_pet
* Update dtd.py
  Dataset generated successfully
* Cleanup code for arc and cbis_ddsm
* Clean up code
* Raise error if input to even_splits is larger than 100
  Currently if n > 100, then there will be duplicate items in partitions. Also updated the test for n=0 and n=101 which should raise ValueError.
  PiperOrigin-RevId: 333084035
* Cleanup code for Visual Domain Decathlon dataset
* Update Config Version
* Cleanup code for Geirhos Conflict Stimuli dataset
* Fix error message format
  PiperOrigin-RevId: 333239992
* Add `as_supervised` support for `tfds.as_dataframe`
  Fix tensorflow#2476
  PiperOrigin-RevId: 333241467
* Cleanup IN_DEVELOPMENT property.
  PiperOrigin-RevId: 333343892
* Switch to standalone tifffile in tensorflow_datasets.
  PiperOrigin-RevId: 333382851
* Update tfds.find_builder_from_dir to support multiple data_dir
  PiperOrigin-RevId: 333472732
* Fix nightly `__version__` to be PEP440 compliant
  PiperOrigin-RevId: 333482028
* Automated documentation update.
  PiperOrigin-RevId: 333574599
* Fix or ignore some pytype errors.
  PiperOrigin-RevId: 333581335
* Fix PyPI nightly name
  PiperOrigin-RevId: 333691260
* Pass kwargs to Image feature connector
* Adding landing page calling for partners with external companies.
  PiperOrigin-RevId: 333758717
* Update Release-Notes
* Update comments
* Parameterize test with number of channels
* Add 4 to ACCEPTABLE_CHANNELS for png encoding
* Add deprecation message when using `create_new_dataset`
  PiperOrigin-RevId: 334839865
* Added release notes to Groove, VCTK and ImageNet
* Add release notes for abstract reasoning and binarized mnist
* Add release notes to CelebA, CelebAGQ, Clevr
* Add release notes for downsampled imagenet and dsprites
* Add release notes for lsun and shape3d
* Release notes for Bigearthnet,caltech & catsvsdogs
* Add release notes for chis_ddsm
* Add release notes for cifar10_corrupted
* Add release notes for colorectal histology
* Add release notes for cyclegan
* Add release notes for Deep Weeds
* Format Fixes
* Update ucf101.py
* Update bair_robot_pushing.py
* Add release notes for diabetic retinopathy detection
* Add release notes for horses_or_humans
* Add release notes for imagenet2012_corrupted
* Add release notes for imagenet2012_real
* Add release notes for imagenet2012_subset
* Add release note for imagenet
* Add release notes for mnist_corrupted
* Add release notes for mnist
* Add release notes for patch_camelyon
* Add release notes for quickdraw
* Add release notes for rock_paper_scissors
* Add release notes for smallnorb
* Add release notes for so2sat
* Add release notes for svhn
* Add release notes for uc_merced
* Add release notes for kiti dataset
* Add release notes for higgs dataset
* Add release notes for iris dataset
* Add release notes for titanic dataset
* Add release notes for big patent
* Add release notes for cnn_dailymail
* Add release notes for xsum
* Add release notes for c4 dataset
* Add release notes for glue dataset
* Add release notes for imdb dataset
* Add release notes for lm1b dataset
* Add release notes for multi_nli
* Add release notes for wikipedia
* Add release notes for wikipedia_toxicity_subtypes
* Add release notes for ted_hrlr
* Add release notes for starcraft
* Update `tfds.load` to load datasets from files without using original class.

  **Before:** In `tfds.load('my_dataset')`, the `DatasetBuilder` was always created from the original dataset class (`MyDataset`).

  **After:** If the version is specified, `tfds.load`/`tfds.builder` first check whether an existing version is found on disk. If found, only the files are used to restore the dataset (without ever instantiating the original generation class `MyDataset`).

  * `tfds.load('my_dataset:2.*.*')` can load `.../my_dataset/2.0.3/` files even if `MyDataset.VERSION == '3.0.0'`. This improves backward compatibility.
  * The dataset can be read even if the generation code isn't reachable anymore. In this case, `tfds.load('my_dataset')` will load the most recent version found on disk. So you can load a dataset generated by someone else without having to import the original dataset code. (Caveat: in this case, the config name must be explicit, as TFDS can't currently infer the default config name.)

  PiperOrigin-RevId: 335076918
* Update deprecated argument for map_fn
  PiperOrigin-RevId: 335080599
* Split test_utils into feature_test_cases.py
  To avoid circular deps `test_utils.mock_tf` -> `test_case.TestCase` -> `test_utils.FeatureExpectationTestCase`, `FeatureExpectationTestCase` is moved to a new `feauture_test_case.py`
  PiperOrigin-RevId: 335081584
* Format fixes
* Format fixes
* Imagenet release notes
* Revert "Imagenet release notes"
  This reverts commit 53da06d.
* Format fixes
* Format fixes
* Add setup_teardown.py to share setup between pytest/unittest
  PiperOrigin-RevId: 335168637
* Fix comment in ANLI dataset.
  PiperOrigin-RevId: 335169563
* Additional mock gfile API fixes
  PiperOrigin-RevId: 335173931
* Add release notes for nsynth
* Add release notes for omniglot dataset
* Add release notes for open-images dataset
* Add release notes for moving_mnist
* Automated documentation update.
  PiperOrigin-RevId: 335215096
* Update DatasetBuilder to use version.list_all_versions
  PiperOrigin-RevId: 335220465
* Automated documentation update.
  PiperOrigin-RevId: 335476272
* Improve debug message for audio file encoding (Fix tensorflow#2513)
  PiperOrigin-RevId: 335587214
* Update TFDS version to 4.0.0

  API changes, new features:

  * Dataset-as-folder: A dataset can now be a self-contained module in a folder with checksums, dummy data,... This simplifies implementing datasets outside the TFDS repository.
  * `tfds.load` can now load a dataset without using the generation class. So `tfds.load('my_dataset:1.0.0')` can work even if `MyDataset.VERSION == '2.0.0'` (see tensorflow#2493).
  * Add a new TFDS CLI (see https://www.tensorflow.org/datasets/cli for details)
  * `tfds.testing.mock_data` does not require metadata files anymore!
  * Add `tfds.as_dataframe(ds, ds_info)` with custom visualisation ([example](https://www.tensorflow.org/datasets/overview#tfdsas_dataframe))
  * Add `tfds.even_splits` to generate subsplits (e.g. `tfds.even_splits('train', n=3) == ['train[0%:33%]', 'train[33%:67%]', ...]`)
  * Add new `DatasetBuilder.RELEASE_NOTES` property
  * `tfds.features.Image` now supports PNG with 4 channels
  * `tfds.ImageFolder` now supports custom shape, dtype
  * Downloaded URLs are available through `MyDataset.url_infos`
  * Add `skip_prefetch` option to `tfds.ReadConfig`
  * `as_supervised=True` support for `tfds.show_examples`, `tfds.as_dataframe`

  Breaking compatibility changes:

  * `tfds.as_numpy()` now returns an iterable which can be iterated multiple times. To migrate: `next(ds)` -> `next(iter(ds))`
  * Rename `tfds.features.text.Xyz` -> `tfds.deprecated.text.Xyz`
  * Remove `DatasetBuilder.IN_DEVELOPMENT` property
  * Remove `tfds.core.disallow_positional_args` (should use Py3 `*, ` instead)
  * `tfds.features` can now be saved/loaded; you may have to overwrite [FeatureConnector.from_json_content](https://www.tensorflow.org/datasets/api_docs/python/tfds/features/FeatureConnector?version=nightly#from_json_content) and `FeatureConnector.to_json_content` to support this feature.
  * Stop testing against TF 1.15. Requires Python 3.6.8+.

  Other bug fixes:

  * Better archive extension detection for `dl_manager.download_and_extract`
  * Fix `tfds.__version__` in TFDS nightly to be PEP440 compliant
  * Fix crash when GCS not available
  * Script to detect dead-urls
  * Improved open-source workflow, contributor guide, documentation
  * Many other internal cleanups, bugs, dead code removal, py2->py3 cleanup, pytype annotations,...

  And of course, new datasets and dataset updates. A gigantic thanks to our community, which has helped us debug issues and implement many features, especially vijayphoenix@, who has been one of our main contributors for this release.

  PiperOrigin-RevId: 335667395
* Add Hellaswag dataset to TFDS.
  PiperOrigin-RevId: 335692901

Co-authored-by: vijayphoenix <cs17btech11040@iith.ac.in>
Co-authored-by: Ak_0330 <ak03032000@gmail.com>
Co-authored-by: SuryashankarDas <suryashankardas.2002@gmail.com>
Co-authored-by: Etienne Pot <epot@google.com>
Co-authored-by: Keshav.J.Bajaj <46819436+Keshav15@users.noreply.github.com>
Co-authored-by: NikhilBartwal <nikhilbartwal1234@gmail.com>
Co-authored-by: TensorFlow Datasets Team <no-reply@google.com>
Co-authored-by: Copybara-Service <copybara-worker@google.com>
Co-authored-by: Tobias Maier <tobias.maier@humai.tech>
Co-authored-by: Keshav.J.Bajaj <keshavbajaj4444@gmail.com>
Co-authored-by: Sharan Narang <sharannarang@google.com>
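The `even_splits` commit in the log above says that `n > 100` would produce duplicate partitions and that `n=0` and `n=101` should raise `ValueError`. A rough sketch of that behavior is shown here. `even_splits_sketch` is a hypothetical name, and the rounding scheme is an assumption chosen to reproduce the `n=3` example from the release notes; it is not the TFDS source.

```python
def even_splits_sketch(split, n):
    """Hypothetical sketch: cut `split` into `n` percent-based subsplits."""
    if n <= 0 or n > 100:
        # With more than 100 parts, some integer percent boundaries would
        # repeat, giving duplicate (empty) partitions.
        raise ValueError(f"n should be > 0 and <= 100. Got {n}")
    # Integer percent boundaries: 0, ..., 100 (assumed rounding scheme).
    boundaries = [round(i * 100 / n) for i in range(n + 1)]
    return [
        f"{split}[{start}%:{end}%]"
        for start, end in zip(boundaries, boundaries[1:])
    ]


three_way = even_splits_sketch("train", n=3)


def raises_value_error(n):
    """Return True if the sketch rejects `n`."""
    try:
        even_splits_sketch("train", n=n)
    except ValueError:
        return True
    return False
```

Each returned string uses the percent slicing syntax quoted in the release notes, so it could be passed as a `split` argument in the same way as the documented example.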
Update `tfds.load` to load datasets from files without using original class.

**Before:** In `tfds.load('my_dataset')`, the `DatasetBuilder` was always created from the original dataset class (`MyDataset`).

**After:** If the version is specified, `tfds.load`/`tfds.builder` first check whether an existing version is found on disk. If found, only the files are used to restore the dataset (without ever instantiating the original generation class `MyDataset`).

* `tfds.load('my_dataset:2.*.*')` can load `.../my_dataset/2.0.3/` files even if `MyDataset.VERSION == '3.0.0'`. This improves backward compatibility.
* The dataset can be read even if the generation code isn't reachable anymore. In this case, `tfds.load('my_dataset')` will load the most recent version found on disk. So you can load a dataset generated by someone else without having to import the original dataset code. (Caveat: in this case, the config name must be explicit, as TFDS can't currently infer the default config name.)

Note: This requires a dataset generated with TFDS 4.0.0+.
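The lookup described in this PR (match a requested version such as `2.*.*` against the version folders found on disk, or fall back to the most recent one) can be illustrated with a small pure-Python sketch. The helper names `matches_sketch` and `resolve_version_sketch` are hypothetical and the logic is an assumption about the general idea; it does not reproduce the TFDS implementation.

```python
def _version_key(version):
    # Compare versions numerically, so '2.0.10' sorts after '2.0.3'.
    return tuple(int(part) for part in version.split("."))


def matches_sketch(requested, candidate):
    """Return True if `candidate` fits `requested`, where '*' is a wildcard."""
    req_parts = requested.split(".")
    cand_parts = candidate.split(".")
    return len(req_parts) == len(cand_parts) and all(
        r == "*" or r == c for r, c in zip(req_parts, cand_parts)
    )


def resolve_version_sketch(requested, versions_on_disk):
    """Pick the most recent on-disk version matching `requested`.

    `requested=None` means no version was specified, so every version found
    on disk is a candidate. Returns None when nothing matches.
    """
    candidates = [
        v
        for v in versions_on_disk
        if requested is None or matches_sketch(requested, v)
    ]
    return max(candidates, key=_version_key, default=None)


on_disk = ["1.0.0", "2.0.1", "2.0.3"]  # e.g. sub-folders of .../my_dataset/
wildcard_hit = resolve_version_sketch("2.*.*", on_disk)
latest = resolve_version_sketch(None, on_disk + ["2.0.10"])
missing = resolve_version_sketch("3.0.0", on_disk)
```

In this sketch a `None` result would be the signal to fall back to the original generation class, which mirrors the "if found, only the files are used" wording of the description.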