Commit
Merge branch 'master' into issue-230
cyfra authored Jul 30, 2019
2 parents af5df53 + 3b7e2a6 commit 56c032a
Showing 594 changed files with 24,675 additions and 8,829 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -17,4 +17,7 @@ dist/
.pytest_cache/

# Other
*.DS_Store
*.DS_Store

# PyCharm
.idea
10 changes: 8 additions & 2 deletions README.md
@@ -8,7 +8,13 @@ TensorFlow Datasets provides many public datasets as `tf.data.Datasets`.
* [List of datasets](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md)
* [Try it in Colab](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb)
* [API docs](https://www.tensorflow.org/datasets/api_docs/python/tfds)
* [Add a dataset](https://github.com/tensorflow/datasets/tree/master/docs/add_dataset.md)
* Guides
* [Overview](https://www.tensorflow.org/datasets/overview)
* [Datasets versioning](https://www.tensorflow.org/datasets/datasets_versioning)
* [Using splits and slicing API](https://www.tensorflow.org/datasets/splits)
* [Add a dataset](https://www.tensorflow.org/datasets/add_dataset)
* [Add a huge dataset (>>100GiB)](https://www.tensorflow.org/datasets/beam_datasets)


**Table of Contents**

@@ -24,7 +30,7 @@ TensorFlow Datasets provides many public datasets as `tf.data.Datasets`.
```sh
pip install tensorflow-datasets

# Requires TF 1.12+ to be installed.
# Requires TF 1.14+ to be installed.
# Some datasets require additional libraries; see setup.py extras_require
pip install tensorflow
# or:
4 changes: 4 additions & 0 deletions docs/_book.yaml
@@ -19,10 +19,14 @@ upper_tabs:
contents:
- title: Overview
path: /datasets/overview
- title: Versioning
path: /datasets/datasets_versioning
- title: Splits
path: /datasets/splits
- title: Add a dataset
path: /datasets/add_dataset
- title: Feature decoding
path: /datasets/decode
- title: Add huge datasets
path: /datasets/beam_datasets
- title: Store your dataset on GCS
4 changes: 2 additions & 2 deletions docs/_project.yaml
@@ -1,11 +1,11 @@
name: TensorFlow Datasets
breadcrumb_name: Datasets v1.0.2
breadcrumb_name: Datasets v1.1.0
home_url: /datasets/
parent_project_metadata_path: /_project.yaml
description: >
A collection of datasets ready to use with TensorFlow.
use_site_branding: true
hide_from_products_list: true
content_license: cc3-apache2
content_license: cc-apache
buganizer_id: 473701
include: /_project_included.yaml
37 changes: 23 additions & 14 deletions docs/add_dataset.md
@@ -26,11 +26,12 @@ already added.
* [Create your own `FeatureConnector`](#create-your-own-featureconnector)
* [Adding the dataset to `tensorflow/datasets`](#adding-the-dataset-to-tensorflowdatasets)
* [1. Add an import for registration](#1-add-an-import-for-registration)
* [2. Run download_and_prepare locally](#2-run-download-and-prepare-locally)
* [2. Run download_and_prepare locally](#2-run-download_and_prepare-locally)
* [3. Double-check the citation](#3-double-check-the-citation)
* [4. Add a test](#4-add-a-test)
* [5. Check your code style](#5-check-your-code-style)
* [6. Send for review!](#6-send-for-review)
* [6. Add release notes](#6-add-release-notes)
* [7. Send for review!](#7-send-for-review)
* [Large datasets and distributed generation](#large-datasets-and-distributed-generation)
* [Testing `MyDataset`](#testing-mydataset)

@@ -102,7 +103,8 @@ Its subclasses implement:
[`DatasetInfo`](api_docs/python/tfds/core/DatasetInfo.md) object
describing the dataset
* `_split_generators`: downloads the source data and defines the dataset splits
* `_generate_examples`: yields examples in the dataset from the source data
* `_generate_examples`: yields `(key, example)` tuples in the dataset from the
source data

This guide will use `GeneratorBasedBuilder`.

@@ -130,13 +132,16 @@ class MyDataset(tfds.core.GeneratorBasedBuilder):

def _generate_examples(self):
# Yields examples from the dataset
pass # TODO
yield 'key', {}
```

If you'd like to follow a test-driven development workflow, which can help you
iterate faster, jump to the [testing instructions](#testing-mydataset) below,
add the test, and then return here.

For an explanation of what the version is, please read
[datasets versioning](datasets_versioning.md).

## Specifying `DatasetInfo`

[`DatasetInfo`](api_docs/python/tfds/core/DatasetInfo.md) describes the
@@ -225,15 +230,13 @@ through [`tfds.Split.subsplit`](splits.md#subsplit).
return [
tfds.core.SplitGenerator(
name=tfds.Split.TRAIN,
num_shards=10,
gen_kwargs={
"images_dir_path": os.path.join(extracted_path, "train"),
"labels": os.path.join(extracted_path, "train_labels.csv"),
},
),
tfds.core.SplitGenerator(
name=tfds.Split.TEST,
num_shards=1,
gen_kwargs={
"images_dir_path": os.path.join(extracted_path, "test"),
"labels": os.path.join(extracted_path, "test_labels.csv"),
@@ -246,10 +249,6 @@ through [`tfds.Split.subsplit`](splits.md#subsplit).
will be passed as keyword arguments to `_generate_examples`, which we'll define
next.
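As a loose illustration (plain Python, independent of TFDS, using the hypothetical paths from the snippet above), "passed as keyword arguments" means TFDS effectively splats each split's `gen_kwargs` into the generator call:

```python
# Sketch only: a stand-in for the real `_generate_examples` method, to show
# how `gen_kwargs` from a SplitGenerator become keyword arguments.
def _generate_examples(images_dir_path, labels):
    # A real implementation would read files here; this stub just echoes
    # the arguments it received.
    yield "example_key", {"images_dir_path": images_dir_path, "labels": labels}

# The dict a SplitGenerator would carry as `gen_kwargs` (paths are made up):
gen_kwargs = {
    "images_dir_path": "/tmp/extracted/train",
    "labels": "/tmp/extracted/train_labels.csv",
}

# TFDS invokes the generator roughly like this:
examples = list(_generate_examples(**gen_kwargs))
```

This is why the keys of `gen_kwargs` must match the parameter names of your `_generate_examples` signature exactly.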

When specifying `num_shards`, which determines how many files the split will
use, pick a number such that a single shard is less than 4 GiB, as each shard
will be loaded in memory for shuffling.

## Writing an example generator

`_generate_examples` generates the examples for each split from the
@@ -264,8 +263,8 @@ builder._generate_examples(
```

This method will typically read source dataset artifacts (e.g. a CSV file) and
yield feature dictionaries that correspond to the features specified in
`DatasetInfo`.
yield (key, feature dictionary) tuples that correspond to the features specified
in `DatasetInfo`.

```python
def _generate_examples(self, images_dir_path, labels):
@@ -277,7 +276,7 @@ def _generate_examples(self, images_dir_path, labels):

# And yield examples as feature dictionaries
for image_id, description, label in data:
yield {
yield image_id, {
"image_description": description,
"image": "%s/%s.jpeg" % (images_dir_path, image_id),
"label": label,
@@ -289,6 +288,10 @@ format suitable for writing to disk (currently we use `tf.train.Example`
protocol buffers). For example, `tfds.features.Image` will copy out the
JPEG content of the passed image files automatically.

The key (here: `image_id`) should uniquely identify the record. It is used to
shuffle the dataset globally. If two records are yielded using the same key,
an exception will be raised during preparation of the dataset.
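As a minimal sketch of that contract (plain Python, no TFDS dependency, with made-up record data), the uniqueness requirement amounts to:

```python
def generate_examples(rows):
    """Yield (key, example) pairs, enforcing the rule that keys must be
    unique. TFDS raises a similar error during dataset preparation if two
    records share a key, since keys drive the global shuffle."""
    seen = set()
    for image_id, label in rows:
        if image_id in seen:
            raise ValueError("Duplicate example key: %r" % image_id)
        seen.add(image_id)
        yield image_id, {"label": label}

# Unique keys: fine.
examples = dict(generate_examples([("img_0", 3), ("img_1", 7)]))
```

A stable identifier from the source data (a filename, a row ID) is usually the safest choice of key, since it stays deterministic across runs.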

If you've implemented the test harness, your builder test should now pass.

### File access and `tf.io.gfile`
@@ -551,7 +554,13 @@ See
[TensorFlow code style guide](https://www.tensorflow.org/community/contribute/code_style)
for more information.

### 6. Send for review!
### 6. Add release notes

Add the dataset to the
[release notes](https://github.com/tensorflow/datasets/blob/master/docs/release_notes.md).
The release note will be published for the next release.

### 7. Send for review!

Send the pull request for review.

2 changes: 2 additions & 0 deletions docs/api_docs/python/_redirects.yaml
@@ -3,6 +3,8 @@ redirects:
to: /datasets/api_docs/python/tfds/download/GenerateMode
- from: /datasets/api_docs/python/tfds/testing/FeatureExpectationsTestCase/failureException
to: /datasets/api_docs/python/tfds/testing/DatasetBuilderTestCase/failureException
- from: /datasets/api_docs/python/tfds/testing/SubTestCase/failureException
to: /datasets/api_docs/python/tfds/testing/DatasetBuilderTestCase/failureException
- from: /datasets/api_docs/python/tfds/testing/TestCase/failureException
to: /datasets/api_docs/python/tfds/testing/DatasetBuilderTestCase/failureException
- from: /datasets/api_docs/python/tfds/features/text
20 changes: 18 additions & 2 deletions docs/api_docs/python/_toc.yaml
@@ -32,12 +32,18 @@ toc:
path: /datasets/api_docs/python/tfds/core/DatasetBuilder
- title: DatasetInfo
path: /datasets/api_docs/python/tfds/core/DatasetInfo
- title: Experiment
path: /datasets/api_docs/python/tfds/core/Experiment
- title: GeneratorBasedBuilder
path: /datasets/api_docs/python/tfds/core/GeneratorBasedBuilder
- title: get_tfds_path
path: /datasets/api_docs/python/tfds/core/get_tfds_path
- title: lazy_imports
path: /datasets/api_docs/python/tfds/core/lazy_imports
- title: Metadata
path: /datasets/api_docs/python/tfds/core/Metadata
- title: MetadataDict
path: /datasets/api_docs/python/tfds/core/MetadataDict
- title: NamedSplit
path: /datasets/api_docs/python/tfds/core/NamedSplit
- title: SplitBase
@@ -50,6 +56,16 @@ toc:
path: /datasets/api_docs/python/tfds/core/SplitInfo
- title: Version
path: /datasets/api_docs/python/tfds/core/Version
- title: tfds.decode
section:
- title: Overview
path: /datasets/api_docs/python/tfds/decode
- title: Decoder
path: /datasets/api_docs/python/tfds/decode/Decoder
- title: make_decoder
path: /datasets/api_docs/python/tfds/decode/make_decoder
- title: SkipDecoding
path: /datasets/api_docs/python/tfds/decode/SkipDecoding
- title: tfds.download
section:
- title: Overview
@@ -88,8 +104,6 @@ toc:
path: /datasets/api_docs/python/tfds/features/Image
- title: Sequence
path: /datasets/api_docs/python/tfds/features/Sequence
- title: SequenceDict
path: /datasets/api_docs/python/tfds/features/SequenceDict
- title: Tensor
path: /datasets/api_docs/python/tfds/features/Tensor
- title: TensorInfo
@@ -146,6 +160,8 @@ toc:
path: /datasets/api_docs/python/tfds/testing/rm_tmp_dir
- title: run_in_graph_and_eager_modes
path: /datasets/api_docs/python/tfds/testing/run_in_graph_and_eager_modes
- title: SubTestCase
path: /datasets/api_docs/python/tfds/testing/SubTestCase
- title: TestCase
path: /datasets/api_docs/python/tfds/testing/TestCase
- title: test_main
10 changes: 9 additions & 1 deletion docs/api_docs/python/index.md
@@ -10,7 +10,10 @@
* <a href="./tfds/core/BuilderConfig.md"><code>tfds.core.BuilderConfig</code></a>
* <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a>
* <a href="./tfds/core/DatasetInfo.md"><code>tfds.core.DatasetInfo</code></a>
* <a href="./tfds/core/Experiment.md"><code>tfds.core.Experiment</code></a>
* <a href="./tfds/core/GeneratorBasedBuilder.md"><code>tfds.core.GeneratorBasedBuilder</code></a>
* <a href="./tfds/core/Metadata.md"><code>tfds.core.Metadata</code></a>
* <a href="./tfds/core/MetadataDict.md"><code>tfds.core.MetadataDict</code></a>
* <a href="./tfds/core/NamedSplit.md"><code>tfds.core.NamedSplit</code></a>
* <a href="./tfds/core/SplitBase.md"><code>tfds.core.SplitBase</code></a>
* <a href="./tfds/core/SplitDict.md"><code>tfds.core.SplitDict</code></a>
@@ -19,6 +22,10 @@
* <a href="./tfds/core/Version.md"><code>tfds.core.Version</code></a>
* <a href="./tfds/core/get_tfds_path.md"><code>tfds.core.get_tfds_path</code></a>
* <a href="./tfds/core/lazy_imports.md"><code>tfds.core.lazy_imports</code></a>
* <a href="./tfds/decode.md"><code>tfds.decode</code></a>
* <a href="./tfds/decode/Decoder.md"><code>tfds.decode.Decoder</code></a>
* <a href="./tfds/decode/SkipDecoding.md"><code>tfds.decode.SkipDecoding</code></a>
* <a href="./tfds/decode/make_decoder.md"><code>tfds.decode.make_decoder</code></a>
* <a href="./tfds/disable_progress_bar.md"><code>tfds.disable_progress_bar</code></a>
* <a href="./tfds/download.md"><code>tfds.download</code></a>
* <a href="./tfds/download/ComputeStatsMode.md"><code>tfds.download.ComputeStatsMode</code></a>
@@ -37,7 +44,6 @@
* <a href="./tfds/features/FeaturesDict.md"><code>tfds.features.FeaturesDict</code></a>
* <a href="./tfds/features/Image.md"><code>tfds.features.Image</code></a>
* <a href="./tfds/features/Sequence.md"><code>tfds.features.Sequence</code></a>
* <a href="./tfds/features/SequenceDict.md"><code>tfds.features.SequenceDict</code></a>
* <a href="./tfds/features/Tensor.md"><code>tfds.features.Tensor</code></a>
* <a href="./tfds/features/TensorInfo.md"><code>tfds.features.TensorInfo</code></a>
* <a href="./tfds/features/Text.md"><code>tfds.features.Text</code></a>
@@ -64,6 +70,8 @@
* <a href="./tfds/testing/FeatureExpectationItem.md"><code>tfds.testing.FeatureExpectationItem</code></a>
* <a href="./tfds/testing/FeatureExpectationsTestCase.md"><code>tfds.testing.FeatureExpectationsTestCase</code></a>
* <a href="./tfds/testing/DatasetBuilderTestCase/failureException.md"><code>tfds.testing.FeatureExpectationsTestCase.failureException</code></a>
* <a href="./tfds/testing/SubTestCase.md"><code>tfds.testing.SubTestCase</code></a>
* <a href="./tfds/testing/DatasetBuilderTestCase/failureException.md"><code>tfds.testing.SubTestCase.failureException</code></a>
* <a href="./tfds/testing/TestCase.md"><code>tfds.testing.TestCase</code></a>
* <a href="./tfds/testing/DatasetBuilderTestCase/failureException.md"><code>tfds.testing.TestCase.failureException</code></a>
* <a href="./tfds/testing/make_tmp_dir.md"><code>tfds.testing.make_tmp_dir</code></a>
38 changes: 23 additions & 15 deletions docs/api_docs/python/tfds.md
@@ -5,11 +5,15 @@

# Module: tfds

<table class="tfo-notebook-buttons tfo-api" align="left">
</table>

<a target="_blank" href="https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/__init__.py">View
source</a>

`tensorflow_datasets` (<a href="./tfds.md"><code>tfds</code></a>) defines a
collection of datasets ready-to-use with TensorFlow.

Defined in [`__init__.py`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/__init__.py).

<!-- Placeholder for "Used in" -->

Each dataset is defined as a <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a>, which encapsulates
@@ -22,47 +26,51 @@ The main library entrypoints are:
* <a href="./tfds/load.md"><code>tfds.load</code></a>: convenience method to construct a builder, download the data, and
create an input pipeline, returning a `tf.data.Dataset`.

Documentation:
#### Documentation:

* These API docs
* [Available datasets](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md)
* [Colab tutorial](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb)
* [Add a dataset](https://github.com/tensorflow/datasets/tree/master/docs/add_dataset.md)
* These API docs
* [Available datasets](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md)
* [Colab tutorial](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb)
* [Add a dataset](https://github.com/tensorflow/datasets/tree/master/docs/add_dataset.md)

## Modules

[`core`](./tfds/core.md) module: API to define datasets.

[`decode`](./tfds/decode.md) module: Decoder public API.

[`download`](./tfds/download.md) module: <a href="./tfds/download/DownloadManager.md"><code>tfds.download.DownloadManager</code></a> API.

[`features`](./tfds/features.md) module: <a href="./tfds/features/FeatureConnector.md"><code>tfds.features.FeatureConnector</code></a> API defining feature types.

[`file_adapter`](./tfds/file_adapter.md) module: <a href="./tfds/file_adapter/FileFormatAdapter.md"><code>tfds.file_adapter.FileFormatAdapter</code></a>s for GeneratorBasedBuilder.

[`units`](./tfds/units.md) module: Defines convenience constants/functions for converting various units.

[`testing`](./tfds/testing.md) module: Testing utilities.

[`units`](./tfds/units.md) module: Defines convenience constants/functions for
converting various units.

## Classes

[`class GenerateMode`](./tfds/download/GenerateMode.md): `Enum` for how to treat pre-existing downloads and data.

[`class percent`](./tfds/percent.md): Syntactic sugar for defining slice subsplits: `tfds.percent[75:-5]`.

[`class Split`](./tfds/Split.md): `Enum` for dataset splits.

[`class percent`](./tfds/percent.md): Syntactic sugar for defining slice subsplits: `tfds.percent[75:-5]`.

## Functions

[`as_numpy(...)`](./tfds/as_numpy.md): Converts a `tf.data.Dataset` to an iterable of NumPy arrays.

[`builder(...)`](./tfds/builder.md): Fetches a <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a> by string name.

[`list_builders(...)`](./tfds/list_builders.md): Returns the string names of all <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a>s.

[`load(...)`](./tfds/load.md): Loads the named dataset into a `tf.data.Dataset`.

[`disable_progress_bar(...)`](./tfds/disable_progress_bar.md): Disables the
Tqdm progress bar.

[`is_dataset_on_gcs(...)`](./tfds/is_dataset_on_gcs.md): If the dataset is
available on the GCS bucket gs://tfds-data/datasets.

[`list_builders(...)`](./tfds/list_builders.md): Returns the string names of all <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a>s.

[`load(...)`](./tfds/load.md): Loads the named dataset into a `tf.data.Dataset`.
