Commit
Merge branch 'master' into issue-230
cyfra authored Jul 30, 2019
2 parents af5df53 + 3b7e2a6 commit 56c032a
Showing 594 changed files with 24,675 additions and 8,829 deletions.
5 changes: 4 additions & 1 deletion .gitignore
@@ -17,4 +17,7 @@ dist/
.pytest_cache/

# Other
*.DS_Store
*.DS_Store

# PyCharm
.idea
10 changes: 8 additions & 2 deletions README.md
@@ -8,7 +8,13 @@ TensorFlow Datasets provides many public datasets as `tf.data.Datasets`.
* [List of datasets](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md)
* [Try it in Colab](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb)
* [API docs](https://www.tensorflow.org/datasets/api_docs/python/tfds)
* [Add a dataset](https://github.com/tensorflow/datasets/tree/master/docs/add_dataset.md)
* Guides
* [Overview](https://www.tensorflow.org/datasets/overview)
* [Datasets versioning](https://www.tensorflow.org/datasets/datasets_versioning)
* [Using splits and slicing API](https://www.tensorflow.org/datasets/splits)
* [Add a dataset](https://www.tensorflow.org/datasets/add_dataset)
* [Add a huge dataset (>>100GiB)](https://www.tensorflow.org/datasets/beam_datasets)


**Table of Contents**

@@ -24,7 +30,7 @@ TensorFlow Datasets provides many public datasets as `tf.data.Datasets`.
```sh
pip install tensorflow-datasets

# Requires TF 1.12+ to be installed.
# Requires TF 1.14+ to be installed.
# Some datasets require additional libraries; see setup.py extras_require
pip install tensorflow
# or:
4 changes: 4 additions & 0 deletions docs/_book.yaml
@@ -19,10 +19,14 @@ upper_tabs:
contents:
- title: Overview
path: /datasets/overview
- title: Versioning
path: /datasets/datasets_versioning
- title: Splits
path: /datasets/splits
- title: Add a dataset
path: /datasets/add_dataset
- title: Feature decoding
path: /datasets/decode
- title: Add huge datasets
path: /datasets/beam_datasets
- title: Store your dataset on GCS
4 changes: 2 additions & 2 deletions docs/_project.yaml
@@ -1,11 +1,11 @@
name: TensorFlow Datasets
breadcrumb_name: Datasets v1.0.2
breadcrumb_name: Datasets v1.1.0
home_url: /datasets/
parent_project_metadata_path: /_project.yaml
description: >
A collection of datasets ready to use with TensorFlow.
use_site_branding: true
hide_from_products_list: true
content_license: cc3-apache2
content_license: cc-apache
buganizer_id: 473701
include: /_project_included.yaml
37 changes: 23 additions & 14 deletions docs/add_dataset.md
@@ -26,11 +26,12 @@ already added.
* [Create your own `FeatureConnector`](#create-your-own-featureconnector)
* [Adding the dataset to `tensorflow/datasets`](#adding-the-dataset-to-tensorflowdatasets)
* [1. Add an import for registration](#1-add-an-import-for-registration)
* [2. Run download_and_prepare locally](#2-run-download-and-prepare-locally)
* [2. Run download_and_prepare locally](#2-run-download_and_prepare-locally)
* [3. Double-check the citation](#3-double-check-the-citation)
* [4. Add a test](#4-add-a-test)
* [5. Check your code style](#5-check-your-code-style)
* [6. Send for review!](#6-send-for-review)
* [6. Add release notes](#6-add-release-notes)
* [7. Send for review!](#7-send-for-review)
* [Large datasets and distributed generation](#large-datasets-and-distributed-generation)
* [Testing `MyDataset`](#testing-mydataset)

@@ -102,7 +103,8 @@ Its subclasses implement:
[`DatasetInfo`](api_docs/python/tfds/core/DatasetInfo.md) object
describing the dataset
* `_split_generators`: downloads the source data and defines the dataset splits
* `_generate_examples`: yields examples in the dataset from the source data
* `_generate_examples`: yields `(key, example)` tuples in the dataset from the
source data

This guide will use `GeneratorBasedBuilder`.

@@ -130,13 +132,16 @@ class MyDataset(tfds.core.GeneratorBasedBuilder):

def _generate_examples(self):
# Yields examples from the dataset
pass # TODO
yield 'key', {}
```

If you'd like to follow a test-driven development workflow, which can help you
iterate faster, jump to the [testing instructions](#testing-mydataset) below,
add the test, and then return here.

For an explanation of what the version is, please read
[datasets versioning](datasets_versioning.md).

## Specifying `DatasetInfo`

[`DatasetInfo`](api_docs/python/tfds/core/DatasetInfo.md) describes the
@@ -225,15 +230,13 @@ through [`tfds.Split.subsplit`](splits.md#subsplit).
return [
tfds.core.SplitGenerator(
name=tfds.Split.TRAIN,
num_shards=10,
gen_kwargs={
"images_dir_path": os.path.join(extracted_path, "train"),
"labels": os.path.join(extracted_path, "train_labels.csv"),
},
),
tfds.core.SplitGenerator(
name=tfds.Split.TEST,
num_shards=1,
gen_kwargs={
"images_dir_path": os.path.join(extracted_path, "test"),
"labels": os.path.join(extracted_path, "test_labels.csv"),
@@ -246,10 +249,6 @@ through [`tfds.Split.subsplit`](splits.md#subsplit).
will be passed as keyword arguments to `_generate_examples`, which we'll define
next.
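As a loose illustration (plain Python, independent of TFDS, using the hypothetical paths from the snippet above), "passed as keyword arguments" means TFDS effectively splats each split's `gen_kwargs` into the generator call:

```python
# Sketch only: a stand-in for the real `_generate_examples` method, to show
# how `gen_kwargs` from a SplitGenerator become keyword arguments.
def _generate_examples(images_dir_path, labels):
    # A real implementation would read files here; this stub just echoes
    # the arguments it received.
    yield "example_key", {"images_dir_path": images_dir_path, "labels": labels}

# The dict a SplitGenerator would carry as `gen_kwargs` (paths are made up):
gen_kwargs = {
    "images_dir_path": "/tmp/extracted/train",
    "labels": "/tmp/extracted/train_labels.csv",
}

# TFDS invokes the generator roughly like this:
examples = list(_generate_examples(**gen_kwargs))
```

This is why the keys of `gen_kwargs` must match the parameter names of your `_generate_examples` signature exactly.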

When specifying `num_shards`, which determines how many files the split will
use, pick a number such that a single shard is less than 4 GiB, as each shard
will be loaded in memory for shuffling.

## Writing an example generator

`_generate_examples` generates the examples for each split from the
@@ -264,8 +263,8 @@ builder._generate_examples(
```

This method will typically read source dataset artifacts (e.g. a CSV file) and
yield feature dictionaries that correspond to the features specified in
`DatasetInfo`.
yield (key, feature dictionary) tuples that correspond to the features specified
in `DatasetInfo`.

```python
def _generate_examples(self, images_dir_path, labels):
@@ -277,7 +276,7 @@ def _generate_examples(self, images_dir_path, labels):

# And yield examples as feature dictionaries
for image_id, description, label in data:
yield {
yield image_id, {
"image_description": description,
"image": "%s/%s.jpeg" % (images_dir_path, image_id),
"label": label,
@@ -289,6 +288,10 @@ format suitable for writing to disk (currently we use `tf.train.Example`
protocol buffers). For example, `tfds.features.Image` will copy out the
JPEG content of the passed image files automatically.

The key (here: `image_id`) should uniquely identify the record. It is used to
shuffle the dataset globally. If two records are yielded using the same key,
an exception will be raised during preparation of the dataset.
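As a minimal sketch of that contract (plain Python, no TFDS dependency, with made-up record data), the uniqueness requirement amounts to:

```python
def generate_examples(rows):
    """Yield (key, example) pairs, enforcing the rule that keys must be
    unique. TFDS raises a similar error during dataset preparation if two
    records share a key, since keys drive the global shuffle."""
    seen = set()
    for image_id, label in rows:
        if image_id in seen:
            raise ValueError("Duplicate example key: %r" % image_id)
        seen.add(image_id)
        yield image_id, {"label": label}

# Unique keys: fine.
examples = dict(generate_examples([("img_0", 3), ("img_1", 7)]))
```

A stable identifier from the source data (a filename, a row ID) is usually the safest choice of key, since it stays deterministic across runs.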

If you've implemented the test harness, your builder test should now pass.

### File access and `tf.io.gfile`
@@ -551,7 +554,13 @@ See
[TensorFlow code style guide](https://www.tensorflow.org/community/contribute/code_style)
for more information.

### 6. Send for review!
### 6. Add release notes

Add the dataset to the
[release notes](https://github.com/tensorflow/datasets/blob/master/docs/release_notes.md).
The release note will be published for the next release.

### 7. Send for review!

Send the pull request for review.

2 changes: 2 additions & 0 deletions docs/api_docs/python/_redirects.yaml
@@ -3,6 +3,8 @@ redirects:
to: /datasets/api_docs/python/tfds/download/GenerateMode
- from: /datasets/api_docs/python/tfds/testing/FeatureExpectationsTestCase/failureException
to: /datasets/api_docs/python/tfds/testing/DatasetBuilderTestCase/failureException
- from: /datasets/api_docs/python/tfds/testing/SubTestCase/failureException
to: /datasets/api_docs/python/tfds/testing/DatasetBuilderTestCase/failureException
- from: /datasets/api_docs/python/tfds/testing/TestCase/failureException
to: /datasets/api_docs/python/tfds/testing/DatasetBuilderTestCase/failureException
- from: /datasets/api_docs/python/tfds/features/text
20 changes: 18 additions & 2 deletions docs/api_docs/python/_toc.yaml
@@ -32,12 +32,18 @@ toc:
path: /datasets/api_docs/python/tfds/core/DatasetBuilder
- title: DatasetInfo
path: /datasets/api_docs/python/tfds/core/DatasetInfo
- title: Experiment
path: /datasets/api_docs/python/tfds/core/Experiment
- title: GeneratorBasedBuilder
path: /datasets/api_docs/python/tfds/core/GeneratorBasedBuilder
- title: get_tfds_path
path: /datasets/api_docs/python/tfds/core/get_tfds_path
- title: lazy_imports
path: /datasets/api_docs/python/tfds/core/lazy_imports
- title: Metadata
path: /datasets/api_docs/python/tfds/core/Metadata
- title: MetadataDict
path: /datasets/api_docs/python/tfds/core/MetadataDict
- title: NamedSplit
path: /datasets/api_docs/python/tfds/core/NamedSplit
- title: SplitBase
@@ -50,6 +56,16 @@ toc:
path: /datasets/api_docs/python/tfds/core/SplitInfo
- title: Version
path: /datasets/api_docs/python/tfds/core/Version
- title: tfds.decode
section:
- title: Overview
path: /datasets/api_docs/python/tfds/decode
- title: Decoder
path: /datasets/api_docs/python/tfds/decode/Decoder
- title: make_decoder
path: /datasets/api_docs/python/tfds/decode/make_decoder
- title: SkipDecoding
path: /datasets/api_docs/python/tfds/decode/SkipDecoding
- title: tfds.download
section:
- title: Overview
@@ -88,8 +104,6 @@ toc:
path: /datasets/api_docs/python/tfds/features/Image
- title: Sequence
path: /datasets/api_docs/python/tfds/features/Sequence
- title: SequenceDict
path: /datasets/api_docs/python/tfds/features/SequenceDict
- title: Tensor
path: /datasets/api_docs/python/tfds/features/Tensor
- title: TensorInfo
@@ -146,6 +160,8 @@ toc:
path: /datasets/api_docs/python/tfds/testing/rm_tmp_dir
- title: run_in_graph_and_eager_modes
path: /datasets/api_docs/python/tfds/testing/run_in_graph_and_eager_modes
- title: SubTestCase
path: /datasets/api_docs/python/tfds/testing/SubTestCase
- title: TestCase
path: /datasets/api_docs/python/tfds/testing/TestCase
- title: test_main
10 changes: 9 additions & 1 deletion docs/api_docs/python/index.md
@@ -10,7 +10,10 @@
* <a href="./tfds/core/BuilderConfig.md"><code>tfds.core.BuilderConfig</code></a>
* <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a>
* <a href="./tfds/core/DatasetInfo.md"><code>tfds.core.DatasetInfo</code></a>
* <a href="./tfds/core/Experiment.md"><code>tfds.core.Experiment</code></a>
* <a href="./tfds/core/GeneratorBasedBuilder.md"><code>tfds.core.GeneratorBasedBuilder</code></a>
* <a href="./tfds/core/Metadata.md"><code>tfds.core.Metadata</code></a>
* <a href="./tfds/core/MetadataDict.md"><code>tfds.core.MetadataDict</code></a>
* <a href="./tfds/core/NamedSplit.md"><code>tfds.core.NamedSplit</code></a>
* <a href="./tfds/core/SplitBase.md"><code>tfds.core.SplitBase</code></a>
* <a href="./tfds/core/SplitDict.md"><code>tfds.core.SplitDict</code></a>
@@ -19,6 +22,10 @@
* <a href="./tfds/core/Version.md"><code>tfds.core.Version</code></a>
* <a href="./tfds/core/get_tfds_path.md"><code>tfds.core.get_tfds_path</code></a>
* <a href="./tfds/core/lazy_imports.md"><code>tfds.core.lazy_imports</code></a>
* <a href="./tfds/decode.md"><code>tfds.decode</code></a>
* <a href="./tfds/decode/Decoder.md"><code>tfds.decode.Decoder</code></a>
* <a href="./tfds/decode/SkipDecoding.md"><code>tfds.decode.SkipDecoding</code></a>
* <a href="./tfds/decode/make_decoder.md"><code>tfds.decode.make_decoder</code></a>
* <a href="./tfds/disable_progress_bar.md"><code>tfds.disable_progress_bar</code></a>
* <a href="./tfds/download.md"><code>tfds.download</code></a>
* <a href="./tfds/download/ComputeStatsMode.md"><code>tfds.download.ComputeStatsMode</code></a>
@@ -37,7 +44,6 @@
* <a href="./tfds/features/FeaturesDict.md"><code>tfds.features.FeaturesDict</code></a>
* <a href="./tfds/features/Image.md"><code>tfds.features.Image</code></a>
* <a href="./tfds/features/Sequence.md"><code>tfds.features.Sequence</code></a>
* <a href="./tfds/features/SequenceDict.md"><code>tfds.features.SequenceDict</code></a>
* <a href="./tfds/features/Tensor.md"><code>tfds.features.Tensor</code></a>
* <a href="./tfds/features/TensorInfo.md"><code>tfds.features.TensorInfo</code></a>
* <a href="./tfds/features/Text.md"><code>tfds.features.Text</code></a>
@@ -64,6 +70,8 @@
* <a href="./tfds/testing/FeatureExpectationItem.md"><code>tfds.testing.FeatureExpectationItem</code></a>
* <a href="./tfds/testing/FeatureExpectationsTestCase.md"><code>tfds.testing.FeatureExpectationsTestCase</code></a>
* <a href="./tfds/testing/DatasetBuilderTestCase/failureException.md"><code>tfds.testing.FeatureExpectationsTestCase.failureException</code></a>
* <a href="./tfds/testing/SubTestCase.md"><code>tfds.testing.SubTestCase</code></a>
* <a href="./tfds/testing/DatasetBuilderTestCase/failureException.md"><code>tfds.testing.SubTestCase.failureException</code></a>
* <a href="./tfds/testing/TestCase.md"><code>tfds.testing.TestCase</code></a>
* <a href="./tfds/testing/DatasetBuilderTestCase/failureException.md"><code>tfds.testing.TestCase.failureException</code></a>
* <a href="./tfds/testing/make_tmp_dir.md"><code>tfds.testing.make_tmp_dir</code></a>
38 changes: 23 additions & 15 deletions docs/api_docs/python/tfds.md
@@ -5,11 +5,15 @@

# Module: tfds

<table class="tfo-notebook-buttons tfo-api" align="left">
</table>

<a target="_blank" href="https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/__init__.py">View
source</a>

`tensorflow_datasets` (<a href="./tfds.md"><code>tfds</code></a>) defines a
collection of datasets ready-to-use with TensorFlow.

Defined in [`__init__.py`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/__init__.py).

<!-- Placeholder for "Used in" -->

Each dataset is defined as a <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a>, which encapsulates
@@ -22,47 +26,51 @@ The main library entrypoints are:
* <a href="./tfds/load.md"><code>tfds.load</code></a>: convenience method to construct a builder, download the data, and
create an input pipeline, returning a `tf.data.Dataset`.

Documentation:
#### Documentation:

* These API docs
* [Available datasets](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md)
* [Colab tutorial](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb)
* [Add a dataset](https://github.com/tensorflow/datasets/tree/master/docs/add_dataset.md)
* These API docs
* [Available datasets](https://github.com/tensorflow/datasets/tree/master/docs/datasets.md)
* [Colab tutorial](https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb)
* [Add a dataset](https://github.com/tensorflow/datasets/tree/master/docs/add_dataset.md)

## Modules

[`core`](./tfds/core.md) module: API to define datasets.

[`decode`](./tfds/decode.md) module: Decoder public API.

[`download`](./tfds/download.md) module: <a href="./tfds/download/DownloadManager.md"><code>tfds.download.DownloadManager</code></a> API.

[`features`](./tfds/features.md) module: <a href="./tfds/features/FeatureConnector.md"><code>tfds.features.FeatureConnector</code></a> API defining feature types.

[`file_adapter`](./tfds/file_adapter.md) module: <a href="./tfds/file_adapter/FileFormatAdapter.md"><code>tfds.file_adapter.FileFormatAdapter</code></a>s for GeneratorBasedBuilder.

[`units`](./tfds/units.md) module: Defines convenience constants/functions for converting various units.

[`testing`](./tfds/testing.md) module: Testing utilities.

[`units`](./tfds/units.md) module: Defines convenience constants/functions for
converting various units.

## Classes

[`class GenerateMode`](./tfds/download/GenerateMode.md): `Enum` for how to treat pre-existing downloads and data.

[`class percent`](./tfds/percent.md): Syntactic sugar for defining slice subsplits: `tfds.percent[75:-5]`.

[`class Split`](./tfds/Split.md): `Enum` for dataset splits.

[`class percent`](./tfds/percent.md): Syntactic sugar for defining slice subsplits: `tfds.percent[75:-5]`.

## Functions

[`as_numpy(...)`](./tfds/as_numpy.md): Converts a `tf.data.Dataset` to an iterable of NumPy arrays.

[`builder(...)`](./tfds/builder.md): Fetches a <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a> by string name.

[`list_builders(...)`](./tfds/list_builders.md): Returns the string names of all <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a>s.

[`load(...)`](./tfds/load.md): Loads the named dataset into a `tf.data.Dataset`.

[`disable_progress_bar(...)`](./tfds/disable_progress_bar.md): Disables the
Tqdm progress bar.

[`is_dataset_on_gcs(...)`](./tfds/is_dataset_on_gcs.md): If the dataset is
available on the GCS bucket gs://tfds-data/datasets.

[`list_builders(...)`](./tfds/list_builders.md): Returns the string names of all <a href="./tfds/core/DatasetBuilder.md"><code>tfds.core.DatasetBuilder</code></a>s.

[`load(...)`](./tfds/load.md): Loads the named dataset into a `tf.data.Dataset`.
