If you have a `tf.train.Example` proto (inside `.tfrecord`, `.riegeli`,...) which has been generated by third-party tools and that you would like to load directly with the TFDS API, then this page is for you.

In order to load your `.tfrecord` files, you only need to:
- Follow the TFDS naming convention.
- Add metadata files (`dataset_info.json`, `features.json`) alongside your tfrecord files.
Limitations:

- `tf.train.SequenceExample` is not supported, only `tf.train.Example`.
- You need to be able to express the `tf.train.Example` in terms of `tfds.features` (see the section below).
In order for your .tfrecord
files to be detected by TFDS, they need to follow
the following naming convention:
<dataset_name>-<split_name>.<file-extension>-xxxxx-of-yyyyy
For example, MNIST has the following files:
mnist-test.tfrecord-00000-of-00001
mnist-train.tfrecord-00000-of-00001
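If the files produced by your third-party tool do not match this pattern, a small renaming script is enough. Here is a minimal sketch, assuming the shards are already named `train.tfrecord-00000-of-00001`, etc. and only the `<dataset_name>-` prefix is missing; `rename_shards`, the path and the regular expression are hypothetical, adapt them to your layout:

```python
import os
import re

def rename_shards(data_dir: str, dataset_name: str) -> None:
  """Prefixes `<split>.tfrecord-xxxxx-of-yyyyy` shards with the dataset name."""
  pattern = re.compile(r'^(?P<split>\w+)\.(?P<rest>tfrecord-\d{5}-of-\d{5})$')
  for fname in os.listdir(data_dir):
    match = pattern.match(fname)
    if match:
      new_name = f'{dataset_name}-{match["split"]}.{match["rest"]}'
      os.rename(os.path.join(data_dir, fname), os.path.join(data_dir, new_name))

rename_shards('/path/to/my/dataset/1.0.0/', 'my_dataset')
```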
For TFDS to be able to decode the `tf.train.Example` proto, you need to provide the `tfds.features` structure matching your specs. For example:
features = tfds.features.FeaturesDict({
    'image': tfds.features.Image(shape=(256, 256, 3)),
    'label': tfds.features.ClassLabel(names=['dog', 'cat']),
    'objects': tfds.features.Sequence({
        'camera/K': tfds.features.Tensor(shape=(3,), dtype=tf.float32),
    }),
})
This corresponds to the following `tf.train.Example` specs:
{
    'image': tf.io.FixedLenFeature(shape=(), dtype=tf.string),
    'label': tf.io.FixedLenFeature(shape=(), dtype=tf.int64),
    'objects/camera/K': tf.io.FixedLenSequenceFeature(shape=(3,), dtype=tf.float32),
}
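As a sanity check, here is a hedged writer-side sketch (plain TensorFlow, not a TFDS API) of how a third-party tool could serialize one example matching these specs; the file name and payload values are placeholders:

```python
import numpy as np
import tensorflow as tf

# Placeholder payloads: a black 256x256 PNG and 2 objects with a (3,) tensor each.
png_bytes = tf.io.encode_png(np.zeros((256, 256, 3), dtype=np.uint8)).numpy()
camera_k = np.zeros((2, 3), dtype=np.float32)

example = tf.train.Example(features=tf.train.Features(feature={
    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[png_bytes])),
    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[0])),
    # Sequence features are flattened: N objects * 3 values for `camera/K`.
    'objects/camera/K': tf.train.Feature(
        float_list=tf.train.FloatList(value=camera_k.flatten().tolist())),
}))

with tf.io.TFRecordWriter('my_dataset-train.tfrecord-00000-of-00001') as writer:
  writer.write(example.SerializeToString())
```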
Specifying the features allows TFDS to automatically decode images, videos,... Like for any other TFDS dataset, features metadata (e.g. label names,...) will be exposed to the user (e.g. `info.features['label'].names`).
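For instance, with the `features` defined above, the label metadata can already be inspected directly, and will later also be available through `builder.info.features` once the metadata files are written:

```python
features['label'].names           # ['dog', 'cat']
features['label'].str2int('cat')  # 1
```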
If you're not sure what your `tfds.features` translates into `tf.train.Example`, you can experiment in colab:

- To translate `tfds.features` into the human readable structure of the `tf.train.Example`, you can call `features.get_serialized_info()`.
- To get the exact `FixedLenFeature`,... spec passed to `tf.io.parse_single_example`, you can use the following code snippet:

      example_specs = features.get_serialized_info()
      parser = tfds.core.example_parser.ExampleParser(example_specs)
      nested_feature_specs = parser._build_feature_specs()
      feature_specs = tfds.core.utils.flatten_nest_dict(nested_feature_specs)
TFDS needs to know the exact number of examples within each shard. This is required for features like `len(ds)`, or the subsplit API: `split='train[75%:]'`.
- If you have this information, you can explicitly create a list of `tfds.core.SplitInfo` and skip to the next section:

      split_infos = [
          tfds.core.SplitInfo(
              name='train',
              shard_lengths=[1024, ...],  # Num of examples in shard0, shard1,...
              num_bytes=0,  # Total size of your dataset (if unknown, set to 0)
          ),
          tfds.core.SplitInfo(name='test', ...),
      ]
- If you do not know this information, you can compute it using the `compute_split_info.py` script (or in your own script with `tfds.folder_dataset.compute_split_info`, as sketched after this list). It will launch a beam pipeline which will read all the shards in the given directory and compute the info.
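A minimal sketch of the programmatic route, assuming a `data_dir`/`out_dir` style signature; the exact argument names of `compute_split_info` may differ between TFDS versions, so check the API reference of the version you use:

```python
split_infos = tfds.folder_dataset.compute_split_info(
    data_dir='/path/to/my/dataset/1.0.0/',  # Directory containing the shards (assumed argument name)
    out_dir='/path/to/my/dataset/1.0.0/',   # Where the computed split info is written
)
```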
To automatically add the proper metadata files alongside your dataset, use `tfds.folder_dataset.write_metadata`:
tfds.folder_dataset.write_metadata(
builder_dir='/path/to/my/dataset/1.0.0/',
features=features,
# Pass the `out_dir` argument of compute_split_info (see section above)
# You can also explicitly pass a list of `tfds.core.SplitInfo`
split_infos='/path/to/my/dataset/1.0.0/',
# Optionally, additional DatasetInfo metadata can be provided
homepage='http://my-project.org',
)
Once the function has been called on your dataset directory, the metadata files (`dataset_info.json`,...) have been added and your dataset is ready to be loaded with TFDS (see the next section).
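At this point, the dataset directory should contain the metadata files next to the original shards, for example (the shard names below are only illustrative):

/path/to/my/dataset/1.0.0/
    dataset_info.json
    features.json
    my_dataset-train.tfrecord-00000-of-00001
    my_dataset-test.tfrecord-00000-of-00001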
Once the metadata has been generated, datasets can be loaded using `tfds.core.builder_from_directory`, which returns a `tfds.core.DatasetBuilder` with the standard TFDS API (like `tfds.builder`):
builder = tfds.core.builder_from_directory('~/path/to/my_dataset/3.0.0/')
# Metadata are available as usual
builder.info.splits['train'].num_examples
# Construct the tf.data.Dataset pipeline
ds = builder.as_dataset(split='train[75%:]')
for ex in ds:
...
For better compatibility with TFDS, you can organize your data as `<data_dir>/<dataset_name>[/<dataset_config>]/<dataset_version>`. For example:
data_dir/
dataset0/
1.0.0/
1.0.1/
dataset1/
config0/
2.0.0/
config1/
2.0.0/
This will make your datasets compatible with the `tfds.load` / `tfds.builder` API, simply by providing `data_dir/`:
ds0 = tfds.load('dataset0', data_dir='data_dir/')
ds1 = tfds.load('dataset1/config0', data_dir='data_dir/')
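Since the layout is the standard one, the same names also work with the builder API, for example:

```python
builder0 = tfds.builder('dataset0', data_dir='data_dir/')
builder1 = tfds.builder('dataset1/config0', data_dir='data_dir/')
```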