Lazy_dataset is a helper for working with large datasets that do not fit into memory. It lets you define transformations that are applied lazily (e.g. a mapping function that reads data from disk). The transformations are only executed when someone iterates over the dataset.
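The idea of a lazily applied transformation can be illustrated with a few lines of plain Python. This is only a minimal sketch, not the library's implementation: the class `LazyMap` and the function `load` are hypothetical names made up for this illustration.

```python
class LazyMap:
    """Hypothetical stand-in for a lazily mapped dataset (not the library's class)."""

    def __init__(self, source, fn):
        self.source = source  # any indexable collection, e.g. a list of file paths
        self.fn = fn          # transformation: stored here, executed later

    def __len__(self):
        return len(self.source)

    def __getitem__(self, index):
        # The transformation runs only for the requested example.
        return self.fn(self.source[index])

    def __iter__(self):
        # Examples are transformed one at a time while iterating.
        for item in self.source:
            yield self.fn(item)


calls = []  # records which paths were actually "loaded"


def load(path):
    # Stands in for an expensive read from disk.
    calls.append(path)
    return {'path': path, 'data': len(path)}


ds = LazyMap(['a.wav', 'bb.wav', 'ccc.wav'], load)
print(calls)  # [] (nothing has been loaded yet)
print(ds[1])  # {'path': 'bb.wav', 'data': 6}
print(calls)  # ['bb.wav']
```

Because only the recipe is stored, memory use stays proportional to the number of examples rather than to the size of the loaded data.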
Supported transformations:
- `dataset.map(map_fn)`: Apply the function `map_fn` to each example (builtins.map).
- `dataset[2]`: Get the example at index `2`.
- `dataset['example_id']`: Get the example that has the example id `'example_id'`.
- `dataset[10:20]`: Get a sub dataset that contains only the examples in the slice 10 to 20.
- `dataset.filter(filter_fn, lazy=True)`: Drops examples where `filter_fn(example)` is false (builtins.filter).
- `dataset.concatenate(*others)`: Concatenates two or more datasets (numpy.concatenate).
- `dataset.intersperse(*others)`: Combine two or more datasets such that examples of each input dataset are evenly spaced (https://stackoverflow.com/a/19293603).
- `dataset.zip(*others)`: Zip two or more datasets.
- `dataset.shuffle(reshuffle=False)`: Shuffles the dataset. When `reshuffle` is `True`, it shuffles again each time you iterate over the data.
- `dataset.tile(reps, shuffle=False)`: Repeats the dataset `reps` times and concatenates it (numpy.tile).
- `dataset.cycle()`: Repeats the dataset endlessly (itertools.cycle but without caching).
- `dataset.groupby(group_fn)`: Groups examples together. In contrast to `itertools.groupby`, a sort is not necessary, like in pandas (itertools.groupby, pandas.DataFrame.groupby).
- `dataset.sort(key_fn, sort_fn=sorted)`: Sorts the examples depending on the values `key_fn(example)` (list.sort).
- `dataset.batch(batch_size, drop_last=False)`: Batches `batch_size` examples together as a list. Usually followed by a map (tensorflow.data.Dataset.batch).
- `dataset.random_choice()`: Get a random example (numpy.random.choice).
- `dataset.cache()`: Cache in RAM (similar to ESPnet's `keep_all_data_on_mem`).
- `dataset.diskcache()`: Cache to a cache directory on the local filesystem (useful in clusters with slow network filesystems).
- ...
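To make the described behaviour of batching and grouping more concrete, here is a plain-Python sketch of the semantics stated in the list above. The functions `batch` and `groupby` below are stand-alone illustrations written for this example; they are not the library's code, and the return types of the real methods may differ (for example, they may return datasets rather than lists).

```python
from collections import defaultdict


def batch(examples, batch_size, drop_last=False):
    """Group consecutive examples into lists of length batch_size."""
    current = []
    for example in examples:
        current.append(example)
        if len(current) == batch_size:
            yield current
            current = []
    # The final, shorter batch is kept unless drop_last is set.
    if current and not drop_last:
        yield current


def groupby(examples, group_fn):
    """Group examples by key; the input does not need to be sorted."""
    groups = defaultdict(list)
    for example in examples:
        groups[group_fn(example)].append(example)
    return dict(groups)


numbers = [1, 2, 3, 4, 5]
print(list(batch(numbers, 2)))                  # [[1, 2], [3, 4], [5]]
print(list(batch(numbers, 2, drop_last=True)))  # [[1, 2], [3, 4]]
print(groupby(numbers, lambda x: x % 2))        # {1: [1, 3, 5], 0: [2, 4]}
```

A batch is just a list of examples, which is why a batch step is usually followed by a map that collates the list into the structure a model expects.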
>>> from IPython.lib.pretty import pprint
>>> import lazy_dataset
>>> examples = {
...     'example_id_1': {
...         'observation': [1, 2, 3],
...         'label': 1,
...     },
...     'example_id_2': {
...         'observation': [4, 5, 6],
...         'label': 2,
...     },
...     'example_id_3': {
...         'observation': [7, 8, 9],
...         'label': 3,
...     },
... }
>>> for example_id, example in examples.items():
...     example['example_id'] = example_id
>>> ds = lazy_dataset.new(examples)
>>> ds
DictDataset(len=3)
MapDataset(_pickle.loads)
>>> ds.keys()
('example_id_1', 'example_id_2', 'example_id_3')
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 1, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 2, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 3, 'example_id': 'example_id_3'}
>>> def transform(example):
...     example['label'] *= 10
...     return example
>>> ds = ds.map(transform)
>>> for example in ds:
...     print(example)
{'observation': [1, 2, 3], 'label': 10, 'example_id': 'example_id_1'}
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds = ds.filter(lambda example: example['label'] > 15)
>>> for example in ds:
...     print(example)
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
{'observation': [7, 8, 9], 'label': 30, 'example_id': 'example_id_3'}
>>> ds['example_id_2']
{'observation': [4, 5, 6], 'label': 20, 'example_id': 'example_id_2'}
>>> ds
DictDataset(len=3)
MapDataset(_pickle.loads)
MapDataset(<function transform at 0x7ff74efb6620>)
FilterDataset(<function <lambda> at 0x7ff74efb67b8>)
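The `MapDataset(_pickle.loads)` entry in the output above suggests that the examples are held in pickled form and unpickled on every access. Under that assumption (it is an interpretation of the repr, not a documented guarantee), a function such as `transform` can modify its argument in place without the change accumulating across iterations. The following plain-Python sketch shows the effect; the variable `stored` is a hypothetical name for this illustration.

```python
import pickle

examples = {'example_id_1': {'label': 1}}

# Assumption: each example is kept as a pickled byte string.
stored = {key: pickle.dumps(value) for key, value in examples.items()}


def transform(example):
    example['label'] *= 10  # in-place modification
    return example


# Every access unpickles a fresh copy, so the result is the same each time.
first = transform(pickle.loads(stored['example_id_1']))
second = transform(pickle.loads(stored['example_id_1']))
print(first)     # {'label': 10}
print(second)    # {'label': 10}
print(examples)  # {'example_id_1': {'label': 1}}
```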
See here for a feature and throughput comparison of lazy_dataset with PyTorch's DataLoader.
If you just want to use it, install it directly with pip:
pip install lazy_dataset
If you want to make changes or need the most recent version, clone the repository and install it in editable mode:
git clone https://github.com/fgnt/lazy_dataset.git
cd lazy_dataset
pip install --editable .