Description of issue
Implementing a `GeneratorBasedBuilder` subclass for a large dataset leads to a large memory overhead when `disable_shuffling = True` and the `_generate_examples` implementation yields a sequence of `int` keys.
The reason is that generated examples are distributed across temporary bucket files, and the `tfds.core.shuffle.get_bucket_number` function, which maps keys to bucket file numbers, partitions the space of all 128-bit integers evenly across the buckets. As a result, every integer in `[0, 2**128 // BUCKETS_NUMBER)` is mapped to the first bucket, and `Shuffler._iter_buckets` then attempts to read and sort all examples in the dataset at once, because they are all stored in a single bucket.
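To make the effect concrete, here is a simplified model of the bucketing scheme described above. The real `get_bucket_number` lives in TFDS internals; the function and the `BUCKETS_NUMBER` constant below are only illustrative stand-ins that mirror the even partitioning of the 128-bit key space:

```python
BUCKETS_NUMBER = 1000  # illustrative; TFDS uses its own constant


def get_bucket_number(key: int, num_buckets: int = BUCKETS_NUMBER) -> int:
    """Map a 128-bit integer key to a bucket by even partitioning."""
    # Each bucket covers a contiguous slice of the 128-bit key space,
    # so bucket i holds keys in [i * 2**128 // num_buckets, ...).
    return key * num_buckets >> 128


# Sequential small keys, as produced by yielding image IDs or text
# file line numbers from _generate_examples, all fall in the first slice:
buckets = {get_bucket_number(key) for key in range(1_000_000)}
print(buckets)  # {0} -- every example lands in bucket 0
```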
One way around this is to space the keys so that they fill the space of 128-bit numbers as evenly as possible, for instance by leaving gaps of size `2 ** 128 // total_num_examples` between consecutive keys.
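The spacing workaround can be sketched as follows, again using the simplified even-partitioning model of the bucket mapping (the function and constants are illustrative, not the actual TFDS implementation):

```python
def get_bucket_number(key: int, num_buckets: int = 1000) -> int:
    # Simplified even partitioning of the 128-bit key space.
    return key * num_buckets >> 128


total_num_examples = 1_000_000
gap = 2 ** 128 // total_num_examples

# Instead of yielding i as the key, yield i * gap, so keys are spread
# across the whole 128-bit range rather than packed near zero:
spaced_keys = (i * gap for i in range(total_num_examples))
buckets = {get_bucket_number(k) for k in spaced_keys}
print(len(buckets))  # 1000 -- examples now cover every bucket
```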
When `disable_shuffling = False` this is not an issue, since the keys are hashed before being mapped to bucket numbers.
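For comparison, a hash step spreads sequential keys roughly uniformly over the key space, so the same even partitioning distributes them across all buckets. The sketch below uses `hashlib.md5` purely as a stand-in for whatever hash TFDS actually applies:

```python
import hashlib


def get_bucket_number(key: int, num_buckets: int = 1000) -> int:
    # Simplified even partitioning of the 128-bit key space.
    return key * num_buckets >> 128


def hash_key(key: int) -> int:
    # Map a key to a 128-bit integer before bucketing, as the
    # shuffling-enabled path does (md5 is only an illustrative hash).
    return int.from_bytes(hashlib.md5(str(key).encode()).digest(), "big")


buckets = {get_bucket_number(hash_key(i)) for i in range(10_000)}
print(len(buckets))  # close to 1000: hashed keys cover the space
```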
This is not a bug per se, but users should be made aware of this implementation detail, especially since the documentation for `GeneratorBasedBuilder` encourages users to use image IDs or text-file line numbers as keys.
Submit a pull request?
I'm happy to submit a PR adding a note to that effect to the `GeneratorBasedBuilder` documentation.