Description of issue
Implementing a `GeneratorBasedBuilder` subclass for a large dataset leads to a large memory overhead when `disable_shuffling = True` and the `_generate_examples` implementation yields a sequence of `int` keys.
The reason is that generated examples are distributed across temporary bucket files, and the `tfds.core.shuffle.get_bucket_number` function, which maps keys to bucket file numbers, partitions the space of all 128-bit integers evenly across the buckets. As a result, every integer in `[0, 2**128 // BUCKETS_NUMBER)` is mapped to the first bucket, and `Shuffler._iter_buckets` then attempts to read and sort all examples in the dataset at once, because they are all stored in a single bucket.
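To make the effect concrete, here is a simplified model of the bucketing scheme described above. The real `get_bucket_number` lives in TFDS internals; the function and the `BUCKETS_NUMBER` constant below are only illustrative stand-ins that mirror the even partitioning of the 128-bit key space:

```python
BUCKETS_NUMBER = 1000  # illustrative; TFDS uses its own constant


def get_bucket_number(key: int, num_buckets: int = BUCKETS_NUMBER) -> int:
    """Map a 128-bit integer key to a bucket by even partitioning."""
    # Each bucket covers a contiguous slice of the 128-bit key space,
    # so bucket i holds keys in [i * 2**128 // num_buckets, ...).
    return key * num_buckets >> 128


# Sequential small keys, as produced by yielding image IDs or text
# file line numbers from _generate_examples, all fall in the first slice:
buckets = {get_bucket_number(key) for key in range(1_000_000)}
print(buckets)  # {0} -- every example lands in bucket 0
```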
One way around this is to space the keys so that they fill the space of 128-bit numbers as evenly as possible, for instance by leaving gaps of size `2 ** 128 // total_num_examples` between consecutive keys.
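The spacing workaround can be sketched as follows, again using the simplified even-partitioning model of the bucket mapping (the function and constants are illustrative, not the actual TFDS implementation):

```python
def get_bucket_number(key: int, num_buckets: int = 1000) -> int:
    # Simplified even partitioning of the 128-bit key space.
    return key * num_buckets >> 128


total_num_examples = 1_000_000
gap = 2 ** 128 // total_num_examples

# Instead of yielding i as the key, yield i * gap, so keys are spread
# across the whole 128-bit range rather than packed near zero:
spaced_keys = (i * gap for i in range(total_num_examples))
buckets = {get_bucket_number(k) for k in spaced_keys}
print(len(buckets))  # 1000 -- examples now cover every bucket
```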
When `disable_shuffling = False` this is not an issue, since the keys are hashed before being mapped to bucket numbers.
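For comparison, a hash step spreads sequential keys roughly uniformly over the key space, so the same even partitioning distributes them across all buckets. The sketch below uses `hashlib.md5` purely as a stand-in for whatever hash TFDS actually applies:

```python
import hashlib


def get_bucket_number(key: int, num_buckets: int = 1000) -> int:
    # Simplified even partitioning of the 128-bit key space.
    return key * num_buckets >> 128


def hash_key(key: int) -> int:
    # Map a key to a 128-bit integer before bucketing, as the
    # shuffling-enabled path does (md5 is only an illustrative hash).
    return int.from_bytes(hashlib.md5(str(key).encode()).digest(), "big")


buckets = {get_bucket_number(hash_key(i)) for i in range(10_000)}
print(len(buckets))  # close to 1000: hashed keys cover the space
```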
This is not a bug per se, but users should be made aware of this implementation detail, especially since the documentation for `GeneratorBasedBuilder` encourages users to use image IDs or text-file line numbers as keys.
Submit a pull request?
I'm happy to submit a PR adding a note to that effect to the `GeneratorBasedBuilder` documentation.