Description of issue
Implementing a GeneratorBasedBuilder subclass for a large dataset leads to a large memory overhead when disable_shuffling = True and the _generate_examples implementation yields small sequential integers (e.g. 0, 1, 2, ...) as keys.
The reason is that generated examples are distributed across temporary bucket files, and the tfds.core.shuffle.get_bucket_number function mapping keys to bucket file numbers partitions the space of all 128-bit integers evenly across the buckets, so every integer in [0, 2**128 // BUCKETS_NUMBER) is mapped to the first bucket. Shuffler._iter_buckets then attempts to read and sort all examples in the dataset at once, because they all live in a single bucket.
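For illustration, the proportional mapping behaves roughly like the sketch below (the formula and bucket count here are assumptions made for the example; the actual logic lives in tfds.core.shuffle.get_bucket_number):

```python
MAX_KEY = 2 ** 128   # keys are treated as 128-bit integers
NUM_BUCKETS = 1000   # illustrative value, not necessarily the TFDS default

def bucket_of(key: int) -> int:
    # Each bucket covers a contiguous, equal-sized slice of the 128-bit key space.
    return key * NUM_BUCKETS // MAX_KEY

# Sequential integer keys (image IDs, line numbers, ...) all land in bucket 0:
print({bucket_of(k) for k in range(1_000_000)})  # {0}
```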
One way around this is to space the keys so that they fill the 128-bit key space as evenly as possible, for instance by leaving gaps of 2 ** 128 // total_num_examples between consecutive keys (see the sketch below).
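Concretely, the keys yielded from _generate_examples could be spread out as in this minimal sketch (the _spaced_keys helper and the feature dict are hypothetical, only the key spacing matters):

```python
def _spaced_keys(records):
    """Yield (key, example) pairs with keys spread evenly over the 128-bit range."""
    spacing = 2 ** 128 // len(records)  # gap between consecutive keys
    for i, record in enumerate(records):
        # Spacing the keys means each temporary bucket file receives
        # roughly the same number of examples, even with disable_shuffling = True.
        yield i * spacing, {"text": record}

# Inside _generate_examples this would be used as:
#   yield from _spaced_keys(records)
print(list(_spaced_keys(["a", "b", "c", "d"]))[:2])
```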
When disable_shuffling = False this is not an issue, since the keys are first hashed before being mapped to bucket numbers.
This is not a bug per se, but users should be made aware of this implementation detail, especially since the documentation for GeneratorBasedBuilder encourages users to use image IDs or text file line numbers for keys.
Submit a pull request?
I'm happy to submit a PR adding a note to that effect in the GeneratorBasedBuilder documentation.