Unexpected memory overhead for GeneratorBasedBuilder when disable_shuffling = True #3491

Open
@vdumoulin

Description of issue

Implementing a GeneratorBasedBuilder subclass for a large dataset leads to a large memory overhead when disable_shuffling = True and the _generate_examples implementation yields small consecutive int values (such as 0, 1, 2, ...) as keys.

The reason is that generated examples are distributed across temporary bucket files, and the tfds.core.shuffle.get_bucket_number function, which maps keys to bucket file numbers, partitions the space of all 128-bit integers evenly across the buckets. As a result, all integers in [0, 2**128 // BUCKETS_NUMBER] are mapped to the first bucket, and any realistic range of line numbers or image IDs falls entirely within that interval. Shuffler._iter_buckets then attempts to read and sort all examples in the dataset at once, because they are all stored in a single bucket.
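To illustrate the effect, here is a minimal sketch of an even partition of the 128-bit key space. It is not the TFDS implementation: bucket_for_key is a hypothetical stand-in for the behavior described above, and the value of BUCKETS_NUMBER is an assumption chosen for illustration.

```python
# Sketch only: mimics an even partition of the 128-bit key space into buckets.
# `bucket_for_key` is a hypothetical helper, not the TFDS function itself.
HKEY_SIZE = 128        # keys are treated as 128-bit integers
BUCKETS_NUMBER = 1000  # assumed value, for illustration only


def bucket_for_key(key: int, buckets_number: int = BUCKETS_NUMBER) -> int:
    """Returns floor(key * buckets_number / 2**128)."""
    return (key * buckets_number) >> HKEY_SIZE


# Keys such as line numbers or image IDs: 0, 1, 2, ...
sequential_keys = range(1_000_000)
buckets_used = {bucket_for_key(k) for k in sequential_keys}
print(buckets_used)  # every key lands in bucket 0
```

Under these assumptions, a million consecutive keys all end up in the first bucket, so that bucket holds the entire dataset.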

One way around this is to space the keys so that they cover the range of 128-bit numbers as evenly as possible, for instance by leaving gaps of size 2**128 // total_num_examples between consecutive keys.
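The workaround above could look roughly like the following sketch. The helper names spaced_key and bucket_for_key, the bucket count, and the dataset size are all assumptions made for the example; the bucket mapping is the same hypothetical even partition as described earlier, not code taken from TFDS.

```python
from collections import Counter

HKEY_SIZE = 128        # keys are treated as 128-bit integers
BUCKETS_NUMBER = 1000  # assumed value, for illustration only


def bucket_for_key(key: int, buckets_number: int = BUCKETS_NUMBER) -> int:
    """Hypothetical even partition: floor(key * buckets_number / 2**128)."""
    return (key * buckets_number) >> HKEY_SIZE


def spaced_key(index: int, total_num_examples: int) -> int:
    """Maps example index 0..total-1 to a key, preserving order."""
    gap = 2**128 // total_num_examples
    return index * gap


total_num_examples = 10_000  # assumed dataset size
keys = [spaced_key(i, total_num_examples) for i in range(total_num_examples)]
per_bucket = Counter(bucket_for_key(k) for k in keys)
print(len(per_bucket), max(per_bucket.values()))
```

Because the spacing is monotonic, the original example order is preserved while the examples are spread over all buckets. In _generate_examples, this would mean yielding spaced_key(i, total_num_examples) instead of i, which requires knowing the total number of examples in advance.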

When disable_shuffling = False this is not an issue, since the keys are first hashed before being mapped to bucket numbers.
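The effect of hashing can be sketched as follows. Note that hashlib.blake2b is used here only as a stand-in 128-bit hash; it is not the hash function TFDS uses, and the bucket count and the bucket_for_key helper are assumptions as well.

```python
import hashlib

HKEY_SIZE = 128        # keys are treated as 128-bit integers
BUCKETS_NUMBER = 1000  # assumed value, for illustration only


def bucket_for_key(key: int, buckets_number: int = BUCKETS_NUMBER) -> int:
    """Hypothetical even partition: floor(key * buckets_number / 2**128)."""
    return (key * buckets_number) >> HKEY_SIZE


def hashed_key(key: int) -> int:
    """Stand-in 128-bit hash of an int key (not the hash TFDS uses)."""
    digest = hashlib.blake2b(str(key).encode("utf-8"), digest_size=16).digest()
    return int.from_bytes(digest, "big")


hashed_buckets = {bucket_for_key(hashed_key(k)) for k in range(100_000)}
print(len(hashed_buckets))  # hashed keys spread across many buckets
```

Hashing scatters consecutive keys over the whole 128-bit range, which is why the single-bucket problem does not appear when shuffling is enabled.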

This is not a bug per se, but users should be made aware of this implementation detail, especially since the documentation for GeneratorBasedBuilder encourages users to use image IDs or text file line numbers for keys.

Submit a pull request?

I'm happy to submit a PR adding a note to that effect in the GeneratorBasedBuilder documentation.
