[Contributors Wanted] A split_dataset utility #16394

@fchollet

It would be neat to have a utility to split datasets, somewhat similar to this utility in sklearn.

Example:

train_ds, val_ds = keras.utils.split_dataset(dataset, left_size=0.8, right_size=0.2)

Draft docstring (missing code examples, etc):

def split_dataset(dataset, left_size=None, right_size=None, shuffle=False, seed=None):
    """Split a dataset into a left half and a right half (e.g. training / validation).
    
    Args:
        dataset: A `tf.data.Dataset` object or a list/tuple of arrays with the same length.
        left_size: If float, it should be in the range `[0, 1]` and signifies the fraction of the
            data to pack in the left dataset. If integer, it signifies the number of samples
            to pack in the left dataset. If `None`, it defaults to the complement of `right_size`.
        right_size: If float, it should be in the range `[0, 1]` and signifies the fraction of the
            data to pack in the right dataset. If integer, it signifies the number of samples
            to pack in the right dataset. If `None`, it defaults to the complement of `left_size`.
        shuffle: Boolean, whether to shuffle the data before splitting it.
        seed: A random seed for shuffling.

    Returns:
        A tuple of two `tf.data.Dataset` objects: the left and right splits.
    """

Notes:

  • When given a Dataset, the utility would first iterate over it and put the samples in a list, then split the list and create one dataset from each side of the split. If iterating over the dataset takes more than 10s (measured continuously while iterating), a warning should be printed saying that the utility is only meant for small datasets that fit in memory.
  • Shuffling is optionally done before splitting (on the list / arrays). Not sure if we should also apply shuffle() to the returned datasets.
  • Prefetching should be auto-tuned on the returned datasets
  • At least one of left_size, right_size should be specified.
  • If both are specified, we should check that they are complementary. If not, that's an error.
  • Feel free to suggest changes / additions to the API!
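
The iterate / warn / shuffle / split steps from the notes above could look roughly like this. It is a sketch on plain Python iterables so that it stays dependency-free; `collect_samples`, `split_samples`, and the `time_limit` parameter are hypothetical names, and `left_count` is assumed to already be an integer number of samples. In the real utility each returned list would presumably be wrapped in a `tf.data.Dataset` with prefetching set to `tf.data.AUTOTUNE`, which is not shown here.

```python
import random
import time
import warnings


def collect_samples(iterable, time_limit=10.0):
    """Materialize `iterable` into a list, warning once if it is slow."""
    samples = []
    start = time.monotonic()
    warned = False
    for sample in iterable:
        samples.append(sample)
        # Elapsed time is checked continuously while iterating,
        # but the warning is emitted at most once.
        if not warned and time.monotonic() - start > time_limit:
            warnings.warn(
                "Iterating over the dataset is taking a long time. This "
                "utility is only meant for small datasets that fit in memory.",
                stacklevel=2,
            )
            warned = True
    return samples


def split_samples(iterable, left_count, shuffle=False, seed=None):
    """Split the materialized samples into (left, right) lists."""
    samples = collect_samples(iterable)
    if shuffle:
        # Shuffle the in-memory list before splitting; a dedicated
        # Random instance keeps the result reproducible for a given seed.
        random.Random(seed).shuffle(samples)
    return samples[:left_count], samples[left_count:]


if __name__ == "__main__":
    left, right = split_samples(range(10), left_count=8, shuffle=True, seed=0)
    print(len(left), len(right))
```

Shuffling the list before slicing keeps the two sides disjoint. Whether the returned datasets should also be shuffled on each iteration is the open question raised in the notes and is left out of this sketch.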

Interested in contributing it? Please open a PR or comment here for questions / suggestions!

Labels: type:feature (The user is asking for a new feature.)
