[Contributors Wanted] A `split_dataset` utility

It would be neat to have a utility to split datasets, somewhat similar to [this utility in sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).

Example:

```python
train_ds, val_ds = keras.utils.split_dataset(dataset, left_size=0.8, right_size=0.2)
```

Draft docstring (missing code examples, etc):

```python
def split_dataset(dataset, left_size=None, right_size=None, shuffle=False, seed=None):
    """Split a dataset into a left part and a right part (e.g. training / validation).
    
    Args:
        dataset: A `tf.data.Dataset` object or a list/tuple of arrays with the same length.
        left_size: If float, it should be in the range `[0, 1]` and signifies the fraction of the
            data to pack in the left dataset. If integer, it signifies the number of samples
            to pack in the left dataset. If `None`, it defaults to the complement of `right_size`.
        right_size: If float, it should be in the range `[0, 1]` and signifies the fraction of the
            data to pack in the right dataset. If integer, it signifies the number of samples
            to pack in the right dataset. If `None`, it defaults to the complement of `left_size`.
        shuffle: Boolean, whether to shuffle the data before splitting it.
        seed: A random seed for shuffling.

    Returns:
        A tuple of two `tf.data.Dataset` objects: the left and right splits.
    """
```
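
To make the `left_size` / `right_size` semantics concrete, here is a rough sketch of how the two arguments could be turned into sample counts. The helper names (`resolve_split_sizes`, `_size_to_count`) and the rounding policy (round the left count, give the remainder to the right) are assumptions for illustration, not part of the proposed API.

```python
import math


def _size_to_count(size, n_samples, name):
    """Convert a float fraction or an integer count to a (possibly fractional) count."""
    if isinstance(size, bool) or not isinstance(size, (int, float)):
        raise TypeError(f"`{name}` must be an int, a float, or None. Got: {size!r}")
    if isinstance(size, float):
        if not 0.0 <= size <= 1.0:
            raise ValueError(f"Float `{name}` must be in the range [0, 1]. Got: {size}")
        return size * n_samples
    if not 0 <= size <= n_samples:
        raise ValueError(f"Integer `{name}` must be in [0, {n_samples}]. Got: {size}")
    return size


def resolve_split_sizes(n_samples, left_size=None, right_size=None):
    """Return `(left_count, right_count)` for a dataset of `n_samples` samples."""
    if left_size is None and right_size is None:
        raise ValueError("At least one of `left_size`, `right_size` must be specified.")
    if left_size is None:
        # Only the right size is known: the left side is its complement.
        right_count = round(_size_to_count(right_size, n_samples, "right_size"))
        return n_samples - right_count, right_count
    left_exact = _size_to_count(left_size, n_samples, "left_size")
    if right_size is not None:
        # Both sizes given: they must add up to the whole dataset.
        right_exact = _size_to_count(right_size, n_samples, "right_size")
        if not math.isclose(left_exact + right_exact, n_samples, abs_tol=1e-6):
            raise ValueError("`left_size` and `right_size` are not complementary.")
    # Assumed policy: round the left count, the right side takes the rest.
    left_count = round(left_exact)
    return left_count, n_samples - left_count
```

With 10 samples, `resolve_split_sizes(10, left_size=0.8, right_size=0.2)` gives `(8, 2)`, and `resolve_split_sizes(10, left_size=3)` gives `(3, 7)`.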

Notes:

- When processing a `Dataset`, the utility would first iterate over the dataset and collect the samples in a list, then split the list and create one dataset from each side of the split. If iterating over the dataset takes more than 10s (measured continuously while iterating), a warning should be printed stating that the utility is only meant for small datasets that fit in memory.
- Shuffling is optionally done before splitting (on the list / arrays). Not sure if we should also apply `shuffle()` to the returned datasets.
- Prefetching should be auto-tuned on the returned datasets
- At least one of `left_size`, `right_size` should be specified.
- If both are specified, we should check that they are complementary. If not, that's an error.
- Feel free to suggest changes / additions to the API!
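
The materialize / shuffle / split steps from the notes above could look roughly like the sketch below. It uses plain Python iterables so it runs without TensorFlow; the function names (`materialize`, `split_samples`), the `time_limit_s` parameter, and the warning text are hypothetical, and the final step of wrapping each list back into a dataset is left out.

```python
import random
import time
import warnings


def materialize(dataset, time_limit_s=10.0):
    """Collect every sample of an iterable into a list, warning once if it is slow."""
    samples = []
    warned = False
    start = time.monotonic()
    for sample in dataset:
        samples.append(sample)
        # The elapsed time is checked continuously while iterating.
        if not warned and time.monotonic() - start > time_limit_s:
            warnings.warn(
                f"Iterating over the dataset took more than {time_limit_s}s. "
                "This utility is only meant for small datasets that fit in memory.",
                stacklevel=2,
            )
            warned = True
    return samples


def split_samples(samples, left_count, shuffle=False, seed=None):
    """Optionally shuffle a copy of `samples`, then cut it after `left_count` items."""
    samples = list(samples)  # copy so the caller's list is not reordered
    if shuffle:
        random.Random(seed).shuffle(samples)
    return samples[:left_count], samples[left_count:]


left, right = split_samples(materialize(range(10)), left_count=8)
```

In an actual implementation, each side would presumably be turned back into a dataset, for instance with `tf.data.Dataset.from_tensor_slices(...)` followed by `.prefetch(tf.data.AUTOTUNE)`, which would cover the auto-tuned prefetching note.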

Interested in contributing it? Please open a PR or comment here for questions / suggestions!

[Contributors Wanted] A `split_dataset` utility #16394
