Closed
Labels
type:feature: The user is asking for a new feature.
Description
It would be neat to have a utility to split datasets, somewhat similar to this utility in sklearn.
Example:
train_ds, val_ds = keras.utils.split_dataset(dataset, left_size=0.8, right_size=0.2)
Draft docstring (missing code examples, etc):
def split_dataset(dataset, left_size=None, right_size=None, shuffle=False, seed=None):
    """Split a dataset into a left half and a right half (e.g. training / validation).

    Args:
        dataset: A `tf.data.Dataset` object or a list/tuple of arrays with the same length.
        left_size: If float, it should be in the range `[0, 1]` and signifies the fraction of the
            data to pack in the left dataset. If integer, it signifies the number of samples
            to pack in the left dataset. If `None`, it defaults to the complement of `right_size`.
        right_size: If float, it should be in the range `[0, 1]` and signifies the fraction of the
            data to pack in the right dataset. If integer, it signifies the number of samples
            to pack in the right dataset. If `None`, it defaults to the complement of `left_size`.
        shuffle: Boolean, whether to shuffle the data before splitting it.
        seed: A random seed for shuffling.

    Returns:
        A tuple of two `tf.data.Dataset` objects: the left and right splits.
    """
Notes:
- When processing a Dataset, it would first iterate over the dataset, put the samples in a list, then split the list and create two datasets from each side of the split list. If iterating over the dataset takes more than 10s (computed continuously while iterating), a warning should be printed that the utility is only meant for small datasets that fit in memory.
- Shuffling is done optionally before splitting (on the list / arrays). Not sure if we should apply shuffle() to the returned datasets.
- Prefetching should be auto-tuned on the returned datasets.
- At least one of left_size, right_size should be specified.
- If both are specified, we should check that they are complementary. If not, that's an error.
- Feel free to suggest changes / additions to the API!
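To make the notes above concrete, here is a rough sketch of the size checks and the list-based split. It is only an illustration under assumptions, not a proposed implementation: `resolve_split_sizes`, `split_samples` and the `slow_after` parameter are hypothetical names, it works on plain Python iterables rather than `tf.data.Dataset` objects, it rounds fractional sizes to the nearest sample count (the draft does not say how to round), and it leaves out wrapping the two lists back into datasets and prefetching.

```python
import random
import time
import warnings


def _to_count(size, total):
    # Floats are fractions of the data; integers are absolute sample counts.
    if size is None:
        return None
    if isinstance(size, float):
        if not 0.0 <= size <= 1.0:
            raise ValueError(f"Fractional size must be in [0, 1], got {size}.")
        return int(round(size * total))
    if not 0 <= size <= total:
        raise ValueError(f"Integer size must be in [0, {total}], got {size}.")
    return int(size)


def resolve_split_sizes(total, left_size=None, right_size=None):
    """Hypothetical helper: turn the two size arguments into sample counts."""
    if left_size is None and right_size is None:
        raise ValueError("At least one of left_size, right_size must be specified.")
    both_fractions = isinstance(left_size, float) and isinstance(right_size, float)
    if both_fractions and abs(left_size + right_size - 1.0) > 1e-9:
        raise ValueError("left_size and right_size must be complementary.")
    left = _to_count(left_size, total)
    right = _to_count(right_size, total)
    if left is None:
        left = total - right
    elif right is None or both_fractions:
        # Derive the right side from the left so rounding cannot drop a sample.
        right = total - left
    elif left + right != total:
        raise ValueError("left_size and right_size must be complementary.")
    return left, right


def split_samples(samples, left_size=None, right_size=None,
                  shuffle=False, seed=None, slow_after=10.0):
    """Hypothetical sketch: materialise, optionally shuffle, then split."""
    start = time.monotonic()
    warned = False
    materialised = []
    for sample in samples:
        materialised.append(sample)
        # Elapsed time is checked continuously while iterating.
        if not warned and time.monotonic() - start > slow_after:
            warnings.warn(
                "Iterating over the data is slow; this utility is only meant "
                "for small datasets that fit in memory."
            )
            warned = True
    if shuffle:
        # Shuffle the in-memory list before splitting, seeded for reproducibility.
        random.Random(seed).shuffle(materialised)
    left, _ = resolve_split_sizes(len(materialised), left_size, right_size)
    return materialised[:left], materialised[left:]
```

For example, `split_samples(range(10), left_size=0.8, right_size=0.2)` puts the first eight items on the left and the last two on the right, while `left_size=0.8, right_size=0.3` raises a `ValueError` because the sizes are not complementary.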
Interested in contributing it? Please open a PR or comment here for questions / suggestions!