Closed
🧐 Problem Description
We have a generic sampling configuration, `SamplingConfig`, that is set by the data and propagated to all the datasets. But some entries make sense to define at the dataset level, for example the `seed`, `sequence_length` (e.g. override with a larger one for sampling consistency), shuffling (e.g. use different strategies when blending a tiny dataset sampled for several epochs with a big one sampled for less than one epoch), etc.
💡 Proposed Solution
We'll want to add an optional override at the dataset level for a subset of `GPTSamplingConfig`.
Two options:
- Add an explicit `sampled` dataset wrapper that samples an indexed dataset with additional overridden arguments from `GPTSamplingConfig`. This works but is a bit verbose.
- Add a `sampling` field to all indexed datasets. More elegant, but needs extra care to deal with further wrapping (e.g. if concatenating multiple datasets with different sampling strategies). I tend to prefer this one.
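
To make the trade-off concrete, here is a rough sketch of what the two config shapes could look like. All class and field names below (`SamplingOverride`, `IndexedConfig`, `SampledWrapperConfig`, `IndexedWithSamplingConfig`, `path`, `shuffle`) are hypothetical and only illustrate the shape of each option; they are not the existing API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SamplingOverride:
    """Hypothetical subset of sampling fields a dataset may override.

    None means "inherit the globally propagated value".
    """
    seed: Optional[int] = None
    sequence_length: Optional[int] = None
    shuffle: Optional[str] = None  # strategy names are placeholders


# Option 1: a plain indexed dataset plus an explicit `sampled` wrapper.
@dataclass
class IndexedConfig:
    path: str


@dataclass
class SampledWrapperConfig:
    dataset: IndexedConfig
    sampling: SamplingOverride = field(default_factory=SamplingOverride)


# Option 2: every indexed dataset carries an optional `sampling` field.
@dataclass
class IndexedWithSamplingConfig:
    path: str
    sampling: SamplingOverride = field(default_factory=SamplingOverride)


# The same override expressed both ways.
option_1 = SampledWrapperConfig(
    dataset=IndexedConfig(path="tiny_dataset"),
    sampling=SamplingOverride(seed=1234),
)
option_2 = IndexedWithSamplingConfig(
    path="tiny_dataset",
    sampling=SamplingOverride(seed=1234),
)
```

Option 1 needs an extra nesting level for every overridden dataset, which is the verbosity mentioned above. Option 2 is flatter, but any wrapper placed around the dataset has to decide what to do with the inner `sampling` field.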
I tend to prefer option 2, where the override uses only the subset of `GPTSamplingConfig` that makes sense to set at the dataset level, e.g. `seed`, shuffling, `sequence_length` (override with a larger one for sampling consistency).
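
One possible way to resolve such an override against the propagated config is sketched below, assuming unset fields are represented as `None` and fall back to the global value. `GlobalSampling`, `DatasetOverride`, `resolve_sampling`, and the `"full"` shuffle value are hypothetical stand-ins for illustration, not the actual `GPTSamplingConfig` definition.

```python
from dataclasses import dataclass, fields, replace
from typing import Optional


@dataclass(frozen=True)
class GlobalSampling:
    """Hypothetical stand-in for the globally propagated sampling config."""
    seed: int
    sequence_length: int
    shuffle: str


@dataclass(frozen=True)
class DatasetOverride:
    """Hypothetical per-dataset override; None means "inherit"."""
    seed: Optional[int] = None
    sequence_length: Optional[int] = None
    shuffle: Optional[str] = None


def resolve_sampling(base: GlobalSampling, override: DatasetOverride) -> GlobalSampling:
    """Return a copy of `base` with every non-None override field applied."""
    changes = {
        f.name: getattr(override, f.name)
        for f in fields(override)
        if getattr(override, f.name) is not None
    }
    return replace(base, **changes)


base = GlobalSampling(seed=0, sequence_length=2048, shuffle="full")
# This dataset only asks for a larger sequence length.
resolved = resolve_sampling(base, DatasetOverride(sequence_length=4096))
```

Keeping the override as a separate, smaller type means a dataset cannot change fields that only make sense globally, and the base config is never mutated.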