[feat] Option to configure sampling independently for each dataset #131

Closed
@jlamypoirier

🧐 Problem Description

We have a generic sampling configuration, SamplingConfig, that is set at the data level and propagated to all the datasets. But some entries make sense to define at the dataset level, for example the seed, sequence_length (e.g. override with a larger one for sampling consistency), shuffling (e.g. use different strategies when blending a tiny dataset sampled for several epochs with a big one sampled for < 1 epoch), etc.

💡 Proposed Solution

We'll want to add an optional override at the dataset level for a subset of GPTSamplingConfig.

Two options:

  1. Add an explicit sampled dataset wrapper that samples an indexed dataset with additional overridden arguments from GPTSamplingConfig. This works but is a bit verbose.
  2. Add a sampling field to all indexed datasets. More elegant, but needs extra care to deal with further wrapping (ex. if concatenating multiple datasets with different sampling strategies). I tend to prefer this one.

With option 2, the sampling field would use only the subset of GPTSamplingConfig that makes sense to override at the dataset level, e.g. seed, shuffling, sequence_length (override with a larger one for sampling consistency).
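
To make option 2 more concrete, here is a minimal sketch of how a dataset-level override could be merged with the global sampling configuration. It is an illustration only, not the actual implementation: the class names `SamplingConfig` and `SamplingOverride`, the `apply` method, and the shuffle values `"full"` and `"epoch"` are hypothetical stand-ins. Only the field names `seed`, `sequence_length` and shuffling come from the description above.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class SamplingConfig:
    # Hypothetical stand-in for the global sampling config set at the data level.
    seed: int = 0
    sequence_length: int = 2048
    shuffle: str = "full"  # assumed strategy name, for illustration only


@dataclasses.dataclass(frozen=True)
class SamplingOverride:
    # Hypothetical dataset-level subset. None means "inherit the global value".
    seed: Optional[int] = None
    sequence_length: Optional[int] = None
    shuffle: Optional[str] = None

    def apply(self, base: SamplingConfig) -> SamplingConfig:
        # Keep only the fields that were explicitly set on this dataset.
        updates = {
            field.name: getattr(self, field.name)
            for field in dataclasses.fields(self)
            if getattr(self, field.name) is not None
        }
        # Return a new config; the global one is left untouched.
        return dataclasses.replace(base, **updates)


# Global config propagated to every dataset.
global_sampling = SamplingConfig(seed=1234, sequence_length=2048, shuffle="full")

# A tiny dataset blended for several epochs: its own seed and shuffling strategy.
tiny_sampling = SamplingOverride(seed=7, shuffle="epoch").apply(global_sampling)

# A big dataset with no override simply inherits the global config.
big_sampling = SamplingOverride().apply(global_sampling)
```

Using `None` as the "inherit" marker keeps the per-dataset config short when nothing is overridden. How overrides should combine when datasets are wrapped further (e.g. concatenated) is left open here, as in the description above.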

Labels

enhancement (New feature or request)