[feat] Option to configure sampling independently for each dataset #131

# 🧐 Problem Description

We have a generic sampling configuration, `SamplingConfig`, that is set by the data and propagated to all the datasets. But some entries make more sense to define at the dataset level, for example the `seed`, the `sequence_length` (e.g. override with a larger one for sampling consistency), or shuffling (e.g. use different strategies when blending a tiny dataset repeated over several epochs with a big one seen for less than one epoch).


# 💡 Proposed Solution

We'll want to add an optional override at the dataset level for a subset of `GPTSamplingConfig`.

Two options:
1. Add an explicit `sampled` dataset wrapper that samples an indexed dataset with additional overridden arguments from `GPTSamplingConfig`. This works but is a bit verbose.
2. Add a `sampling` field to all indexed datasets. More elegant, but needs extra care to deal with further wrapping (e.g. when concatenating multiple datasets with different sampling strategies).

I tend to prefer 2, where the `sampling` field uses a subset of `GPTSamplingConfig` that makes sense to override at the dataset level, e.g. `seed`, shuffling, `sequence_length` (override with a larger one for sampling consistency).
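To make option 2 more concrete, here is a rough sketch of how a per-dataset `sampling` override could be merged with the global sampling configuration. It is only an illustration, not the actual implementation: the class names `SamplingOverride` and `IndexedDatasetConfig`, the `resolve_sampling` helper, the `shuffle` and `num_samples` fields, and the shuffle strategy strings are all hypothetical. Only `seed` and `sequence_length` come from the description above.

```python
from dataclasses import dataclass, field, fields, replace
from typing import Optional


@dataclass(frozen=True)
class SamplingConfig:
    """Global sampling config, set once and propagated to all datasets.

    Stand-in for the real config; field names other than `seed` and
    `sequence_length` are assumptions made for this sketch.
    """

    seed: int = 0
    sequence_length: int = 2048
    shuffle: str = "full"  # hypothetical strategy name
    num_samples: int = 1000  # not overridable per dataset in this sketch


@dataclass(frozen=True)
class SamplingOverride:
    """Subset of the sampling config that a dataset may override.

    `None` means "inherit the global value".
    """

    seed: Optional[int] = None
    sequence_length: Optional[int] = None
    shuffle: Optional[str] = None

    def apply(self, base: SamplingConfig) -> SamplingConfig:
        # Keep only the fields that were explicitly set on the override.
        updates = {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }
        return replace(base, **updates)


@dataclass(frozen=True)
class IndexedDatasetConfig:
    """Hypothetical indexed dataset config carrying a `sampling` field."""

    path: str
    sampling: SamplingOverride = field(default_factory=SamplingOverride)


def resolve_sampling(
    dataset: IndexedDatasetConfig, base: SamplingConfig
) -> SamplingConfig:
    """Return the effective sampling config for one dataset."""
    return dataset.sampling.apply(base)


# Blend a tiny dataset (own seed and shuffling) with a big one (defaults).
base = SamplingConfig(seed=1234, sequence_length=2048)
tiny = IndexedDatasetConfig(
    path="tiny", sampling=SamplingOverride(seed=7, shuffle="epoch")
)
big = IndexedDatasetConfig(path="big")

tiny_sampling = resolve_sampling(tiny, base)
big_sampling = resolve_sampling(big, base)
```

With this shape, a dataset that sets no `sampling` field behaves exactly as it does today. The open question from option 2 remains: a wrapper that concatenates or blends datasets would still need a rule for what to do when its children resolve to different sampling configs.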
