Improve dataset sampling #138
Merged
✨ Description
Fixes: #130, #131, #132
- SamplingConfig, with a default in Data that can be overridden by a sampled wrapper on any (sampled) dataset, e.g. at the outermost dataset to configure a phase, or on a specific dataset to configure it independently. This is a bit more complicated than I would like, but it should be less annoying to work with than the alternatives.
- world_size ([feat] Speed up dataset sampling #132)
- shuffle_idx (...) since it's redundant with document shuffling.

Test sampling goes from 84s to 19s, a >4x improvement (see #132). With the dataset cache it goes to 26s (not sure about before, probably proportional). This is for single-GPU with default shuffling (the worst case); the improvement is a lot larger with parallel loading or reduced shuffling.
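To illustrate the default-plus-override idea behind the sampling config, here is a minimal Python sketch. It is not the code from this PR: the class names (`SamplingSettings`, `LeafDataset`, `SampledOverride`, `BlendedDataset`) and the fields (`seed`, `shuffle`) are hypothetical, chosen only to show how a default set at the top level could be replaced by a wrapper placed anywhere in a dataset tree.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class SamplingSettings:
    # Hypothetical fields; the real config may expose different options.
    seed: int = 0
    shuffle: bool = True


@dataclass
class LeafDataset:
    name: str

    def resolve(self, sampling: SamplingSettings) -> dict:
        # A leaf uses whatever settings reach it from above.
        return {self.name: sampling}


@dataclass
class SampledOverride:
    # Wrapper that overrides some settings for everything beneath it.
    dataset: object
    overrides: dict = field(default_factory=dict)

    def resolve(self, sampling: SamplingSettings) -> dict:
        return self.dataset.resolve(replace(sampling, **self.overrides))


@dataclass
class BlendedDataset:
    datasets: list

    def resolve(self, sampling: SamplingSettings) -> dict:
        resolved = {}
        for dataset in self.datasets:
            resolved.update(dataset.resolve(sampling))
        return resolved


# Default set once at the top level, overridden for dataset "b" only.
default = SamplingSettings(seed=1, shuffle=True)
tree = BlendedDataset(
    [LeafDataset("a"), SampledOverride(LeafDataset("b"), {"shuffle": False})]
)
resolved = tree.resolve(default)
```

In this sketch, wrapping the outermost dataset changes the settings for a whole phase, while wrapping a single leaf changes them for that dataset alone; fields that are not overridden keep the default value.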
🔍 Type of change
Select all that apply: