
Improve dataset sampling #138


Merged: 17 commits merged into main on Feb 12, 2025
Conversation

@jlamypoirier (Collaborator) commented on Jan 31, 2025

✨ Description

Fixes: #130, #131, #132

  • Add new sampling options. These allow shuffling each epoch separately for consistent resampling ([feat] Consistent dataset re-sampling #130), and skipping shuffling for the first epoch or disabling it entirely for faster pre-sampling ([feat] Speed up dataset sampling #132).
  • Allow configuring sampling (e.g. seed, shuffling) separately for each phase and/or dataset ([feat] Option to configure sampling independently for each datasets #131). This takes the form of a SamplingConfig with a default in Data that can be overridden by a sampled wrapper on any (sampled) dataset, e.g. on the outermost dataset to configure a whole phase, or on a specific dataset to configure it independently (a rough sketch follows this list). This is a bit more complicated than I would like, but it should be less annoying to work with than the alternatives.
  • Distribute pre-sampling across all devices, for a speedup of up to world_size ([feat] Speed up dataset sampling #132)
  • New, much faster dataset pre-sampling method ([feat] Speed up dataset sampling #132):
    • Remove sample shuffling (shuffle_idx) since it's redundant with document shuffling.
    • Skip trivial document index for non-shuffled epochs.
    • Run pre-sampling with torch on the GPU, which makes operations such as shuffling ~100x faster.
    • Instead of pre-computing the sample index with a loop, compute it on the fly from pre-computed cumsums (see the second sketch at the end of this description). The cumsums are much faster to evaluate and usually take far less disk space, because only a fraction of them needs to be computed.
  • Add tests for the new cumsum method that use the old approach as a reference.
  • Add backward compatibility for the legacy format that uses the old approach. Legacy sampling is preserved, but the new format (since Modular dataset configuration #104) is changed, which should be ok since it's not really in use yet.
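
To illustrate the override mechanism, here is a minimal sketch of the layering: a data-level default SamplingConfig that a per-dataset wrapper can override. All class and field names below are assumptions for illustration, not the actual Fast-LLM schema.

```python
# Minimal sketch only: class and field names are illustrative assumptions.
import dataclasses
import typing


@dataclasses.dataclass
class SamplingConfig:
    seed: int = 0
    shuffle: str = "epoch"  # hypothetical values: "epoch", "skip_first_epoch", "disabled"


@dataclasses.dataclass
class SampledDatasetConfig:
    dataset: typing.Any
    # When set, overrides the data-level default for this dataset (or, on the
    # outermost dataset of a phase, for the whole phase).
    sampling: SamplingConfig | None = None


@dataclasses.dataclass
class DataConfig:
    datasets: dict[str, SampledDatasetConfig]
    sampling: SamplingConfig = dataclasses.field(default_factory=SamplingConfig)

    def get_sampling(self, name: str) -> SamplingConfig:
        # A per-dataset override takes precedence over the data-level default.
        return self.datasets[name].sampling or self.sampling
```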

Test sampling goes from 84s to 19s, a >4x improvement (see #132). With the dataset cache it goes to 26s (not sure about the previous figure, but it was probably proportional). And this is for a single GPU with default shuffling (the worst case); the improvement is much larger with parallel loading or reduced shuffling.
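
For concreteness, here is a rough sketch of the cumsum-based on-the-fly lookup described above. It is not the PR's actual code; the names, signatures, and exact shuffling scheme are assumptions. The idea is to shuffle documents once, keep only the cumulative token counts, and binary-search them per sample instead of materializing a full sample index.

```python
# Rough sketch only: names and structure are illustrative, not the PR's implementation.
import torch


class CumsumSampleLookup:
    def __init__(self, document_lengths: torch.Tensor, sequence_length: int, seed: int = 0):
        # Shuffle the documents once (on GPU if the lengths live there),
        # then keep only the cumulative token counts.
        generator = torch.Generator(device=document_lengths.device).manual_seed(seed)
        self._document_order = torch.randperm(
            len(document_lengths), generator=generator, device=document_lengths.device
        )
        self._cumsum = document_lengths[self._document_order].cumsum(0)
        self._sequence_length = sequence_length
        self.num_samples = int(self._cumsum[-1]) // sequence_length

    def __getitem__(self, index: int) -> tuple[torch.Tensor, int]:
        # Token span covered by this sample.
        begin = index * self._sequence_length
        end = begin + self._sequence_length
        # Binary-search the cumsum for the first and last documents of the span.
        first = int(torch.searchsorted(self._cumsum, begin, right=True))
        last = int(torch.searchsorted(self._cumsum, end - 1, right=True))
        # Offset of the sample's first token inside its first document.
        offset = begin - (int(self._cumsum[first - 1]) if first > 0 else 0)
        return self._document_order[first : last + 1], offset
```

This sketch keeps the full cumsum in memory; as noted above, the actual implementation only needs a fraction of the cumsum entries, which is where most of the disk-space saving comes from.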

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier marked this pull request as ready for review February 7, 2025 04:19
@jlamypoirier jlamypoirier requested a review from tscholak February 7, 2025 05:01
@jlamypoirier jlamypoirier merged commit a637560 into main Feb 12, 2025
4 checks passed
@jlamypoirier jlamypoirier deleted the better_sampling branch February 12, 2025 23:37