
Improve dataset sampling #138


Merged: 17 commits merged into main on Feb 12, 2025
Conversation

@jlamypoirier (Collaborator) commented on Jan 31, 2025

✨ Description

Fixes: #130, #131, #132

  • Add new sampling options. These allow shuffling each epoch separately for consistent resampling ([feat] Consistent dataset re-sampling #130), and skipping shuffling for the first epoch or disabling it entirely for faster pre-sampling ([feat] Speed up dataset sampling #132).
  • Allow configuring sampling (e.g. seed, shuffling) separately for each phase and/or dataset ([feat] Option to configure sampling independently for each datasets #131). This takes the form of a SamplingConfig with a default in Data that can be overridden by a sampled wrapper on any (sampled) dataset, e.g. on the outermost dataset to configure a whole phase, or on a specific dataset to configure it independently (a rough sketch follows this list). This is a bit more complicated than I would like, but it should be less annoying to work with than the alternatives.
  • Distribute pre-sampling across all devices, for a speedup of up to world_size ([feat] Speed up dataset sampling #132)
  • New, much faster dataset pre-sampling method ([feat] Speed up dataset sampling #132):
    • Remove sample shuffling (shuffle_idx) since it's redundant with document shuffling.
    • Skip trivial document index for non-shuffled epochs.
    • Run pre-sampling with torch on the GPU, which makes operations such as shuffling ~100x faster.
    • Instead of pre-computing the sample index with a loop, compute it on the fly from pre-computed cumsums (see the second sketch at the end of this description). The cumsums are much faster to evaluate and usually take far less disk space, because only a fraction of them needs to be computed.
  • Add tests for the new cumsum method that use the old approach as a reference.
  • Add backward compatibility for the legacy format that uses the old approach. Legacy sampling is preserved, but the new format (since Modular dataset configuration #104) is changed, which should be ok since it's not really in use yet.
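
To illustrate the override mechanism, here is a minimal sketch of the layering: a data-level default SamplingConfig that a per-dataset wrapper can override. All class and field names below are assumptions for illustration, not the actual Fast-LLM schema.

```python
# Minimal sketch only: class and field names are illustrative assumptions.
import dataclasses
import typing


@dataclasses.dataclass
class SamplingConfig:
    seed: int = 0
    shuffle: str = "epoch"  # hypothetical values: "epoch", "skip_first_epoch", "disabled"


@dataclasses.dataclass
class SampledDatasetConfig:
    dataset: typing.Any
    # When set, overrides the data-level default for this dataset (or, on the
    # outermost dataset of a phase, for the whole phase).
    sampling: SamplingConfig | None = None


@dataclasses.dataclass
class DataConfig:
    datasets: dict[str, SampledDatasetConfig]
    sampling: SamplingConfig = dataclasses.field(default_factory=SamplingConfig)

    def get_sampling(self, name: str) -> SamplingConfig:
        # A per-dataset override takes precedence over the data-level default.
        return self.datasets[name].sampling or self.sampling
```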

Test sampling goes from 84s to 19s, a >4x improvement (see #132). With the dataset cache it goes to 26s (not sure about the previous figure, but it was probably proportional). And this is for a single GPU with default shuffling (the worst case); the improvement is much larger with parallel loading or reduced shuffling.
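
For concreteness, here is a rough sketch of the cumsum-based on-the-fly lookup described above. It is not the PR's actual code; the names, signatures, and exact shuffling scheme are assumptions. The idea is to shuffle documents once, keep only the cumulative token counts, and binary-search them per sample instead of materializing a full sample index.

```python
# Rough sketch only: names and structure are illustrative, not the PR's implementation.
import torch


class CumsumSampleLookup:
    def __init__(self, document_lengths: torch.Tensor, sequence_length: int, seed: int = 0):
        # Shuffle the documents once (on GPU if the lengths live there),
        # then keep only the cumulative token counts.
        generator = torch.Generator(device=document_lengths.device).manual_seed(seed)
        self._document_order = torch.randperm(
            len(document_lengths), generator=generator, device=document_lengths.device
        )
        self._cumsum = document_lengths[self._document_order].cumsum(0)
        self._sequence_length = sequence_length
        self.num_samples = int(self._cumsum[-1]) // sequence_length

    def __getitem__(self, index: int) -> tuple[torch.Tensor, int]:
        # Token span covered by this sample.
        begin = index * self._sequence_length
        end = begin + self._sequence_length
        # Binary-search the cumsum for the first and last documents of the span.
        first = int(torch.searchsorted(self._cumsum, begin, right=True))
        last = int(torch.searchsorted(self._cumsum, end - 1, right=True))
        # Offset of the sample's first token inside its first document.
        offset = begin - (int(self._cumsum[first - 1]) if first > 0 else 0)
        return self._document_order[first : last + 1], offset
```

This sketch keeps the full cumsum in memory; as noted above, the actual implementation only needs a fraction of the cumsum entries, which is where most of the disk-space saving comes from.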

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier marked this pull request as ready for review February 7, 2025 04:19
@jlamypoirier jlamypoirier requested a review from tscholak February 7, 2025 05:01
@jlamypoirier jlamypoirier merged commit a637560 into main Feb 12, 2025
4 checks passed
@jlamypoirier jlamypoirier deleted the better_sampling branch February 12, 2025 23:37