Modular dataset configuration #104

jlamypoirier · 2025-01-06T21:06:56Z

✨ Description

A set of composable, dynamic dataset configuration classes that allow defining arbitrary dataset definition schemes.

Dynamic configuration classes: experimental implementation in GPTDatasetConfig. If it works well we could use it elsewhere, e.g. for defining plugins. It defines a class registry that is populated in __init_subclass__, and works as long as the subclass is imported.
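A minimal sketch of the registry idea described above, for illustration only. `DatasetConfig`, `type_name`, `from_dict`, and the constructor arguments are hypothetical names, not necessarily what GPTDatasetConfig uses.

```python
from typing import ClassVar, Optional


class DatasetConfig:
    """Hypothetical stand-in for GPTDatasetConfig."""

    # Shared registry: type name -> config class.
    _registry: ClassVar[dict] = {}
    # Concrete subclasses set this; None means "not registered".
    type_name: ClassVar[Optional[str]] = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Runs once per subclass, when the class statement executes (import time).
        if cls.type_name is not None:
            DatasetConfig._registry[cls.type_name] = cls

    @classmethod
    def from_dict(cls, config: dict) -> "DatasetConfig":
        # Dispatch on the "type" entry and pass the rest to the constructor.
        config = dict(config)
        return DatasetConfig._registry[config.pop("type")](**config)


class MemmapConfig(DatasetConfig):
    type_name = "memmap"

    def __init__(self, path: str):
        self.path = path


class SliceConfig(DatasetConfig):
    type_name = "slice"

    def __init__(self, dataset: dict, begin: float, end: float):
        # A nested dataset config is resolved through the same registry.
        self.dataset = DatasetConfig.from_dict(dataset)
        self.begin, self.end = begin, end


config = DatasetConfig.from_dict(
    {"type": "slice", "begin": 0.0, "end": 0.9,
     "dataset": {"type": "memmap", "path": "data/train"}}
)
```

Because registration happens in `__init_subclass__`, a subclass defined in a module that is never imported is simply absent from the registry, which is the "works as long as the subclass is imported" caveat.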

The data config now has a dataset entry, which can be any dataset config. That dataset config may contain further nested dataset configs, etc.
Datasets can be sampled or samplable. Both support build_and_sample(config), but samplable (subclass of sampled) also supports build(). Indexed datasets are a special case of samplable datasets which also support indexing (get and len).
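The sampled / samplable / indexed layering could look roughly like this. It is a toy sketch: the class names, the `sampling` dict, and the random sampling strategy are assumptions made for illustration, not the PR's actual classes.

```python
import random
from abc import ABC, abstractmethod


class SampledConfig(ABC):
    """Anything that can produce a ready-to-use sampled dataset."""

    @abstractmethod
    def build_and_sample(self, sampling: dict) -> list:
        ...


class SamplableConfig(SampledConfig):
    """Can also build the unsampled dataset; sampling is a separate step."""

    @abstractmethod
    def build(self):
        ...

    def build_and_sample(self, sampling: dict) -> list:
        return self.build().sample(sampling)


class IndexedDataset:
    """Toy indexed dataset: supports get() and len(), plus sampling."""

    def __init__(self, samples: list):
        self._samples = samples

    def __len__(self) -> int:
        return len(self._samples)

    def get(self, index: int):
        return self._samples[index]

    def sample(self, sampling: dict) -> list:
        # Seeded generator so the same config yields the same samples.
        rng = random.Random(sampling["seed"])
        return [self.get(rng.randrange(len(self))) for _ in range(sampling["num_samples"])]


class IndexedConfig(SamplableConfig):
    """Samplable config whose build() result is indexable."""

    def __init__(self, samples: list):
        self.samples = samples

    def build(self) -> IndexedDataset:
        return IndexedDataset(self.samples)


class ConstantConfig(SampledConfig):
    """Sampled only: there is nothing to build() or index."""

    def __init__(self, sample):
        self.sample = sample

    def build_and_sample(self, sampling: dict) -> list:
        return [self.sample] * sampling["num_samples"]
```

The point of the split is that wrappers needing random access can require an indexed config, while wrappers that only consume samples can accept any sampled config.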

The types for now are:

memmap: A typical Megatron dataset, indexed
concatenated: The concatenation of multiple indexed datasets, treated as if it were one. Currently unused.
slice: A contiguous slice of an indexed dataset, for subsampling or train/valid/test split.
blended: Blend sampled datasets according to the given probabilities.
dummy: Always returns the same sample. Only available as sampled.
legacy: Same as before this PR, for backward compatibility only. This is the only way to define a dataset from JSON files, which we aim to replace with a concatenated one anyway.
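To make the slice, concatenated, and blended types concrete, here is a toy sketch over in-memory lists. All names are hypothetical, and the random draw in `blend` is a simplification chosen for brevity; it is not necessarily how the PR blends datasets.

```python
import bisect
import itertools
import random


class ListDataset:
    """Stand-in for an indexed (e.g. memmap) dataset."""

    def __init__(self, samples):
        self._samples = list(samples)

    def __len__(self):
        return len(self._samples)

    def get(self, index):
        return self._samples[index]


class Slice:
    """Contiguous [begin, end) view of an indexed dataset."""

    def __init__(self, dataset, begin, end):
        self._dataset, self._begin, self._end = dataset, begin, end

    def __len__(self):
        return self._end - self._begin

    def get(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._dataset.get(self._begin + index)


class Concatenated:
    """Several indexed datasets addressed as one."""

    def __init__(self, datasets):
        self._datasets = datasets
        # Cumulative end offsets, e.g. sizes [2, 8] -> [2, 10].
        self._ends = list(itertools.accumulate(len(d) for d in datasets))

    def __len__(self):
        return self._ends[-1] if self._ends else 0

    def get(self, index):
        if not 0 <= index < len(self):
            raise IndexError(index)
        # First dataset whose end offset is strictly greater than index.
        which = bisect.bisect_right(self._ends, index)
        start = self._ends[which - 1] if which else 0
        return self._datasets[which].get(index - start)


def blend(sampled, weights, num_samples, seed):
    """Draw each output sample from one source, chosen with the given probabilities."""
    rng = random.Random(seed)
    sources = [itertools.cycle(s) for s in sampled]
    choices = rng.choices(range(len(sampled)), weights=weights, k=num_samples)
    return [next(sources[i]) for i in choices]


base = ListDataset(range(10))
train, valid = Slice(base, 0, 8), Slice(base, 8, 10)  # train/valid split
both = Concatenated([valid, train])
mixed = blend([["a"], ["b"]], weights=[0.7, 0.3], num_samples=20, seed=0)
```

Slice and concatenated need random access, so they wrap indexed datasets; blending only consumes samples, so it can sit on top of anything sampled.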

Datasets are defined in data.datasets.[Phase]
Dataset classes may (and typically do) include nested dataset definitions.

Misc:

Remove split dataset machinery.

Breaking change: sample dataset source has been dropped since it's not that relevant. Otherwise configs are backward-compatible (for now).

For future work:

Concatenate dataset from all files in directory.
Remove capitalization on phase names in config
Make phase names more flexible? (plain str instead of enum)
Shuffle wrapper for indexed datasets?
Move dataset definition next to phase definition in train config?
More features.
Simplify things?

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

tscholak · 2025-01-20T22:29:10Z

Hi again,

I think we may still be talking past each other, so I'd like to clarify my position.

I really don't care whether we use __init_subclass__, a metaclass, or something else for the internal mechanism. That's an implementation detail, and it can live in a part of the codebase nobody ever touches again. The point I'm raising is about the interface contributors use to define and tag new configuration classes. This is what I care about:

No contributor should need to understand the internal mechanics to define config classes. Whether it's __init_subclass__ or not behind the scenes, it should be abstracted away. Nobody should need to dig into this unless they are maintaining that very core mechanism. Pydantic is a good example: nobody looks at its metaclass implementation, yet the interface is clear and consistent.
The tagging mechanism must be clear, conventional, and Pythonic. Using kwargs in class definitions isn't something any Python developers will expect. I've never seen this being used before, and I suspect others haven't either. Contributors need to understand how to define new configuration classes and tag them in a way that feels natural. Class-level attributes are well-known and widely used for tagging in Python. They are also familiar to developers who’ve worked with libraries like Pydantic.

> As for the separate kwarg vs ClassVar matter, this is really a design choice; we can discuss it in our next meeting.

I completely disagree that this is a subjective design choice. It's not about preference. It's about whether contributors can easily understand and use the tagging mechanism.

> Either way, I think it's mostly a matter of documentation. We either have to tell users to add a class kwarg or a ClassVar, and in either case, it magically triggers a hidden part of the code, so it’s equally opaque to users.

I have to disagree here, too. This isn't just a matter of documentation. Ideally, contributors can look at a sum type in the codebase and immediately understand how to define their own.

A class-level attribute like kind: ClassVar[str] = "memmap" is intuitive and self-explanatory for most Python developers.
A kwarg in a class definition (e.g., type_="memmap") is not something anyone expects. It’s unconventional and requires explanation, which adds cognitive overhead.
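For reference, the two tagging styles under discussion look like this side by side. This is an illustrative sketch with hypothetical base classes; only `kind: ClassVar[str] = "memmap"` and `type_="memmap"` come from the discussion itself.

```python
from typing import ClassVar, Optional


class KwargTagged:
    """Style 1: the tag is a keyword argument in the class statement."""

    registry: ClassVar[dict] = {}

    def __init_subclass__(cls, type_: Optional[str] = None, **kwargs):
        super().__init_subclass__(**kwargs)
        if type_ is not None:
            KwargTagged.registry[type_] = cls


class AttributeTagged:
    """Style 2: the tag is an ordinary class-level attribute."""

    registry: ClassVar[dict] = {}
    kind: ClassVar[Optional[str]] = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Look only at this class's own namespace, so an inherited tag
        # does not register the subclass a second time.
        kind = cls.__dict__.get("kind")
        if kind is not None:
            AttributeTagged.registry[kind] = cls


class MemmapA(KwargTagged, type_="memmap"):
    pass


class MemmapB(AttributeTagged):
    kind: ClassVar[str] = "memmap"
```

Both end up in a registry through the same hook. One visible difference is that the attribute stays readable afterwards (`MemmapB.kind`), while the kwarg is consumed by `__init_subclass__` and leaves nothing on the class.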

When you say “it's equally opaque,” I think you're discounting the value of familiarity. Familiar patterns reduce friction, while unconventional ones add it. By choosing the familiar approach, we make the tagging mechanism approachable for everyone.

> So the difference is really about the subjective matter of which one looks the scariest.

Yes, and I am here to tell you that the kwarg approach looks far scarier to most Python developers. Few people will have seen or used this pattern before, whereas tagging with class-level attributes is widely recognized and understood.

In summary, I strongly believe that the kwarg approach introduces unnecessary barriers to understanding and collaboration and should be replaced with class-level attributes before this PR is merged.

jlamypoirier · 2025-01-20T23:03:13Z

I still have a slight preference for kwargs, but I'll switch it back to a class attribute to avoid an empty debate about what a typical Python developer may or may not find understandable.

> No contributor should need to understand the internal mechanics to define config classes. Whether it's __init_subclass__ or not behind the scenes, it should be abstracted away. Nobody should need to dig into this unless they are maintaining that very core mechanism. Pydantic is a good example: nobody looks at its metaclass implementation, yet the interface is clear and consistent.

This won't be solved here either way; the only way to make the mechanics clear is by documenting them.

> The tagging mechanism must be clear, conventional, and Pythonic. Using kwargs in class definitions isn't something any Python developers will expect. I've never seen this being used before, and I suspect others haven't either. Contributors need to understand how to define new configuration classes and tag them in a way that feels natural. Class-level attributes are well-known and widely used for tagging in Python. They are also familiar to developers who’ve worked with libraries like Pydantic.

Again highly subjective, but I won't argue further.

jlamypoirier · 2025-01-20T23:18:50Z

> Alternatively, require subclasses to explicitly override the tag field, which can serve as both a type discriminator and a registration signal.

How would I do that? Other than implicitly by complaining on duplicates?
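One possible way to enforce an explicit override, sketched under assumptions and not what this PR does: check the subclass's own `__dict__` in `__init_subclass__`, so an inherited tag does not count. `TaggedConfig`, `tag`, and the "None means abstract" convention are hypothetical.

```python
from typing import ClassVar, Optional


class TaggedConfig:
    registry: ClassVar[dict] = {}
    tag: ClassVar[Optional[str]] = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # An inherited tag lives in a parent's __dict__, not in cls.__dict__.
        if "tag" not in cls.__dict__:
            raise TypeError(
                f"{cls.__name__} must set `tag` explicitly (use None for abstract classes)"
            )
        tag = cls.__dict__["tag"]
        if tag is None:
            return  # explicitly abstract: not registered
        if tag in TaggedConfig.registry:
            raise ValueError(f"Duplicate tag {tag!r}")
        TaggedConfig.registry[tag] = cls


class MemmapConfig(TaggedConfig):
    tag = "memmap"


try:
    class ForgotTag(MemmapConfig):  # inherits "memmap" but never sets its own tag
        pass
except TypeError as error:
    message = str(error)
```

The duplicate check still catches two classes claiming the same tag; the `__dict__` check additionally catches a subclass that silently inherits its parent's tag.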

tscholak

Thanks @jlamypoirier, LGTM

jlamypoirier added 3 commits January 6, 2025 16:06

Modular dataset configuration

147e33b

fixes

c41a2c5

fix

e013ba2

jlamypoirier marked this pull request as ready for review January 7, 2025 20:25

jlamypoirier requested a review from tscholak January 7, 2025 20:25

tscholak and others added 8 commits January 9, 2025 12:17

Merge branch 'main' into modular_dataset

6b45944

Merge branch 'main' into modular_dataset

952a03d

Generalize indexed

82285ae

fix

7011ca3

Modularize fim, decouple data from dataset, basic tests, misc

9574715

Make tests pass

5532b97

Remove split datasets

5d5e0ab

Make tests pass

baacc4e

jlamypoirier mentioned this pull request Jan 13, 2025

[meta] Fast-LLM Improvements Tracker 🌟 #100

Closed

jlamypoirier added 2 commits January 13, 2025 14:31

misc

09640d8

misc

a73acf6

jlamypoirier mentioned this pull request Jan 15, 2025

Typing improvements #114

Merged

8 tasks

Fix merge

bb1b87f

jlamypoirier force-pushed the modular_dataset branch from cf74d2d to bb1b87f Compare January 15, 2025 19:38

jlamypoirier added 3 commits January 15, 2025 15:57

Type hints

148b448

misc

13e4f43

Dataset tweaks

0219006

jlamypoirier mentioned this pull request Jan 16, 2025

Dataset tweaks #118

Merged

8 tasks

Merge branch 'dataset_tweaks' into modular_dataset

b9b516f

jlamypoirier changed the base branch from main to dataset_tweaks January 16, 2025 22:08

jlamypoirier added 5 commits January 16, 2025 17:09

fix

8a33cef

misc

62fbe01

misc

1934828

misc

c0be45c

Merge branch 'dataset_tweaks' into modular_dataset

6358d08

fix

17e3aea

Drop class kwarg

ab2f468

tscholak approved these changes Jan 21, 2025

View reviewed changes

jlamypoirier added 3 commits January 21, 2025 20:19

fixes

27587a4

fixes

ca1f944

fixes

3c17819

Base automatically changed from dataset_tests to main January 22, 2025 03:02

jlamypoirier added 10 commits January 21, 2025 22:03

Merge branch 'dataset_tests' into modular_dataset

bd2fcec

Merge branch 'main' into modular_dataset

5405d42

Fix merge

0d8bf14

Fix merge

a0aae75

Fix merge

9dbbcf9

Fix merge

6dea63e

fixes

8245041

fixes

77b1324

fixes

54e5fa5

Match legacy

755c355

jlamypoirier merged commit efb1afb into main Jan 22, 2025
4 checks passed

jlamypoirier deleted the modular_dataset branch January 22, 2025 05:29

This was referenced Jan 22, 2025

[doc] Document the data machinery #124

Open

[bug] Crash with multiple dataloader workers #125

Closed

[feat] Generalize dynamic config classes #126

Open

jlamypoirier restored the modular_dataset branch January 23, 2025 21:04

jlamypoirier mentioned this pull request Jan 28, 2025

[feat] Integrate dataset re-weighting and preprocessing into Fast-LLM for streamlined data loading #25

Closed

jlamypoirier mentioned this pull request Feb 7, 2025

Improve dataset sampling #138

Merged

8 tasks

jlamypoirier mentioned this pull request Feb 19, 2025

Multi-Dataset Validation with LM-Loss #65

Closed

4 tasks

tscholak mentioned this pull request May 14, 2025

[Prototype] Generalize dynamic config classes #245

Draft

8 tasks
