Skip to content

Modular dataset configuration #104

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 56 commits into from
Jan 22, 2025
Merged

Modular dataset configuration #104

merged 56 commits into from
Jan 22, 2025

Conversation

jlamypoirier
Copy link
Collaborator

@jlamypoirier jlamypoirier commented Jan 6, 2025

✨ Description

A set of composable and dynamic dataset configuration classes, that allow defining arbitrary dataset definition schemes.

Dynamic configuration classes: experimental implementation in GPTDatasetConfig. if it works well we could use elsewhere, ex. for defining plugins.. It defines a class registry that is populated in __init_subclass__, and works as long as the subclass is imported.

The data config now has a dataset entry, which can be any dataset config. That dataset config may contain further nested dataset configs, etc.
Datasets can be sampled or samplable. Both support build_and_sample(config), but samplable (subclass of sampled) also supports build(). Indexed datasets are a special case of samplable datasets which also support indexing (get and len).

The types for now are:

  • memmap: A typical Megatron dataset, indexed
  • concatenated: The concatenation of multiple indexed datasets if it were one. Currently unused.
  • slice: A contiguous slice of an indexed dataset, for subsampling or train/valid/test split.
  • blended: Blend sampled datasets according to the given probabilities.
  • dummy: Always returns the same sample. Only available as sampled.
  • legacy: Same as before this PR, for backward compatibility only. This is the only way to do dataset from json files, which we aim to replace with a concatenated one anyway.

Datasets are defined are data.datasets.[Phase]
Dataset classes may (and typically do) include nested dataset definitions.

Misc:

  • Remove split dataset machinery.

Breaking change: sample dataset source has been dropped since it's not that relevant. Otherwise configs are backward-compatible (for now).

For future work:

  • Concatenate dataset from all files in directory.
  • Remove capitalization on phase names in config
  • Make phase names more flexible? (plain str instead of enum)
  • shuffle wrapper for indexed datasets?
  • Move dataset definition next to phase definition in train config?
  • More features.
  • Simplify things?

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier marked this pull request as ready for review January 7, 2025 20:25
@jlamypoirier jlamypoirier requested a review from tscholak January 7, 2025 20:25
@jlamypoirier jlamypoirier mentioned this pull request Jan 15, 2025
8 tasks
@jlamypoirier jlamypoirier mentioned this pull request Jan 16, 2025
8 tasks
@jlamypoirier jlamypoirier changed the base branch from main to dataset_tweaks January 16, 2025 22:08
@tscholak
Copy link
Collaborator

tscholak commented Jan 20, 2025

Hi again,

I think we may still be talking past each other, so I'd like to clarify my position.

I really don't care whether we use __init_subclass__, a metaclass, or something else for the internal mechanism. That's an implementation detail, and it can live in a part of the codebase nobody ever touches again. The point I'm raising is about the interface contributors use to define and tag new configuration classes. This is what I care about:

  1. No contributor should need to understand the internal mechanics to define config classes. Whether it's __init_subclass__ or not behind the scenes, it should be abstracted away. Nobody should need to dig into this unless they are maintaining that very core mechanism. Pydantic is a good example: nobody looks at its metaclass implementation, yet the interface is clear and consistent.

  2. The tagging mechanism must be clear, conventional, and Pythonic. Using kwargs in class definitions isn't something any Python developers will expect. I've never seen this being used before, and I suspect others haven't either. Contributors need to understand how to define new configuration classes and tag them in a way that feels natural. Class-level attributes are well-known and widely used for tagging in Python. They are also familiar to developers who’ve worked with libraries like Pydantic.

As for the separate kwarg vs Classvar matter, this is really a design choice, we can discuss it in our next meeting.

I completely disagree that this is a subjective design choice. It's not about preference. It's about whether contributors can easily understand and use the tagging mechanism.

Either way, I think it's mostly a matter of documentation. We either have to tell users to add a class kwarg or a classvar, and in either case, it magically triggers a hidden part of the code, so it’s equally opaque to users.

I have to disagree here, too. This isn't just a matter of documentation. Ideally, contributors can look at a sum type in the codebase and immediately understand how to define their own.

  • A class-level attribute like kind: ClassVar[str] = "memmap" is intuitive and self-explanatory for most Python developers.
  • A kwarg in a class definition (e.g., type_="memmap") is not something anyone expects. It’s unconventional and requires explanation, which adds cognitive overhead.

When you say “it's equally opaque,” I think you're discounting the value of familiarity. Familiar patterns reduce friction, while unconventional ones add it. By choosing the familiar approach, we make the tagging mechanism approachable for everyone.

So the difference is really about the subjective matter of which one looks the scariest.

Yes, and I am here to tell you that the kwarg approach looks far scarier to most Python developers. Few people will have seen or used this pattern before, whereas tagging with class-level attributes is widely recognized and understood.

In summary, I strongly believe that the kwarg approach introduces unnecessary barriers to understanding and collaboration and should be replaced with class-level attributes before this PR is merged.

@jlamypoirier
Copy link
Collaborator Author

I still have a slight preference for kwargs, but I'll switch it back to a class attribute to avoid an empty debate about what a typical python developer may or may not find understandable.

  1. No contributor should need to understand the internal mechanics to define config classes. Whether it's __init_subclass__ or not behind the scenes, it should be abstracted away. Nobody should need to dig into this unless they are maintaining that very core mechanism. Pydantic is a good example: nobody looks at its metaclass implementation, yet the interface is clear and consistent.

This won't be solved here either way, the only way to make the mechanics clear is by documenting it.

  1. The tagging mechanism must be clear, conventional, and Pythonic. Using kwargs in class definitions isn't something any Python developers will expect. I've never seen this being used before, and I suspect others haven't either. Contributors need to understand how to define new configuration classes and tag them in a way that feels natural. Class-level attributes are well-known and widely used for tagging in Python. They are also familiar to developers who’ve worked with libraries like Pydantic.

Again highly subjective, but I won't argue further.

@jlamypoirier
Copy link
Collaborator Author

Alternatively, require subclasses to explicitly override the tag field, which can serve as both a type discriminator and a registration signal.

How would I do that? Other than implicitly by complaining on duplicates?

Copy link
Collaborator

@tscholak tscholak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jlamypoirier, Lgtm

Base automatically changed from dataset_tests to main January 22, 2025 03:02
@jlamypoirier jlamypoirier merged commit efb1afb into main Jan 22, 2025
4 checks passed
@jlamypoirier jlamypoirier deleted the modular_dataset branch January 22, 2025 05:29
@jlamypoirier jlamypoirier restored the modular_dataset branch January 23, 2025 21:04
@jlamypoirier jlamypoirier mentioned this pull request Feb 7, 2025
8 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants