Modular dataset configuration #104
Conversation
Hi again, I think we may still be talking past each other, so I'd like to clarify my position. I really don't care whether we use a kwarg or a class-level attribute as such.
I completely disagree that this is a subjective design choice. It's not about preference. It's about whether contributors can easily understand and use the tagging mechanism.
I have to disagree here, too. This isn't just a matter of documentation. Ideally, contributors can look at a sum type in the codebase and immediately understand how to define their own.
When you say “it's equally opaque,” I think you're discounting the value of familiarity. Familiar patterns reduce friction, while unconventional ones add it. By choosing the familiar approach, we make the tagging mechanism approachable for everyone.
Yes, and I am here to tell you that the kwarg approach looks far scarier to most Python developers. Few people will have seen or used this pattern before, whereas tagging with class-level attributes is widely recognized and understood. In summary, I strongly believe that the kwarg approach introduces unnecessary barriers to understanding and collaboration and should be replaced with class-level attributes before this PR is merged.
I still have a slight preference for kwargs, but I'll switch it back to a class attribute to avoid a fruitless debate about what a typical Python developer may or may not find understandable.
This won't be solved here either way; the only way to make the mechanics clear is by documenting them.
Again highly subjective, but I won't argue further.
How would I do that? Other than implicitly, by complaining about duplicates?
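To ground the discussion, here is a minimal, self-contained comparison of the two tagging styles debated above; the class names are hypothetical, not the actual classes in this PR:

```python
# Option A: the tag is passed as an __init_subclass__ keyword argument.
class BaseA:
    _types: dict[str, type] = {}

    def __init_subclass__(cls, *, type_: str, **kwargs):
        super().__init_subclass__(**kwargs)
        BaseA._types[type_] = cls

class MemmapA(BaseA, type_="memmap"):
    pass

# Option B: the tag is a class-level attribute (the style settled on above).
class BaseB:
    _types: dict[str, type] = {}
    type_: str | None = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.type_ is not None:
            BaseB._types[cls.type_] = cls

class MemmapB(BaseB):
    type_ = "memmap"

assert BaseA._types["memmap"] is MemmapA
assert BaseB._types["memmap"] is MemmapB
```

Both styles end up with the same registry; the disagreement is purely about which subclass declaration syntax is more familiar to contributors.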
Thanks @jlamypoirier, LGTM
✨ Description
A set of composable, dynamic dataset configuration classes that allow arbitrary dataset definition schemes.
- Dynamic configuration classes: experimental implementation in `GPTDatasetConfig`. If it works well, we could use it elsewhere, e.g. for defining plugins. It defines a class registry that is populated in `__init_subclass__` and works as long as the subclass is imported (see the sketch after this list).
- The data config now has a `dataset` entry, which can be any dataset config. That dataset config may contain further nested dataset configs, etc.
- Datasets can be sampled or samplable. Both support `build_and_sample(config)`, but samplable (a subclass of sampled) also supports `build()`. Indexed datasets are a special case of samplable datasets that also support indexing (`get` and `len`). The types for now are:
  - `memmap`: A typical Megatron dataset, indexed.
  - `concatenated`: The concatenation of multiple indexed datasets, as if they were one. Currently unused.
  - `slice`: A contiguous slice of an indexed dataset, for subsampling or a train/valid/test split.
  - `blended`: Blends sampled datasets according to the given probabilities.
  - `dummy`: Always returns the same sample. Only available as sampled.
  - `legacy`: Same as before this PR, for backward compatibility only. This is the only way to define a dataset from json files, which we aim to replace with a concatenated one anyway.
- Datasets are defined under `data.datasets.[Phase]`.
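To make the registry mechanism concrete, here is a minimal, self-contained sketch combining an `__init_subclass__` registry with the sampled/samplable split described above; all class names and implementation details are illustrative, not the actual Fast-LLM API:

```python
import typing

class DatasetConfigBase:
    # Registry mapping a type tag to its config class, populated automatically.
    _registry: typing.ClassVar[dict[str, type["DatasetConfigBase"]]] = {}
    # Class-level type tag; tagged subclasses register themselves.
    type_: typing.ClassVar[str | None] = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Registration happens at class-definition time, so it works as long
        # as the module defining the subclass has been imported somewhere.
        if cls.type_ is not None:
            if cls.type_ in DatasetConfigBase._registry:
                raise ValueError(f"Duplicate dataset type {cls.type_!r}")
            DatasetConfigBase._registry[cls.type_] = cls

    @classmethod
    def resolve(cls, type_: str) -> type["DatasetConfigBase"]:
        return DatasetConfigBase._registry[type_]

class SampledDatasetConfig(DatasetConfigBase):
    """Sampled datasets only support build_and_sample(config)."""

class SamplableDatasetConfig(SampledDatasetConfig):
    """Samplable datasets (a subclass of sampled) additionally support build()."""

class MemmapDatasetConfig(SamplableDatasetConfig):
    type_ = "memmap"

class BlendedDatasetConfig(SampledDatasetConfig):
    type_ = "blended"

assert DatasetConfigBase.resolve("memmap") is MemmapDatasetConfig
```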
Dataset classes may (and typically do) include nested dataset definitions, as in the example below.
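For illustration, a hypothetical nested configuration blending a sliced memmap dataset with another one; the field names are assumptions, not the exact schema introduced by this PR:

```python
# "type" selects the registered config class at each nesting level.
training_dataset = {
    "type": "blended",
    "probabilities": [0.7, 0.3],
    "datasets": [
        {
            # Train on the first 90% of dataset A (e.g. reserving the rest).
            "type": "slice",
            "begin": 0.0,
            "end": 0.9,
            "dataset": {"type": "memmap", "path": "/data/tokens_a"},
        },
        {"type": "memmap", "path": "/data/tokens_b"},
    ],
}
```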
Misc:
- Breaking change: the `sample` dataset source has been dropped since it's not that relevant. Otherwise configs are backward-compatible (for now).

For future work:
- `shuffle` wrapper for indexed datasets?

🔍 Type of change
Select all that apply: