We need to document the new data mechanism, introduced in #37, #40, #104, etc.
The doc should at least describe:
- The different types of datasets (samplable, sampled, indexed, etc.)
- The different dataset classes available for GPT, and generic ones.
- The dynamic dataset class instantiation mechanism (to move elsewhere if/once we generalize)
- How datasets are put together in `Data`
- Maybe also the dataset preparation mechanism.
It should let the user know how to:
- Prepare a dataset with the `prepare` command (a recipe is already in the doc; we might just need to point to it)
- Configure Fast-LLM for typical dataset use cases (e.g., a dataset split across multiple files, a train/valid/test split, blending multiple datasets, etc.)
- Extend GPT with a custom configurable dataset class
- Extend GPT data and/or datasets for a custom model (e.g., `gpt_custom`)
- (Advanced, can be postponed) Create a data and dataset class for a custom model that doesn't extend GPT, reusing the existing generic machinery. (Need to generalize dynamic configuration first.)
- (Can be postponed) Create a custom data class that doesn't reuse the dataset machinery. Easy, but not sure how much we need it.
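For the blending use case above, the doc could carry a short sketch like this one: draw each sample from one of several source datasets with probability proportional to a weight. The classes and method names are hypothetical stand-ins, not the actual Fast-LLM implementation.

```python
import random

# Hypothetical sketch of weighted dataset blending; illustrative only,
# not the actual Fast-LLM implementation.
class BlendedDataset:
    def __init__(self, datasets, weights, seed=0):
        total = sum(weights)
        self._datasets = datasets
        # Normalize weights into sampling probabilities.
        self._weights = [w / total for w in weights]
        self._rng = random.Random(seed)

    def sample(self):
        # Pick a source dataset with probability proportional to its weight,
        # then delegate to it.
        (dataset,) = self._rng.choices(self._datasets, weights=self._weights)
        return dataset.sample()

class ConstantDataset:
    """Toy dataset that always returns the same value."""
    def __init__(self, value):
        self._value = value

    def sample(self):
        return self._value

# Blend two toy datasets with a 3:1 weight ratio.
blend = BlendedDataset([ConstantDataset("a"), ConstantDataset("b")], [3, 1])
samples = [blend.sample() for _ in range(1000)]
```

With the fixed seed, roughly three quarters of the 1000 samples come from the first dataset, matching the 3:1 weights.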