We need to document the new data mechanism, introduced in #37, #40, #104, etc.
The doc should at least describe:
- The different types of datasets (samplable, sampled, indexed, etc.)
- The different dataset classes available for GPT, and generic ones.
- The dynamic dataset class instantiation mechanism (to move elsewhere if/once we generalize)
- How datasets are put together in `Data`
- Maybe also the dataset preparation mechanism.
It should let the user know how to:
- Prepare a dataset with the `prepare` command (a recipe is already in the doc; we might just need to point to it)
- Configure Fast-LLM for typical dataset use cases (e.g., a dataset split across multiple files, a train/valid/test split, blending multiple datasets, etc.)
- Extend GPT with a custom configurable dataset class
- Extend GPT data and/or datasets for a custom model (e.g., `gpt_custom`)
- (Advanced, can be postponed) Create a data and dataset class for a custom model that doesn't extend GPT, reusing the existing generic machinery. (Need to generalize dynamic configuration first.)
- (Can be postponed) Create a custom data class that doesn't reuse the dataset machinery. Easy, but not sure how much we need it.
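For the blending use case above, the doc could carry a short sketch like this one: draw each sample from one of several source datasets with probability proportional to a weight. The classes and method names are hypothetical stand-ins, not the actual Fast-LLM implementation.

```python
import random

# Hypothetical sketch of weighted dataset blending; illustrative only,
# not the actual Fast-LLM implementation.
class BlendedDataset:
    def __init__(self, datasets, weights, seed=0):
        total = sum(weights)
        self._datasets = datasets
        # Normalize weights into sampling probabilities.
        self._weights = [w / total for w in weights]
        self._rng = random.Random(seed)

    def sample(self):
        # Pick a source dataset with probability proportional to its weight,
        # then delegate to it.
        (dataset,) = self._rng.choices(self._datasets, weights=self._weights)
        return dataset.sample()

class ConstantDataset:
    """Toy dataset that always returns the same value."""
    def __init__(self, value):
        self._value = value

    def sample(self):
        return self._value

# Blend two toy datasets with a 3:1 weight ratio.
blend = BlendedDataset([ConstantDataset("a"), ConstantDataset("b")], [3, 1])
samples = [blend.sample() for _ in range(1000)]
```

With the fixed seed, roughly three quarters of the 1000 samples come from the first dataset, matching the 3:1 weights.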