[doc] Document the data machinery #124

Open
@jlamypoirier

Description

We need to document the new data mechanism, introduced in #37, #40, #104, etc.

The doc should at least describe:

  • The different types of datasets (samplable, sampled, indexed, etc.)
  • The different dataset classes available for GPT, and generic ones.
  • The dynamic dataset class instantiation mechanism (to move elsewhere if/once we generalize)
  • How datasets are put together in Data
  • Maybe also the dataset preparation mechanism.
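To make the first bullet concrete, here is a minimal sketch of how the indexed / samplable / sampled distinction could be expressed. All class and method names below are illustrative assumptions for the doc, not Fast-LLM's actual API:

```python
import abc
import random
import typing


class SampledDataset(abc.ABC):
    # Hypothetical: a dataset with a fixed number of samples,
    # consumed in order during training.
    @abc.abstractmethod
    def __len__(self) -> int: ...

    @abc.abstractmethod
    def __getitem__(self, index: int) -> typing.Any: ...


class SamplableDataset(abc.ABC):
    # Hypothetical: a dataset that can produce a SampledDataset
    # for a requested number of samples.
    @abc.abstractmethod
    def sample(self, num_samples: int, seed: int) -> SampledDataset: ...


class IndexedDataset(SamplableDataset):
    # Hypothetical: random access to raw documents; sampling draws
    # (with shuffling and possible repetition) from those documents.
    @abc.abstractmethod
    def get_document(self, index: int) -> typing.Any: ...


class ListDataset(IndexedDataset):
    # Toy concrete implementation backed by an in-memory list.
    def __init__(self, documents: list) -> None:
        self._documents = list(documents)

    def get_document(self, index: int) -> typing.Any:
        return self._documents[index]

    def sample(self, num_samples: int, seed: int) -> SampledDataset:
        rng = random.Random(seed)
        picks = [rng.randrange(len(self._documents)) for _ in range(num_samples)]
        return _SampledList([self._documents[i] for i in picks])


class _SampledList(SampledDataset):
    def __init__(self, items: list) -> None:
        self._items = items

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> typing.Any:
        return self._items[index]
```

The doc would replace these placeholder names with the real class hierarchy and explain where GPT-specific subclasses plug in.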

It should let the user know how to:

  • Prepare a dataset with the prepare command (recipe is already in the doc, might just need to point to it)
  • Configure Fast-LLM for typical dataset use cases (e.g. a dataset split across multiple files, a train/valid/test split, blending multiple datasets, etc.)
  • Extend GPT with a custom configurable dataset class
  • Extend GPT data and/or datasets for a custom model (e.g. gpt_custom)
  • (Advanced, can be postponed) Create Data and dataset classes for a custom model that doesn't extend GPT, reusing the existing generic machinery. (This requires generalizing dynamic configuration first.)
  • (Can be postponed) Create a custom Data class that doesn't reuse the dataset machinery. This is easy, but it's unclear how much we need it.
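For the dynamic instantiation and extension bullets, a minimal registry sketch may help orient the doc. The `type` key, decorator, and class names below are hypothetical illustrations of the general pattern, not Fast-LLM's actual mechanism:

```python
import typing

# Hypothetical registry mapping a config "type" string to a dataset class.
_DATASET_REGISTRY: dict[str, type] = {}


def register_dataset(type_name: str):
    """Class decorator registering a dataset class under a config `type` key."""
    def wrap(cls: type) -> type:
        _DATASET_REGISTRY[type_name] = cls
        return cls
    return wrap


def build_dataset(config: dict[str, typing.Any]):
    """Instantiate the dataset class named by config["type"],
    passing the remaining keys as constructor arguments."""
    cls = _DATASET_REGISTRY[config["type"]]
    kwargs = {k: v for k, v in config.items() if k != "type"}
    return cls(**kwargs)


@register_dataset("memory")
class MemoryDataset:
    # Toy leaf dataset holding documents in memory.
    def __init__(self, documents: list[str]) -> None:
        self.documents = documents


@register_dataset("blended")
class BlendedDataset:
    # Toy composite: blends sub-datasets (built recursively from their
    # own configs) with the given weights.
    def __init__(self, datasets: list[dict], weights: list[float]) -> None:
        self.datasets = [build_dataset(d) for d in datasets]
        self.weights = weights
```

A user extending GPT with a custom dataset class would then only need to register it and reference its `type` key from the config, which is the workflow the doc should walk through.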

Metadata

Assignees

No one assigned

    Labels

    documentation (Improvements or additions to documentation)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests