- This is a 4-week course that helps you understand the rich Hugging Face ecosystem.
- Slides
- Notebook
- Loading datasets in different formats (see the sketch after this list)
- Understand the structure of the loaded dataset
- Access and manipulate samples in the dataset
- Concatenate
- Interleave
- Map
- Filter
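
A minimal sketch of these operations with the Hugging Face `datasets` library; the dataset name (`imdb`), the `text` column, and the derived `n_chars` field are illustrative assumptions, not the course's exact data:

```python
from datasets import load_dataset, concatenate_datasets, interleave_datasets

# Load a dataset from the Hub (the dataset ID is an illustrative assumption)
ds_a = load_dataset("imdb", split="train")
ds_b = load_dataset("imdb", split="test")

# Inspect the structure and access samples
print(ds_a)              # features and number of rows
print(ds_a[0])           # a single sample as a dict
print(ds_a[:3]["text"])  # a slice of one column

# Concatenate: stack datasets that share the same features
combined = concatenate_datasets([ds_a, ds_b])

# Interleave: alternate samples drawn from several datasets
mixed = interleave_datasets([ds_a, ds_b], probabilities=[0.5, 0.5], seed=42)

# Map: apply a (batched) transform to every sample
combined = combined.map(
    lambda batch: {"n_chars": [len(t) for t in batch["text"]]},
    batched=True,
)

# Filter: keep only samples that satisfy a predicate
short = combined.filter(lambda sample: sample["n_chars"] < 500)
```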
- Slides
- Notebook
- Set up a tokenization pipeline (see the sketch after this list)
- Train the tokenizer
- Encode the input samples (single or batch)
- Test the implementation
- Save and load the tokenizer
- Decode the token_ids
- Wrap the tokenizer with the `PreTrainedTokenizer` class
- Save the pre-trained tokenizer
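
A minimal sketch of this pipeline with the `tokenizers` library; the corpus file, vocabulary size, and special tokens are illustrative assumptions. Note that the `transformers` wrapper that accepts a `tokenizer_object` is `PreTrainedTokenizerFast`, used here in place of the base `PreTrainedTokenizer` named above:

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
from transformers import PreTrainedTokenizerFast

# Set up a tokenization pipeline: BPE model with a byte-level pre-tokenizer/decoder
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
tokenizer.decoder = decoders.ByteLevel()

# Train the tokenizer on raw text files (the file path is an assumption)
trainer = trainers.BpeTrainer(vocab_size=30_000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Encode a single sample and a batch, then decode the token IDs back to text
enc = tokenizer.encode("Hello, Hugging Face!")
batch = tokenizer.encode_batch(["first sample", "second sample"])
print(enc.ids, tokenizer.decode(enc.ids))

# Save and reload the raw tokenizer
tokenizer.save("tokenizer.json")
tokenizer = Tokenizer.from_file("tokenizer.json")

# Wrap with the transformers fast-tokenizer class and save the wrapped version
wrapped = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer, unk_token="[UNK]", pad_token="[PAD]"
)
wrapped.save_pretrained("my-tokenizer")
```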
- Slides
- Notebook
- Download the model checkpoints: Here
- Set up the training pipeline (see the sketch after the hardware list)
- Dataset: BookCorpus
- Number of tokens: 1.08 billion
- Tokenizer: GPT-2 tokenizer
- Model: GPT-2 with a CLM (causal language modeling) head
- Optimizer: AdamW
- Parallelism: DDP (with L4 GPUs)
- Train the model on
- A100 80 GB single GPU
- L4 48 GB, single node, multiple GPUs
- V100 32 GB single GPU
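
A minimal sketch of this setup with the `transformers` Trainer, which uses AdamW by default and runs DDP when the script is launched with `torchrun`; the sequence length, batch size, learning rate, and output directory are illustrative assumptions:

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments)

# GPT-2 tokenizer; reuse the EOS token for padding since GPT-2 has no pad token
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# BookCorpus, tokenized in batches (the "text" column name follows the Hub dataset)
dataset = load_dataset("bookcorpus", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# GPT-2 architecture with a causal-LM head, trained from scratch
model = GPT2LMHeadModel(GPT2Config())

args = TrainingArguments(
    output_dir="gpt2-bookcorpus",        # assumed output path
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-4,
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("gpt2-bookcorpus")
tokenizer.save_pretrained("gpt2-bookcorpus")

# Single GPU:                       python train.py
# Single node, multi-GPU (DDP):     torchrun --nproc_per_node=4 train.py
```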
- Training report on wandb
- Text Generation
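
A minimal sketch of text generation from the trained checkpoint; the checkpoint directory matches the assumed `output_dir` above, and the prompt and sampling parameters are illustrative:

```python
from transformers import AutoTokenizer, GPT2LMHeadModel

# Load the trained checkpoint (assumed local directory from the training sketch)
tokenizer = AutoTokenizer.from_pretrained("gpt2-bookcorpus")
model = GPT2LMHeadModel.from_pretrained("gpt2-bookcorpus")

prompt = "Once upon a time"
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a continuation instead of greedy decoding
output_ids = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```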