This project trains a GPT-style language model on text data using PyTorch. It covers tokenization, dataset preparation, and a training loop with evaluation.
## Features

- Implements a configurable GPT model built on `torch.nn` for text sequence generation.
- Uses `torch.utils.data` for efficient dataset management (a minimal dataset sketch follows this list).
- Supports dynamic configuration of model and dataset parameters via Pydantic.
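As an illustration of the dataset handling (not the project's exact implementation, which lives in `tokenization.py`), a next-token-prediction dataset built on `torch.utils.data.Dataset` might look like this; the class name and constructor arguments are assumptions:

```python
# Hypothetical sketch of a next-token-prediction dataset; the real class
# in tokenization.py may differ in names and details.
import torch
from torch.utils.data import Dataset

class TextDataset(Dataset):
    """Slices a long stream of token ids into fixed-length training windows."""

    def __init__(self, token_ids: list[int], seq_len: int):
        self.data = torch.tensor(token_ids, dtype=torch.long)
        self.seq_len = seq_len

    def __len__(self) -> int:
        # Each sample needs seq_len input tokens plus one shifted target token.
        return len(self.data) - self.seq_len

    def __getitem__(self, idx: int):
        x = self.data[idx : idx + self.seq_len]          # model input
        y = self.data[idx + 1 : idx + self.seq_len + 1]  # next-token targets
        return x, y
```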
## Requirements

- Python 3.10+
- Dependencies: `torch`, `tqdm`, `pydantic`

## Installation

Install the dependencies with pip:

```bash
pip install torch torchvision pydantic tqdm tensorboard
```
## Project Files

- `blocks.py`: Building blocks of the GPT and Transformer models.
- `gpt.py`: The GPT model.
- `tokenizer.py`: The `Tokenizer` class for encoding and decoding text.
- `tokenization.py`: Functions for tokenization and dataset preparation.
- `train_gpt.py`: Main script for training and evaluating the GPT model.
## Command-Line Arguments

| Argument | Description | Default |
|---|---|---|
| `--tokenizer` | Path to the tokenizer file | `./toy_data/tiny_sp` |
| `--train_data` | Path to the training data | `./toy_data/tiny_sp_train.txt` |
| `--eval_data` | Path to the evaluation data | `./toy_data/tiny_sp_test.txt` |
| `--epochs` | Number of training epochs | `100` |
| `--embed_dim` | Embedding dimension | `384` |
| `--tgt_vocab_size` | Target vocabulary size | `384` |
| `--seq_len` | Sequence length | `256` |
| `--num_layers` | Number of transformer layers | `3` |
| `--expansion_factor` | Feedforward expansion factor | `2` |
| `--n_heads` | Number of attention heads | `3` |
| `--experiment_name` | Name of the experiment (logs saved under `runs/<experiment_name>`) | `None` |
| `--batch_size` | Training batch size | `64` |
| `--shuffle` | Shuffle the dataset | `True` |
## Usage

```bash
python train_gpt.py --tokenizer ./path/to/tokenizer --train_data ./path/to/train.txt --eval_data ./path/to/eval.txt --epochs 50 --experiment_name my_experiment
```
- Model checkpoints are saved every 5 epochs as `<experiment_name>_e<epoch>.pth`.
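A rough sketch of how the checkpointing and TensorBoard logging described above could be wired up; the variable names and helper functions (`train_one_epoch`, `evaluate`) are assumptions, not the project's verbatim code:

```python
# Illustrative only: checkpoint naming and TensorBoard logging as described above.
import torch
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir=f"runs/{experiment_name}")  # logs under runs/<experiment_name>

for epoch in range(1, epochs + 1):
    train_loss = train_one_epoch(model, train_loader, optimizer)  # hypothetical helper
    eval_loss = evaluate(model, eval_loader)                      # hypothetical helper

    writer.add_scalar("loss/train", train_loss, epoch)
    writer.add_scalar("loss/eval", eval_loss, epoch)

    # Save a checkpoint every 5 epochs using the naming scheme above.
    if epoch % 5 == 0:
        torch.save(model.state_dict(), f"{experiment_name}_e{epoch}.pth")
```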
## Model Configuration Defaults

| Parameter | Description | Default |
|---|---|---|
| `embed_dim` | Embedding dimension | `384` |
| `tgt_vocab_size` | Target vocabulary size | `384` |
| `seq_len` | Sequence length | `256` |
| `num_layers` | Number of transformer layers | `6` |
| `expansion_factor` | Feedforward expansion factor | `4` |
| `n_heads` | Number of attention heads | `6` |
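Since configuration is handled with Pydantic, the model configuration can be expressed roughly as follows; the class name and exact field layout are illustrative assumptions mirroring the defaults in the table above:

```python
# Hypothetical Pydantic config mirroring the model defaults listed above.
from pydantic import BaseModel

class GPTConfig(BaseModel):
    embed_dim: int = 384          # embedding dimension
    tgt_vocab_size: int = 384     # target vocabulary size
    seq_len: int = 256            # maximum sequence length
    num_layers: int = 6           # number of transformer layers
    expansion_factor: int = 4     # feedforward expansion factor
    n_heads: int = 6              # number of attention heads

# Fields can be overridden per experiment, e.g. GPTConfig(num_layers=3, n_heads=3).
config = GPTConfig()
```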
## Dataset Configuration Defaults

| Parameter | Description | Default |
|---|---|---|
| `batch_size` | Batch size | `64` |
| `shuffle` | Shuffle the dataset | `True` |
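These parameters map directly onto a standard `torch.utils.data.DataLoader`; a minimal sketch, reusing the hypothetical `TextDataset` from earlier, could look like:

```python
# Illustrative DataLoader setup using the defaults from the table above.
from torch.utils.data import DataLoader

# train_dataset: an instance of the project's dataset class,
# e.g. the hypothetical TextDataset sketched earlier.
train_dataset = TextDataset(token_ids=encoded_train_text, seq_len=256)

train_loader = DataLoader(
    train_dataset,
    batch_size=64,  # dataset default: batch_size
    shuffle=True,   # dataset default: shuffle
)
```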
## Customization

You can modify `gpt.py`, `tokenizer.py`, and/or `tokenization.py` to use a different model architecture or tokenizer setup as needed.
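If you swap in your own tokenizer, it only needs to provide encode/decode behaviour compatible with the training script; a minimal sketch of such an interface is below (the method names are assumptions, so check `tokenizer.py` for the real ones):

```python
# Hypothetical minimal tokenizer interface; the real Tokenizer class in
# tokenizer.py may use different method names and construction.
class WhitespaceTokenizer:
    def __init__(self, vocab: dict[str, int]):
        self.vocab = vocab
        self.inv_vocab = {i: t for t, i in vocab.items()}

    def encode(self, text: str) -> list[int]:
        # Map each whitespace-separated token to an id (0 for unknowns).
        return [self.vocab.get(tok, 0) for tok in text.split()]

    def decode(self, ids: list[int]) -> str:
        # Map ids back to tokens and join them with spaces.
        return " ".join(self.inv_vocab.get(i, "<unk>") for i in ids)
```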
## License

This project is open-source and licensed under the MIT License.