This repository contains the code for the ENLP course project at Georgetown University. It includes scripts for training baseline models and curriculum-learning models on the BabyLM dataset.
To download and prepare the dataset:
- Download the BabyLM dataset:
wget https://github.com/babylm/babylm.github.io/raw/main/babylm_data.zip
- Unzip the dataset in the project directory:
unzip babylm_data.zip
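If wget or unzip is not available on your system, the same two steps can be done with the Python standard library. The following is a minimal sketch, not part of the repository: the helper names download and unzip are hypothetical, and the URL is the one from the wget command above.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List

# URL copied from the wget command above.
DATA_URL = "https://github.com/babylm/babylm.github.io/raw/main/babylm_data.zip"


def download(url: str, dest: Path) -> Path:
    """Download url to dest, skipping the request if dest already exists.

    Hypothetical helper for illustration; not part of this repository.
    """
    if not dest.exists():
        urllib.request.urlretrieve(url, str(dest))
    return dest


def unzip(archive: Path, target_dir: Path) -> List[str]:
    """Extract archive into target_dir and return the archive's member names."""
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target_dir)
        return zf.namelist()
```

Calling unzip(download(DATA_URL, Path("babylm_data.zip")), Path(".")) from the project directory should leave the files where the commands above would put them.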
Install the required dependencies:
pip install -r requirements.txt
Clean the dataset before training the models:
python clean_data.py
To train the baseline models using the complete dataset:
python train_base.py --model_type <gpt, bert>
Replace <gpt, bert> with 'gpt' or 'bert' depending on the model you want to train.
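For readers unfamiliar with this kind of flag, here is a minimal sketch of how a --model_type option restricted to 'gpt' and 'bert' is typically parsed with argparse. It is an assumption about the general pattern, not a copy of train_base.py; the function name build_parser and the help text are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts only --model_type gpt or --model_type bert.

    Illustrative only; the repository's scripts may define the flag differently.
    """
    parser = argparse.ArgumentParser(description="Train a model on BabyLM (sketch).")
    parser.add_argument(
        "--model_type",
        choices=["gpt", "bert"],  # any other value makes argparse exit with an error
        required=True,
        help="Which architecture to train.",
    )
    return parser
```

With a parser like this, passing an unsupported value such as --model_type t5 stops the script with a usage message before any training starts.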
To train models using curriculum learning:
python train.py --model_type <gpt, bert>
Again, replace <gpt, bert> with 'gpt' or 'bert' as required.
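As background, curriculum learning presents training examples in order of increasing difficulty rather than all at once. The sketch below illustrates the general idea only. It assumes word count as the difficulty measure and a fixed number of cumulative stages; both are assumptions, and train.py may use a different criterion and schedule. The names word_count and curriculum_stages are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence


def word_count(text: str) -> int:
    """Assumed difficulty proxy: the number of whitespace-separated tokens."""
    return len(text.split())


def curriculum_stages(
    examples: Sequence[str],
    num_stages: int,
    difficulty: Callable[[str], int] = word_count,
) -> Iterator[List[str]]:
    """Yield cumulative training pools, easiest examples first.

    Stage k contains the easiest k/num_stages fraction of the data, so the
    final stage is the full dataset sorted by difficulty.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ordered = sorted(examples, key=difficulty)
    for stage in range(1, num_stages + 1):
        cutoff = len(ordered) * stage // num_stages
        yield ordered[:cutoff]
```

A training loop would iterate over these stages and train for some number of steps on each pool before moving to the next, larger one.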