ENLP Course Project

This repository contains all the necessary code for the ENLP course project at Georgetown University. It includes scripts for training baseline and curriculum learning models on the BabyLM dataset.

Instructions to run

To download and prepare the dataset:

Download the BabyLM dataset:

```
wget https://github.com/babylm/babylm.github.io/raw/main/babylm_data.zip
```

Unzip the dataset in the project directory:
```
unzip babylm_data.zip
```
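
If `wget` or `unzip` is not available (for example on Windows), the same two steps can be done with Python's standard library. This is a minimal sketch and is not part of the repository; the function names `download_dataset` and `extract_dataset` are hypothetical, and only the URL is taken from the command above.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL copied from the wget command in this README.
DATA_URL = "https://github.com/babylm/babylm.github.io/raw/main/babylm_data.zip"


def download_dataset(url: str = DATA_URL, zip_path: str = "babylm_data.zip") -> Path:
    """Download the archive unless a file with that name already exists."""
    target = Path(zip_path)
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def extract_dataset(zip_path: str, dest_dir: str = ".") -> list:
    """Unzip the archive into dest_dir and return the names of its members."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()
```

Calling `extract_dataset(str(download_dataset()))` from the project directory should leave the data where the later steps expect it, assuming the archive layout matches what `unzip` produces.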

Install the required dependencies:

```
pip install -r requirements.txt
```

Clean the dataset before training the models:

```
python clean_data.py
```

Training Models

To train the baseline models using the complete dataset:

```
python train_base.py --model_type <model_type>
```

Replace `<model_type>` with `gpt` or `bert`, depending on the model you want to train.
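
For readers unfamiliar with this kind of flag, the sketch below shows one common way a two-valued option like `--model_type` can be declared with `argparse`. It is an illustration only: `build_parser` is a hypothetical name, and `train_base.py` may parse its arguments differently or accept further options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real script may define more options than this.
    parser = argparse.ArgumentParser(description="Train a baseline model.")
    parser.add_argument(
        "--model_type",
        choices=["gpt", "bert"],  # any other value is rejected by argparse
        required=True,
        help="Architecture to train.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--model_type", "gpt"])
print(args.model_type)
```

With `choices` set, passing any value other than `gpt` or `bert` makes `argparse` exit with a usage error instead of starting a run.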

To train models using curriculum learning:

```
python train.py --model_type <model_type>
```

As with the baseline script, replace `<model_type>` with `gpt` or `bert`.
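
Curriculum learning generally means presenting easier examples before harder ones. The sketch below illustrates the idea in plain Python. It is not the repository's implementation: the difficulty measure (whitespace token count), the staged schedule, and the names `curriculum_stages` and `token_count` are all assumptions made for illustration, and `train.py` may use a different criterion and schedule.

```python
from typing import Callable, Iterator, List


def curriculum_stages(
    examples: List[str],
    difficulty: Callable[[str], float],
    num_stages: int,
) -> Iterator[List[str]]:
    """Yield growing training sets, easiest examples first.

    Stage k holds the easiest k/num_stages fraction of the data, so the
    final stage is the complete dataset.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    ranked = sorted(examples, key=difficulty)
    for stage in range(1, num_stages + 1):
        cutoff = max(1, len(ranked) * stage // num_stages)
        yield ranked[:cutoff]


def token_count(sentence: str) -> int:
    # Assumed difficulty proxy: shorter sentences count as easier.
    return len(sentence.split())


corpus = [
    "the cat sat on the mat",
    "hi",
    "dogs bark loudly",
    "look at that",
]
stages = list(curriculum_stages(corpus, token_count, num_stages=2))
```

Here the first stage holds the two shortest sentences and the second stage holds the whole corpus. A training loop would run some number of steps on each stage in turn.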
