# Fine-tuning GPT-2 with SpongeBob (and NanoGPT)

This educational repository contains code for fine-tuning GPT-2 on dialogue from the TV series *SpongeBob SquarePants*. The fine-tuning code comes from Karpathy's NanoGPT and uses PyTorch.
To set up the environment, run:

```
conda env create -f environment.yml
```
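Then activate the environment. The name below is a placeholder; the actual name is defined in `environment.yml`:

```
conda activate spongebob  # placeholder name; check environment.yml for the real one
```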
SpongeBob transcripts can be scraped from the fandom website using the `scrape.py` script and Selenium. The scripts for each episode are stored in `data`, and the full transcript of every episode is stored in `spongebob_anthology.txt`.
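For reference, below is a minimal sketch of the kind of scraping `scrape.py` might do with Selenium. The URL, CSS selector, and output filename are assumptions for illustration, not taken from the actual script:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Hypothetical example: fetch one episode's transcript page and save its text.
driver = webdriver.Chrome()
try:
    driver.get("https://spongebob.fandom.com/wiki/Help_Wanted/transcript")  # assumed URL
    # Fandom article text typically lives in the parser output container.
    article = driver.find_element(By.CSS_SELECTOR, "div.mw-parser-output")
    with open("help_wanted_transcript.txt", "w", encoding="utf-8") as f:
        f.write(article.text)
finally:
    driver.quit()
```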
The `preprocess.py` script takes the full transcript and removes non-dialogue lines. The result is stored in `spongebob_anthology_cleaned.txt`.
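As a rough illustration, the filtering could look something like the sketch below; the actual rules in `preprocess.py` may differ, and the dialogue pattern here is an assumption:

```python
import re

# Assume dialogue lines look like "Character: spoken text"; everything else
# (stage directions, scene headers, blank lines) is dropped.
DIALOGUE = re.compile(r"^[^:\n]+: \S")

with open("spongebob_anthology.txt", encoding="utf-8") as src, \
     open("spongebob_anthology_cleaned.txt", "w", encoding="utf-8") as dst:
    for line in src:
        if DIALOGUE.match(line):
            dst.write(line)
```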
After a clean version of the transcript is created, navigate to `data/spongebob/` and run the `prepare.py` script (from NanoGPT) to make the training and test splits.
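For example, from the repository root:

```
cd data/spongebob/
python prepare.py
```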
Use `train.py` to train the model, with either of the following arguments:

- `python train.py config/train_spongebob_char.py` to train from scratch and write output to `out_spongebob_char`.
- `python train.py config/finetune.py` to fine-tune from GPT-2 and write output to `out_spongebob_gpt`.
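NanoGPT's `train.py` also lets you override individual config values on the command line, assuming the standard NanoGPT configurator is unchanged; the specific values below are only illustrative:

```
python train.py config/finetune.py --out_dir=out_spongebob_gpt --max_iters=1000
```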
Use `sample.py` to sample from the model. Example:

```
python sample.py --out_dir={specified output dir} --start="{starting text}" --num_samples={number of samples}
```
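For instance, to draw five samples from the fine-tuned GPT-2 model (the starting prompt here is just an example):

```
python sample.py --out_dir=out_spongebob_gpt --start="SpongeBob: " --num_samples=5
```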