This repository provides a package containing the full experimental pipeline for evaluating whether adaptive pre-training enables language models to better represent the reading comprehension of specific populations.
More specifically, the goal is to determine whether the estimated surprisal from a population-adapted model is a better predictor of reading time than that from a baseline (non-adapted) model. Here, a population is defined as English L2 speakers with a specific L1 background: German, Italian, Mandarin, Portuguese, Russian, or Turkish. Reading times are drawn from participants in the MECO project (Kuperman et al., 2022). Adapted models are trained on a combination of learner data (a cleaned subcorpus of EFCAMDAT: Shatz, 2020) and geographically localized English corpora from participants’ countries of origin (CGLU: Dunn, 2020). For more information, this paper (link to be added) describes the methodology and evaluation used.
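As a concrete illustration of the core quantity: surprisal is the negative log-probability a model assigns to each token given its preceding context. Below is a minimal sketch of that computation with a HuggingFace causal LM. It is illustrative only, not the package's own implementation; the example sentence is arbitrary, and the choice of bits (rather than nats) is this sketch's convention.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/pythia-160m-deduped"  # same model as the usage examples below
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

ids = tokenizer("The quick brown fox jumps over the lazy dog.",
                return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(ids).logits

# Surprisal of token i is -log2 P(token_i | preceding tokens), so pair each
# position's predictive distribution with the token that actually follows it.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
surprisal_bits = -log_probs[torch.arange(targets.numel()), targets] / math.log(2)

for token, s in zip(tokenizer.convert_ids_to_tokens(targets), surprisal_bits):
    print(f"{token!r}: {s.item():.2f} bits")
```

The experiment asks whether per-word surprisal values like these, estimated by a population-adapted model, predict that population's reading times better than the baseline model's values do.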
This package is not available on PyPI. To install:
- Clone this repository:

  ```bash
  git clone https://github.com/joshramanathan/surprisal_llm_adaptation.git
  cd surprisal_llm_adaptation
  ```

- Install the required dependencies:

  ```bash
  pip install -r requirements.txt
  ```
It is recommended to use a virtual environment (e.g., `python -m venv .venv`) or a conda environment.
Some corpora used for model adaptation are not included in this repository due to licensing restrictions. You will need to obtain them separately:
- **EFCAMDAT**: Request access to the EF-Cambridge Open Language Database (EFCAMDAT) corpus here.
  - Once you have access to the corpus, navigate to `Cleaned_Subcorpus (Shatz, 2020)` and download `Final database (alternative prompts).xlsx`.
  - Place this file in `data/original_corpora/efcamdat/`.
- **CGLU**: The Corpus of Global Language Use (CGLU) v5.2 is available here.
- **MECO**: The Multilingual Eye-movement Corpus (L2 release 2.0) is available here.
  - Download the entire `release 2.0` folder and place it in `data/original_corpora/meco/`.
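Once everything is downloaded, a quick check like the following (not part of the package; paths follow the Directory Structure section below) can confirm the files landed in the right places:

```python
from pathlib import Path

corpora = Path("data/original_corpora")
expected = [
    corpora / "efcamdat" / "Final database (alternative prompts).xlsx",
    corpora / "meco",
    *(corpora / "cglu" / l1
      for l1 in ("German", "Italian", "Mandarin", "Portuguese", "Russian", "Turkish")),
]
for path in expected:
    print(("OK      " if path.exists() else "MISSING ") + str(path))
```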
Additionally, if desired, the `models` directory created as an artifact of this experiment can be downloaded from the associated OSF page here and placed in `data`.
- `data/`: Directory containing all corpora and generated data. This may be named anything, but is `data` by default.
  - `original_corpora/`: The directory in which downloaded corpora must be placed (as described above).
    - `cglu/`
      - `German/`
      - `Italian/`
      - `Mandarin/`
      - `Portuguese/`
      - `Russian/`
      - `Turkish/`
    - `efcamdat/`
    - `meco/`
- `surprisal_llm_adaptation/`: The package directory.
  - `__init__.py`
  - `runner.py`: Contains the `ExperimentRunner` class for orchestrating training and evaluation.
  - `surprisal_calc.py`: Contains the `SurprisalCalculator` class.
  - `utilities.py`: Helper variables and functions.
Once the corpora are organized as described above, you can run the full training and evaluation pipeline as conducted in the original experiment:
```bash
python main.py
```

This will create all artifact files and directories entirely within `data`. If you wish to modify the data directory or other parameters, you can adjust the `run_experiment` arguments inside `main.py`.
Alternatively, you can use the package manually:
```python
import surprisal_llm_adaptation

runner = surprisal_llm_adaptation.ExperimentRunner(model_id="EleutherAI/pythia-160m-deduped")
```

You may replace `model_id` with any HuggingFace model identifier for an autoregressive language model (e.g., GPT-2, Pythia). Optionally, you can also specify a different `data_dir` path, but the new `data_dir` must be organized as described in Directory Structure.
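For instance (`model_id` and `data_dir` are the documented parameters; the values here are illustrative):

```python
runner = surprisal_llm_adaptation.ExperimentRunner(
    model_id="gpt2",       # any autoregressive HuggingFace model
    data_dir="my_data",    # optional; defaults to "data"
)
```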
To begin adaptation and evaluation:
```python
runner.run_experiment()
```

Or call the individual steps separately:

```python
# corpus preparation
runner.get_efcamdat_dfs()
runner.get_cglu_dfs()
runner.combine_efcamdat_cglu()

# model adaptation
runner.adapt_models()

# model evaluation
runner.evaluate_model_production()
runner.evaluate_model_representation()

# reading-time regressions on MECO data
runner.get_meco_dfs()
runner.get_regression_dfs()
runner.fit_regression_models()

# plots
runner.graph_perplexities()
runner.graph_dlls()
```

NOTE: `run_experiment()` and `adapt_models()` may optionally be passed the following parameters:
- `num_proc` (`int`, optional): The number of processes (CPUs) to use in parallel during processing. Defaults to 24.
- `batch_size` (`int`, optional): The number of samples to be processed together during tokenization. Defaults to 100.
- `per_device_train_batch_size` (`int`, optional): The batch size to use per GPU. Defaults to 16.
- `block_size` (`int`, optional): The number of tokens in a single block to be given to the model. Defaults to 2048.
It is recommended to adjust `num_proc` and `batch_size` according to your machine's capabilities. Note, however, that recreating the exact original experiment requires the default values.
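For example, on a machine with fewer cores or less GPU memory (values purely illustrative):

```python
runner.adapt_models(
    num_proc=8,                     # fewer parallel workers
    batch_size=50,                  # smaller tokenization batches
    per_device_train_batch_size=4,  # smaller per-GPU training batches
    block_size=1024,                # shorter token blocks
)
```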
This project is licensed under the GNU General Public License. See the LICENSE file for details.