Systematic inference with DNNs

Setup

Install the simpletransformers library (anaconda/miniconda recommended):
https://simpletransformers.ai/

All code is in the folder code.
The scripts train_lm.py and train_clf.py are in the folder code/simpletransformers.

Creating formal language datasets

Run:

python create_datasets.py -task *task* -voc *vocabulary*

Task is one of the following:
copy: map sequences to themselves
reverse: map sequences to their reversal
uppercase: map sequences to their uppercased versions
count: map sequences to their token counts represented by a single number token
custom: map all sequences to a custom label given as the argument --custom_label
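
The mappings listed above can be illustrated with a short sketch. This is not the code in create_datasets.py: the function name apply_task and the representation of a sequence as a list of single-character tokens are assumptions made for illustration only.

```python
def apply_task(task, tokens, custom_label=None):
    """Return the target (tgt) for a source token sequence (src).

    Hypothetical helper: it mirrors the task descriptions in this README,
    not the actual implementation in create_datasets.py.
    """
    if task == "copy":
        return list(tokens)                 # sequence mapped to itself
    if task == "reverse":
        return list(reversed(tokens))       # sequence mapped to its reversal
    if task == "uppercase":
        return [t.upper() for t in tokens]  # each token uppercased
    if task == "count":
        return [str(len(tokens))]           # a single number token
    if task == "custom":
        if custom_label is None:
            raise ValueError("the custom task needs a custom label")
        return [custom_label]               # every sequence gets the same label
    raise ValueError(f"unknown task: {task}")


src = list("abba")
for task in ("copy", "reverse", "uppercase", "count"):
    print(task, apply_task(task, src))
print("custom", apply_task("custom", src, custom_label="X"))
```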

The vocabulary is given as a plain-text string, e.g. ab

Some examples:

python create_datasets.py -task copy -voc ab
python create_datasets.py -task different -voc cd
python create_datasets.py -task custom -voc abc --custom_label X

By default, the dataset is saved to data/task_vocabulary[:5]_int/src_tgt.txt, where vocabulary[:5] is the first five characters of the vocabulary.
The int distinguishes datasets created with the same parameters, so that existing datasets are not overwritten.
Saving options can be modified via the arguments --save_folder, --params_fname, and --data_fname.
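
As a rough sketch of the default naming scheme, the helper below picks the first unused folder of the form task_vocabulary[:5]_int. The function name default_dataset_dir is hypothetical, and starting the counter at 1 is an assumption based on the paths used in the full pipeline example (e.g. data/copy_ab_1); the actual logic in create_datasets.py may differ.

```python
import os
import tempfile


def default_dataset_dir(task, vocabulary, root="data"):
    """Return the first unused folder named root/task_vocabulary[:5]_int.

    Hypothetical helper, not taken from create_datasets.py.
    """
    prefix = f"{task}_{vocabulary[:5]}"  # only the first five vocabulary characters
    i = 1  # assumption: numbering starts at 1, as in data/copy_ab_1
    while os.path.exists(os.path.join(root, f"{prefix}_{i}")):
        i += 1  # folder already taken, so try the next integer
    return os.path.join(root, f"{prefix}_{i}")


# Demonstration in a temporary folder, so no real data folder is touched.
with tempfile.TemporaryDirectory() as root:
    first = default_dataset_dir("copy", "ab", root=root)
    os.makedirs(first)  # pretend the first dataset has been saved
    second = default_dataset_dir("copy", "ab", root=root)
    print(os.path.basename(first), os.path.basename(second))
```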

Training a language model (LM)

Create a dataset (see above). Then run:

python train_lm.py -d *paths_to_datasets* -m *model_type* (--pairs)

Multiple datasets can be given as argument to -d.
The training data is constructed from all arguments of -d.

With --pairs, the LM training data is formed from both src and tgt of the original data file.
Without it, only src is used for training the LM.
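
The effect of --pairs on the LM training data can be sketched as follows. The function name lm_training_lines is hypothetical, each dataset is represented here as a list of (src, tgt) tuples rather than read from src_tgt.txt, and writing src and tgt as separate training lines is an assumption; train_lm.py may combine them differently.

```python
def lm_training_lines(datasets, pairs=False):
    """Collect LM training lines from all datasets given to -d.

    Hypothetical helper: `datasets` is a list of datasets, each a list of
    (src, tgt) tuples. Emitting src and tgt as separate lines is an assumption.
    """
    lines = []
    for rows in datasets:        # every -d argument contributes data
        for src, tgt in rows:
            lines.append(src)    # src is always used
            if pairs:
                lines.append(tgt)  # tgt is used only with --pairs
    return lines


copy_ab = [("a b", "a b"), ("b a a", "b a a")]
reverse_cd = [("c d d", "d d c")]

print(lm_training_lines([copy_ab, reverse_cd]))
print(lm_training_lines([copy_ab, reverse_cd], pairs=True))
```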

By default, the LM training data is saved in the folder lm/lm_training_data and named vocabulary[:5]_int.txt.
The location of the training data can be modified via the argument --save_data_dir.

By default, the LM is saved in the folder "lm" and named model_vocabulary[:5]_int (for example, lm/bert_abcd_1).
The location of the saved LM can be modified via the argument --output_dir.

See the simpletransformers documentation (linked above) for the list of available LM types (e.g. BERT, RoBERTa).

Training runs on the CPU by default; pass --use_cuda to use the GPU.

Training a classifier (clf)

Create a dataset and train an LM on it (see above). Then run:

python train_clf.py -d *paths_to_datasets* -m *model_type* (--pairs)

Adding --pairs classifies sentence pairs rather than single sentences. Here, the label is based on the data file (the first -d argument gets the label 0, the second gets 1, etc.). Without --pairs the label is taken from tgt.
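
The two labelling modes described above can be sketched like this. The function name clf_examples and the tuple layout are assumptions for illustration; train_clf.py may build its training frame differently.

```python
def clf_examples(datasets, pairs=False):
    """Build classifier examples from the datasets given to -d.

    Hypothetical helper: `datasets` is a list of datasets in -d order,
    each a list of (src, tgt) tuples.
    """
    examples = []
    for file_index, rows in enumerate(datasets):
        for src, tgt in rows:
            if pairs:
                # Sentence pair; the label is the position of the data file
                # among the -d arguments (0 for the first, 1 for the second).
                examples.append((src, tgt, file_index))
            else:
                # Single sentence; the label is taken from tgt.
                examples.append((src, tgt))
    return examples


copy_ab = [("a b", "a b")]
reverse_cd = [("c d d", "d d c")]

print(clf_examples([copy_ab, reverse_cd], pairs=True))
print(clf_examples([copy_ab, reverse_cd]))
```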

Training runs on the CPU by default; pass --use_cuda to use the GPU.

Full pipeline for training clf from scratch

The example below does the following, using the default dataset names derived from task and vocabulary:

creates two datasets
trains an LM on the data points of both datasets (src and tgt, because of --pairs)
trains a clf on the same datasets, using the trained LM as the base model

python create_datasets.py -task copy -voc ab
python create_datasets.py -task different -voc cd

python simpletransformers/train_lm.py -d data/copy_ab_1/src_tgt.txt data/different_cd_1/src_tgt.txt -m bert --pairs

python simpletransformers/train_clf.py -d data/copy_ab_1/src_tgt.txt data/different_cd_1/src_tgt.txt -m bert -lm lm/bert_abcd_1 --pairs
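
For convenience, the four commands above could be chained from a small Python driver. This is only a sketch: the names pipeline_commands and run_pipeline are hypothetical, and it assumes the script is started from the folder code, since the commands use paths relative to it. By default it performs a dry run that only prints the commands.

```python
import subprocess

# Dataset paths exactly as they appear in the commands above.
DATASETS = ["data/copy_ab_1/src_tgt.txt", "data/different_cd_1/src_tgt.txt"]


def pipeline_commands():
    """Return the pipeline steps as argument lists (hypothetical helper)."""
    return [
        ["python", "create_datasets.py", "-task", "copy", "-voc", "ab"],
        ["python", "create_datasets.py", "-task", "different", "-voc", "cd"],
        ["python", "simpletransformers/train_lm.py",
         "-d", *DATASETS, "-m", "bert", "--pairs"],
        ["python", "simpletransformers/train_clf.py",
         "-d", *DATASETS, "-m", "bert", "-lm", "lm/bert_abcd_1", "--pairs"],
    ]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(cmd, check=True)


run_pipeline(pipeline_commands())  # dry run: prints the commands only
```

Calling run_pipeline(pipeline_commands(), dry_run=False) would execute the steps in order.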
