The configuration process for DyGIE relies on the jsonnet-based configuration system for AllenNLP. For more information on the AllenNLP configuration process in general, take a look at the AllenNLP guide.
DyGIE adds one layer of complexity on top of this. It factors the configuration into:
- Components that are common to all DyGIE models. These are defined in `template.libsonnet`.
- Components that are specific to a single model trained on a particular dataset. These are contained in the `.jsonnet` files in the `training_config` directory. They use the jsonnet inheritance mechanism to extend the base class defined in `template.libsonnet`. For more on jsonnet inheritance, see the jsonnet tutorial.
The `template.libsonnet` file leaves three variables unset. These must be set by the inheriting object. For an example of how this works, see `scierc_lightweight.jsonnet`.

- `data_paths`: A dict with paths to the train, validation, and test sets.
- `loss_weights`: Since DyGIE has a multitask objective, the individual losses are combined based on user-determined loss weights.
- `target_task`: After each epoch, the AllenNLP trainer assesses dev set performance and saves the model state that achieved the highest performance. Since DyGIE is multitask, the user must specify which task to use as the evaluation target. The options are `ner`, `rel`, `coref`, and `events`.
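Putting the three required fields together, a minimal inheriting config might look like the following sketch (the dataset name and file paths here are illustrative placeholders, not files shipped with the repo):

```jsonnet
local template = import "template.libsonnet";

template.DyGIE {
  // Hypothetical paths -- point these at your own processed dataset.
  data_paths: {
    train: "data/my_dataset/train.json",
    validation: "data/my_dataset/dev.json",
    test: "data/my_dataset/test.json",
  },
  // Train on NER only; zero out the other task losses.
  loss_weights: {
    ner: 1.0,
    relation: 0.0,
    coref: 0.0,
    events: 0.0
  },
  // Select the best checkpoint based on dev-set NER performance.
  target_task: "ner",
}
```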
Note that if you create your own config outside of the `training_config` directory, you'll need to modify the line

```jsonnet
local template = import "template.libsonnet";
```

so that it points to the template file.
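For instance, if your config lives in a directory that is a sibling of `training_config`, the import might become (the relative path here is an assumption about your directory layout):

```jsonnet
local template = import "../training_config/template.libsonnet";
```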
The user may also specify:

- `bert_model`: The name of a pretrained BERT model available on HuggingFace Transformers. The default is `bert-base-cased`.
- `max_span_width`: The maximum span length enumerated by the model. In practice, 8 performs well.
- `cuda_device`: By default, training is performed on CPU. To train on a GPU, specify a device.
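For example, to set all three of these optional fields in an inheriting config (the values below simply echo the default and suggestions mentioned above; `...` stands in for the required fields):

```jsonnet
template.DyGIE {
  ...
  bert_model: "bert-base-cased",
  max_span_width: 8,
  cuda_device: 0,
}
```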
TODO: By default, coref propagation is turned off; need to document how to turn it on here.
The jsonnet object inheritance model allows you to modify any (perhaps deeply-nested) field of the base object using `+:` notation; see the jsonnet docs for more detail on this. For example, if you'd like to change the batch size and the learning rate on the optimizer, you could do:
```jsonnet
template.DyGIE {
  ...
  data_loader +: {
    batch_size: 5
  },
  trainer +: {
    optimizer +: {
      lr: 5e-4
    }
  }
}
```
You can also add additional fields to the base class. For instance, if you'd like to train a model using an existing vocabulary, you could add:
```jsonnet
template.DyGIE {
  ...
  vocabulary: {
    type: "from_files",
    directory: [path_to_vocab_files]
  }
}
```
Add these lines to the relevant `.jsonnet` file:
```jsonnet
dataset_reader +: {
  token_indexers: {
    tokens: {
      type: "single_id"
    }
  }
},
model +: {
  embedder: {
    token_embedders: {
      tokens: {
        type: "embedding",
        embedding_dim: 100,
      }
    }
  }
}
```
Putting it all together, a complete config looks like this:

```jsonnet
local template = import "template.libsonnet";

template.DyGIE {
  // Required "hidden" fields.
  data_paths: {
    train: "data/scierc/processed_data/json/train.json",
    validation: "data/scierc/processed_data/json/dev.json",
    test: "data/scierc/processed_data/json/test.json",
  },
  loss_weights: {
    ner: 1.0,
    relation: 1.0,
    coref: 0.0,
    events: 0.0
  },
  target_task: "rel",
  // Optional "hidden" fields.
  bert_model: "allenai/scibert_scivocab_cased",
  cuda_device: 0,
  max_span_width: 10,
  // Modify the data loader and trainer.
  data_loader +: {
    batch_size: 5
  },
  trainer +: {
    optimizer +: {
      lr: 5e-4
    }
  },
  // Specify an external vocabulary.
  vocabulary: {
    type: "from_files",
    directory: "vocab"
  },
}
```