Source code corresponding to the research paper: "Testing BERT for Generality in Cross-dataset Question Answering Performance", by Bootsma, Gaasbeek, 't Lam, Sekar and Weijts.
The training procedures used in this notebook are based on Scheider's BERT training examples and on Devlin's paper introducing BERT.
We adopt the existing pre-trained model BERT (Bidirectional Encoder Representations from Transformers) to create state-of-the-art models for a specific downstream task. BERT is a transformer-based machine learning technique for Natural Language Processing, pre-trained on unlabeled text to learn deep bidirectional representations. We fine-tune BERT-based models for Question Answering using different industry-standard datasets. Afterwards, we evaluate each model on the evaluation sets of the other datasets, to test generality in cross-dataset Question Answering performance. We find that a model trained on a specific dataset outperforms the other models on that dataset's evaluation set by a significant margin, even for very similar datasets.
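As a rough illustration of this setup (not the exact code from the notebooks), the sketch below loads a pre-trained BERT model with an extractive QA head via the Hugging Face transformers library. The model name and tokenizer are assumptions, and the QA head is randomly initialized until it has been fine-tuned, so the extracted answer is meaningless at this point:

```python
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertForQuestionAnswering

# Assumed checkpoint name; the QA head on top of bert-base is freshly
# initialized and must be fine-tuned before it produces useful answers.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased")

# Encode a (question, context) pair; BERT's maximum sequence length is 512.
inputs = tokenizer(
    "Who introduced BERT?",
    "BERT was introduced by Devlin et al. in 2018.",
    return_tensors="tf",
    max_length=512,
    truncation=True,
)

# The QA head predicts start and end logits over the input tokens;
# the answer span is decoded from the argmax positions.
outputs = model(inputs)
start = int(tf.argmax(outputs.start_logits, axis=-1)[0])
end = int(tf.argmax(outputs.end_logits, axis=-1)[0])
answer = tokenizer.decode(inputs["input_ids"][0].numpy()[start : end + 1])
print(answer)
```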
The notebooks in this repository are intended to be run in Google Colab with GPU acceleration, but they can easily be modified to run locally.
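For example, a cell like the following (assuming TensorFlow) can verify that the Colab runtime actually has a GPU attached before training starts:

```python
import tensorflow as tf

# Empty list means no GPU: enable one via
# Runtime -> Change runtime type -> GPU.
gpus = tf.config.list_physical_devices("GPU")
print("GPU available:", bool(gpus), gpus)
```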
In order to fine-tune the BERT-base model, the required training set needs to be selected, and the path where the weights.h5 file is stored after training needs to be changed.
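A hypothetical configuration cell might look like the sketch below; the dataset identifier, the Google Drive path, and the checkpoint name are illustrative placeholders, not the notebooks' actual values:

```python
from transformers import TFBertForQuestionAnswering

# Placeholder values; adjust to your own Drive layout and chosen dataset.
TRAIN_SET = "squad_v1.1"  # hypothetical identifier: which training set to use
WEIGHTS_PATH = "/content/drive/MyDrive/bert_qa/weights.h5"  # hypothetical path

model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased")
# ... fine-tune `model` on the selected training set here ...

# Persist the fine-tuned weights in Keras .h5 format, matching the
# weights.h5 file the evaluation notebook expects (requires Drive mounted).
model.save_weights(WEIGHTS_PATH)
```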
For evaluation, the Google Drive paths of the weights and evaluation sets need to be changed to point to the correct files. Versions of the dev sets of SQuAD 1.1, SQuAD 2.0 and CoQA, trimmed to include only questions whose total tokenized length is smaller than BERT's maximum sequence length of 512, are included in the repository. The predictions generated by this notebook can then be scored using the evaluation script provided by SQuAD 2.0.
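A minimal sketch of that setup, with placeholder paths (the real file names in your Drive will differ), could look like:

```python
from google.colab import drive

# Mount Google Drive so the notebook can read the fine-tuned weights
# and the trimmed dev sets; only works inside a Colab runtime.
drive.mount("/content/drive")

# Hypothetical paths; point these at your own copies of the files.
WEIGHTS_PATH = "/content/drive/MyDrive/bert_qa/weights.h5"
EVAL_SET_PATH = "/content/drive/MyDrive/bert_qa/dev-v2.0-trimmed.json"

# After the notebook has written its predictions file, score it with the
# official SQuAD 2.0 evaluation script, e.g.:
#   python evaluate-v2.0.py dev-v2.0.json predictions.json
```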
The full paper is included in this repository, and can be read here or downloaded from the repository.