
Cross-Dataset QA Performance

Source code corresponding to the research paper: "Testing BERT for Generality in Cross-dataset Question Answering Performance", by Bootsma, Gaasbeek, 't Lam, Sekar and Weijts.

Training procedures used in these notebooks are based on Scheider's BERT training examples and on Devlin et al.'s paper introducing BERT.

Abstract

We adopt the existing pre-trained model BERT (Bidirectional Encoder Representations from Transformers) to create state-of-the-art models for a specific downstream task. BERT is a transformer-based machine learning technique for Natural Language Processing, pre-trained on unlabeled text to learn deep bidirectional representations. We fine-tune BERT-based models for Question Answering using different industry-standard datasets. Afterwards, we evaluate each model on the evaluation sets of the other datasets to test generality in cross-dataset Question Answering performance. We find that a model trained on a specific dataset outperforms the other models on that dataset's evaluation set by a significant margin, even when the datasets are very similar.

Overview

The notebooks in this repository are intended to be run in Google Colab with GPU acceleration. However, they can easily be modified to run locally.
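
For reference, here is a minimal sketch of the Colab setup the notebooks assume, namely a GPU runtime and a mounted Google Drive; the exact setup cells in the notebooks may differ:

```python
import tensorflow as tf

# Confirm that a GPU runtime is active (Runtime > Change runtime type in Colab).
print("GPU devices:", tf.config.list_physical_devices("GPU"))

# Mount Google Drive so trained weights and datasets persist across sessions.
# google.colab is only available inside Colab; when running locally, point
# the paths used later at a local directory instead.
from google.colab import drive
drive.mount("/content/drive")
```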

To fine-tune the BERT-base model, select the required training set and change the path where the weights.h5 file is stored after training.
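
The following is a hedged fine-tuning sketch using the Hugging Face transformers TensorFlow API; the model class, file path, and hyperparameters are illustrative assumptions, not the notebooks' exact code, which follows Scheider's training examples:

```python
import tensorflow as tf
from transformers import BertTokenizerFast, TFBertForQuestionAnswering

# Hypothetical Drive path; change this per training run so each
# dataset's fine-tuned weights are stored separately.
WEIGHTS_PATH = "/content/drive/MyDrive/qa_models/squad1_weights.h5"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = TFBertForQuestionAnswering.from_pretrained("bert-base-uncased")

# ... build a tf.data.Dataset of tokenized (question, context) pairs with
#     answer start/end positions from the selected training set ...

# A small learning rate is standard for BERT fine-tuning (Devlin et al.
# recommend values in the 2e-5 to 5e-5 range).
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5))
# model.fit(train_dataset, epochs=2)

# Persist the fine-tuned weights to Drive for the evaluation notebook.
model.save_weights(WEIGHTS_PATH)
```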

For evaluation, the Google Drive paths of the weights and evaluation sets need to be changed to point to the correct files. Versions of the dev sets of SQuAD 1.1, SQuAD 2.0, and CoQA, trimmed to only include questions whose total tokenized length is smaller than BERT's 512-token maximum sequence length, are included in the repository. The predictions generated by the evaluation notebook can then be scored using the evaluation script provided by SQuAD 2.0, as noted after the sketch below.
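
The trimming described above can be reproduced with a filter like the following sketch, which keeps only questions whose tokenized question-plus-context pair fits within BERT's 512-token limit; the file names and helper function are illustrative, not the repository's exact preprocessing code (CoQA also uses a different JSON layout, so this applies as-is only to the SQuAD-format sets):

```python
import json
from transformers import BertTokenizerFast

MAX_SEQ_LEN = 512  # BERT's maximum sequence length
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

def fits_in_bert(question: str, context: str) -> bool:
    # encode() adds the [CLS] and [SEP] special tokens, so the count
    # matches the sequence the model would actually receive.
    return len(tokenizer.encode(question, context)) <= MAX_SEQ_LEN

# Hypothetical input file; SQuAD-format dev sets nest questions ("qas")
# under paragraphs, which in turn sit under articles ("data").
with open("dev-v2.0.json") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        context = paragraph["context"]
        paragraph["qas"] = [qa for qa in paragraph["qas"]
                            if fits_in_bert(qa["question"], context)]

with open("dev-v2.0-trimmed.json", "w") as f:
    json.dump(squad, f)
```

The official SQuAD 2.0 evaluation script is typically invoked with the dev set and the generated predictions file, e.g. `python evaluate-v2.0.py dev-v2.0.json predictions.json`.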

Paper

The full paper is included in this repository and can be read or downloaded there.
