In this project, I aim to leverage a general Question Answering (QA) language model for transfer learning to a biomedical Question Answering application.
I treat the Stanford Question Answering Dataset (SQuAD) task, which includes more than 150,000 question-answer pairs with their corresponding context paragraphs from Wikipedia, as the general QA task. To better match the biomedical QA task, I use SQuAD version 1.1, which contains only answerable questions. A typical sample from SQuAD 1.1 looks like this:
question:"What was Maria Curie the first female recipient of?"
context:"One of the most famous people born in Warsaw was Maria Skłodowska-Curie, who achieved international recognition for her research on radioactivity and was the first female recipient of the Nobel Prize. Famous musicians include Władysław Szpilman and Frédéric Chopin. Though Chopin was born in the village of Żelazowa Wola, about 60 km (37 mi) from Warsaw, he moved to the city with his family when he was seven months old. Casimir Pulaski, a Polish general and hero of the American Revolutionary War, was born here in 1745."
answer:"Nobel Prize"
I treat BioASQ Task 7B as the biomedical-domain-specific QA task. The BioASQ dataset contains several types of QA pairs: yes/no questions, factoid questions, list questions, and summary questions. I focus only on solving the factoid questions. A sample factoid question is shown below:
question:"Which R package could be used for the identification of pediatric brain tumors?"
context:"MethPed: an R package for the identification of pediatric brain tumor subtypes"
answer:"MethPed"
My approach for this task is to apply RoBERTa to both the general and the biomedical QA tasks. The outline of the whole procedure is below.
Computing platform: 1 GPU with 15 GB of memory on Kaggle
Programming language and framework: Python 3, PyTorch 1.6.1, Transformers 2.11
My code is largely inspired by the question answering examples from Hugging Face.
To help RoBERTa better represent biomedical-domain context, I first pretrain the vanilla RoBERTa model on a biomedical corpus downloaded from PubMed Abstracts: ftp://ftp.ncbi.nlm.nih.gov/pubmed/baseline/. This procedure includes web scraping the text of biomedical publications, formatting and tokenizing the text, and training RoBERTa with a masked language modeling loss. The PubMed Abstracts corpus contains 2.48 billion words; I download only 2.5% of the corpus for pretraining, which is 618M words. The pretraining took more than 3 hours.
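A minimal sketch of this masked-language-modeling step using the Hugging Face Trainer API is shown below. The file name pubmed_abstracts.txt, the output directory, and the hyperparameters are placeholders, and exact argument names can differ slightly between Transformers versions.

```python
from transformers import (RobertaTokenizer, RobertaForMaskedLM,
                          LineByLineTextDataset, DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

# Start from the vanilla RoBERTa-Base checkpoint and its tokenizer.
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# One abstract (or sentence) per line; "pubmed_abstracts.txt" is a placeholder
# for the text scraped from the PubMed baseline dump.
dataset = LineByLineTextDataset(tokenizer=tokenizer,
                                file_path="pubmed_abstracts.txt",
                                block_size=512)

# Dynamic masking: 15% of the tokens are masked for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

args = TrainingArguments(output_dir="roberta-pubmed",
                         num_train_epochs=1,
                         per_device_train_batch_size=8,
                         save_steps=10000)

Trainer(model=model, args=args,
        data_collator=collator, train_dataset=dataset).train()

model.save_pretrained("roberta-pubmed")
tokenizer.save_pretrained("roberta-pubmed")
```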
My code for web scraping the PubMed corpus and processing the text is here.
My code for pretraining RoBERTa is here.
Once I have the pretrained RoBERTa model, I add a QA head on top of it to build the RoBERTa QA model. I then fine-tune the model on the SQuAD dataset, and after that on the BioASQ training dataset. The fine-tuning procedure is illustrated below.
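In the Transformers library this QA head corresponds to RobertaForQuestionAnswering: a linear layer on top of the encoder that produces start and end logits over the context tokens, trained with cross-entropy against the gold span positions. Below is a minimal sketch of one training step; the checkpoint path roberta-pubmed and the token positions are placeholders, not my exact training code.

```python
import torch
from transformers import RobertaTokenizer, RobertaForQuestionAnswering

# Load the (optionally PubMed-pretrained) encoder; a randomly initialised
# span-prediction head is added on top.
tokenizer = RobertaTokenizer.from_pretrained("roberta-pubmed")   # placeholder path
model = RobertaForQuestionAnswering.from_pretrained("roberta-pubmed")

question = "Which R package could be used for the identification of pediatric brain tumors?"
context = "MethPed: an R package for the identification of pediatric brain tumor subtypes"

inputs = tokenizer.encode_plus(question, context, return_tensors="pt")

# Gold answer span as token indices (illustrative values; in practice they are
# derived from the character-level answer_start offsets in the SQuAD-style data).
start_positions = torch.tensor([1])
end_positions = torch.tensor([2])

outputs = model(**inputs,
                start_positions=start_positions,
                end_positions=end_positions)
loss = outputs[0]   # average of start- and end-position cross-entropy losses
loss.backward()     # an optimizer step (e.g. AdamW) would follow
```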
The fine-tuning process is somewhat involved, so I split it into three subtasks.
- Load the SQuAD data from the JSON file and convert the raw data into a dataset the model can consume. I also convert the BioASQ data into the same format as the SQuAD dataset (a sketch of this conversion appears after this list). This part of the code is here.
- Fine-tune the QA model on the SQuAD dataset and then on the BioASQ dataset. The fine-tuning took more than 1 hour. This part of the code is here.
- Evaluate the model on the validation dataset (including retrieving the predicted answers). This part of the code is here.
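The BioASQ-to-SQuAD conversion mentioned above can be sketched roughly as follows, assuming the usual BioASQ JSON fields (body, snippets, exact_answer, type, id); the handling here is deliberately simplified (only the first snippet that literally contains the answer is kept).

```python
import json

def bioasq_to_squad(bioasq_path, squad_path):
    """Convert BioASQ factoid questions into the SQuAD 1.1 JSON layout."""
    with open(bioasq_path) as f:
        questions = json.load(f)["questions"]

    paragraphs = []
    for q in questions:
        if q.get("type") != "factoid":
            continue
        # exact_answer may be a list of synonym lists; keep the first string.
        ans = q["exact_answer"]
        answer = ans[0][0] if isinstance(ans[0], list) else ans[0]
        # Use the first snippet that actually contains the answer as the context.
        for snippet in q.get("snippets", []):
            context = snippet["text"]
            start = context.find(answer)
            if start == -1:
                continue
            paragraphs.append({
                "context": context,
                "qas": [{"id": q["id"],
                         "question": q["body"],
                         "answers": [{"text": answer, "answer_start": start}]}],
            })
            break

    squad = {"version": "BioASQ-7b-factoid",
             "data": [{"title": "BioASQ", "paragraphs": paragraphs}]}
    with open(squad_path, "w") as f:
        json.dump(squad, f)
```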
I use the same evaluation metrics as the SQuAD task: Exact Match (EM) and F1 score.
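Both metrics follow the official SQuAD evaluation script: answers are normalized (lower-cased, punctuation and articles removed) before comparison, EM checks for an exact string match, and F1 measures token overlap between the prediction and the gold answer. A self-contained sketch:

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lower-case, strip punctuation, articles, and extra whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Nobel Prize", "Nobel Prize"),
      f1_score("the Nobel Prize", "Nobel Prize"))
# -> 1.0 1.0 (articles are stripped during normalization)
```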
The results of my experiments are listed below.

Model | Pretrained on PubMed corpus (618M words) | Fine-tuned on SQuAD train set (89,705 samples) | Fine-tuned on BioASQ train set (3,029 samples) | Evaluated on SQuAD dev set (10,570 samples) | Evaluated on BioASQ dev set (460 samples) |
---|---|---|---|---|---|
RoBERTa-Base | No | Yes | No | F1=89.46 / EM=82.15 | F1=75.57 / EM=60 |
RoBERTa-Base | No | Yes | Yes | F1=80.65 / EM=71.36 | F1=81.68 / EM=66.3 |
RoBERTa-Base | Yes | Yes | No | F1=88.95 / EM=81.63 | F1=76.68 / EM=60.65 |
RoBERTa-Base | Yes | Yes | Yes | F1=78.75 / EM=69.4 | F1=84.88 / EM=72.6 |
Based on these results, the model pretrained on the PubMed corpus and fine-tuned on both the SQuAD and BioASQ datasets achieves the best score on the BioASQ dev set, while the model that is not pretrained on PubMed and is fine-tuned only on SQuAD achieves the best score on the SQuAD dev set. The results indicate:
- Fine-tuning the QA model on both the general and the domain-specific QA datasets remarkably improves the performance of the domain-specific QA model.
- Pretraining the RoBERTa model on a domain-specific corpus does improve the performance of the domain-specific QA model, but the improvement is modest. Increasing the volume of the pretraining corpus might yield a larger gain.
- An interesting observation is that pretraining the model on the PubMed corpus slightly drags down performance on the general QA task. Furthermore, fine-tuning the QA model on the BioASQ dataset significantly hurts the model's performance on the general QA task.
- In this project, I only apply RoBERTa-Base to the biomedical QA task. I believe that training and ensembling multiple models, such as RoBERTa-Large and ALBERT, would further improve performance.
- Pretraining the model on a larger domain-specific corpus should also help the model better handle the domain-specific QA task.