Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT
This repository provides the pre-trained BioALBERT models, a biomedical language representation model trained on large domain-specific (biomedical) corpora and designed for biomedical text mining tasks. Please refer to our paper (https://arxiv.org/abs/2107.04374) for more details.
We provide eight versions of pre-trained weights. Pre-training was based on the original ALBERT code, and training details are described in our paper. The currently available pre-trained weights are as follows:
- BioALBERT-Base v1.0 (PubMed) - based on the ALBERT-Base model
- BioALBERT-Base v1.0 (PubMed + PMC) - based on the ALBERT-Base model
- BioALBERT-Base v1.0 (PubMed + MIMIC-III) - based on the ALBERT-Base model
- BioALBERT-Base v1.0 (PubMed + PMC + MIMIC-III) - based on the ALBERT-Base model
- BioALBERT-Large v1.1 (PubMed) - based on the ALBERT-Large model
- BioALBERT-Large v1.1 (PubMed + PMC) - based on the ALBERT-Large model
- BioALBERT-Large v1.1 (PubMed + MIMIC-III) - based on the ALBERT-Large model
- BioALBERT-Large v1.1 (PubMed + PMC + MIMIC-III) - based on the ALBERT-Large model
Make sure to specify the version of the pre-trained weights used in your work.
The following sections describe how to install and fine-tune BioALBERT with PyTorch (Python version <= 3.7).
To fine-tune BioALBERT, you first need to download the BioALBERT pre-trained weights. After downloading the pre-trained weights, install BioALBERT using requirements.txt as follows:
git clone https://github.com/usmaann/BioALBERT.git
cd BioALBERT; pip install -r requirements.txt
Note that this repository is based on the ALBERT repository by Google. See requirements.txt for other details.
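The installation assumes Python 3.7 or lower. If your system Python is newer, one option is to create a dedicated environment first; the following is only a minimal sketch assuming conda is installed (the environment name bioalbert is arbitrary), after which you can run the clone and pip install commands above inside that environment:
$ conda create -n bioalbert python=3.7
$ conda activate bioalbert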
Link | Detail |
---|---|
Paper | https://arxiv.org/abs/2107.04374 (BibTeX below) |
If you use BioALBERT in your work, please cite it as:
@misc{naseem2021benchmarking,
      title={Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT},
      author={Usman Naseem and Adam G. Dunn and Matloob Khushi and Jinman Kim},
      year={2021},
      eprint={2107.04374},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
We provide a pre-processed version of benchmark datasets for each task as follows:
- BioASQ 4b
- BioASQ 5b
- BioASQ 6b
Open each link and download the datasets you need. For the BioASQ datasets, please refer to the BioBERT repository.
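The fine-tuning examples below assume the downloaded datasets are unpacked under a local datasets/ folder, e.g. ./datasets/NER/BC2GM and ./datasets/RE/GAD/1. This layout is only an assumption taken from the example commands in this README; a quick check might look like:
$ ls ./datasets/NER/BC2GM
>>> dev.tsv  test.tsv  train.tsv  train_dev.tsv
$ ls ./datasets/RE/GAD/1
>>> dev.tsv  test.tsv  train.tsv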
After downloading one of the pre-trained weights, unzip it into any directory you want; we will denote that directory as $BIOALBERT_DIR. For example, when using BioALBERT-Base v1.0 (PubMed), set the BIOALBERT_DIR environment variable as follows:
$ export BIOALBERT_DIR=./BioALBERT_PUBMED_BASE
$ echo $BIOALBERT_DIR
>>> ./BioALBERT_PUBMED_BASE
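As a quick sanity check, you can list the unzipped directory; the fine-tuning commands below expect a vocabulary file, a config file, and a model checkpoint inside $BIOALBERT_DIR. The exact file names shown here (standard TensorFlow checkpoint shards) are an assumption and may differ depending on the archive you downloaded:
$ ls $BIOALBERT_DIR
>>> bert_config.json  model.ckpt-1000000.data-00000-of-00001  model.ckpt-1000000.index  model.ckpt-1000000.meta  vocab.txt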
Each NER dataset contains four files: dev.tsv, test.tsv, train_dev.tsv, and train.tsv. Download an NER dataset and put these files into a directory called $NER_DIR. Also, set $OUTPUT_DIR as the directory for NER outputs. For example, when fine-tuning on the BC2GM dataset:
$ export NER_DIR=./datasets/NER/BC2GM
$ export OUTPUT_DIR=./NER_outputs
The following command runs the NER fine-tuning code with default arguments.
$ mkdir -p $OUTPUT_DIR
$ python run_ner.py --do_train=true --do_eval=true --vocab_file=$BIOALBERT_DIR/vocab.txt --bert_config_file=$BIOALBERT_DIR/bert_config.json --init_checkpoint=$BIOALBERT_DIR/model.ckpt-1000000 --num_train_epochs=10.0 --data_dir=$NER_DIR --output_dir=$OUTPUT_DIR
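To fine-tune on several NER datasets in one run, the same command can be wrapped in a small shell loop. This is only a sketch: BC2GM comes from the example above, while the second dataset name is a hypothetical placeholder for whatever other NER folders you have under ./datasets/NER:
for DATASET in BC2GM NCBI-disease; do  # NCBI-disease is a hypothetical example; use your own folder names
  export NER_DIR=./datasets/NER/$DATASET
  export OUTPUT_DIR=./NER_outputs/$DATASET
  mkdir -p $OUTPUT_DIR
  python run_ner.py --do_train=true --do_eval=true --vocab_file=$BIOALBERT_DIR/vocab.txt --bert_config_file=$BIOALBERT_DIR/bert_config.json --init_checkpoint=$BIOALBERT_DIR/model.ckpt-1000000 --num_train_epochs=10.0 --data_dir=$NER_DIR --output_dir=$OUTPUT_DIR
done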
Each RE dataset contains three files: dev.tsv, test.tsv, and train.tsv. Let $RE_DIR denote the folder of a single RE dataset, $TASK_NAME denote the task name (two options: gad, euadr), and $OUTPUT_DIR denote the RE output directory. Taking GAD as an example:
$ export RE_DIR=./datasets/RE/GAD/1
$ export TASK_NAME=gad
$ export OUTPUT_DIR=./re_outputs_1
The following command runs the RE fine-tuning code with default arguments.
$ python run_re.py --task_name=$TASK_NAME --do_train=true --do_eval=true --do_predict=true --vocab_file=$BIOALBERT_DIR/vocab.txt --bert_config_file=$BIOALBERT_DIR/bert_config.json --init_checkpoint=$BIOALBERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --do_lower_case=false --data_dir=$RE_DIR --output_dir=$OUTPUT_DIR
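The example above points $RE_DIR at fold 1 of GAD (./datasets/RE/GAD/1) and writes to ./re_outputs_1. If your GAD download is split into several numbered folds, as the folder structure suggests, each fold can be fine-tuned with a loop like this sketch (the fold count of 10 is an assumption; adjust it to the folds you actually have):
for FOLD in $(seq 1 10); do  # 10 folds is an assumption; adjust to your data
  export RE_DIR=./datasets/RE/GAD/$FOLD
  export OUTPUT_DIR=./re_outputs_$FOLD
  mkdir -p $OUTPUT_DIR
  python run_re.py --task_name=$TASK_NAME --do_train=true --do_eval=true --do_predict=true --vocab_file=$BIOALBERT_DIR/vocab.txt --bert_config_file=$BIOALBERT_DIR/bert_config.json --init_checkpoint=$BIOALBERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=32 --learning_rate=2e-5 --num_train_epochs=3.0 --do_lower_case=false --data_dir=$RE_DIR --output_dir=$OUTPUT_DIR
done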
For BioASQ, please refer to the BioBERT repository.
If you have any questions, please submit a GitHub issue or contact Usman Naseem (usman.naseem@sydney.edu.au).