Pretrained language models such as BERT have proven to be superb on many NLP tasks. However, your NLP task may be tied to a specific domain (e.g. finance, science, or medicine), in which case you might consider additional MLM pretraining on a large corpus related to that domain.
This method is known as Domain-Adaptive Pretraining, as suggested in Don't Stop Pretraining: Adapt Language Models to Domains and Tasks, which has shown it to be beneficial for downstream performance on various NLP tasks.
In this repo, I have used two variants of Korean BERT:
- Korean BERT from SKT-Brain, with 12 Transformer layers
- Its distilled version, DistilBERT, with 3 Transformer layers
Then, I carried out additional MLM pretraining with the Korean NLI dataset released by KakaoBrain, where each row has the format `sent1 \t sent2 \t label`.
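For illustration only, here is a minimal sketch of how such a tab-separated file could be flattened into individual sentences for MLM pretraining. This is an assumption about what the preprocessing roughly does, not the repo's actual code; the function name and file name are hypothetical.

```python
from pathlib import Path

def load_nli_sentences(tsv_path: str) -> list[str]:
    """Read an NLI-style TSV file (sent1 \t sent2 \t label) and
    return the individual sentences, which is all MLM needs."""
    sentences = []
    for line in Path(tsv_path).read_text(encoding="utf-8").splitlines():
        parts = line.split("\t")
        if len(parts) < 3:                    # skip malformed rows (a header row can be filtered similarly)
            continue
        sent1, sent2 = parts[0].strip(), parts[1].strip()
        sentences.extend([sent1, sent2])      # the label column is not used for MLM
    return sentences

# hypothetical file name under ./data
sents = load_nli_sentences("./data/kor_nli_train.tsv")
```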
- Place your own data under the `./data` directory
- Depending on the format of your data (whether each row contains a single sentence or multiple sentences), you might need to modify `preprocess.py`
- Run `train.py`. The `distill` argument is set to `True` by default, in which case the distilled Korean BERT is pretrained. Change it to `False` to pretrain the 12-layer BERT. (A conceptual sketch of this training step follows the list.)
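The actual training logic lives in this repo's `train.py`; purely as a library-level sketch of what additional MLM pretraining looks like, here is a hedged example using the HuggingFace `transformers` Trainer. The checkpoint name, output paths, and hyperparameters are placeholders, not the values used in this repo.

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"   # placeholder; swap in a Korean BERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# `sentences` is a list of strings, e.g. produced by the preprocessing sketch above
sentences = ["예시 문장입니다.", "도메인 말뭉치의 다른 문장입니다."]
encodings = tokenizer(sentences, truncation=True, max_length=128)

class MLMDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.encodings.items()}

# The collator applies the 15% / 80-10-10 masking scheme described below
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="./mlm-dapt", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=MLMDataset(encodings), data_collator=collator)
trainer.train()
model.save_pretrained("./mlm-dapt/final")   # continue-pretrained weights for the downstream task
```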
During MLM pretraining, the model sees masked inputs and must recover the original tokens, e.g.:
Input Sequence : The man went to [MASK] store with [MASK] dog
Target Sequence : the his
Randomly, 15% of the input tokens are selected and altered according to the following sub-rules (see the sketch after this list for a reference implementation):
- 80% of the selected tokens are replaced with the `[MASK]` token
- 10% are replaced with a random token (another word from the vocabulary)
- 10% are left unchanged, but still need to be predicted
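As a reference, here is a minimal sketch of this masking scheme, written from the rules above rather than taken from this repo's code.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15):
    """BERT-style masking: of the ~15% selected positions,
    80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Returns (masked_tokens, labels); labels is None where no prediction is made."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mlm_prob:
            labels.append(tok)                       # this position must be predicted
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)            # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)                      # not predicted
    return masked, labels

tokens = "the man went to the store with his dog".split()
toy_vocab = ["apple", "river", "school", "run", "blue"]   # toy vocabulary for the random-token case
print(mask_tokens(tokens, toy_vocab))
```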