Pretrained language models such as BERT have proven to be superb on many NLP tasks. However, your NLP task may be tied to a specific domain (e.g. finance, science, or medicine), in which case you might consider additional MLM pretraining on a large corpus related to that domain.
This method is known as Domain-Adaptive Pretraining, as suggested in Don't Stop Pretraining: Adapt Language Models to Domains and Tasks, which has shown it to be beneficial for downstream performance on various NLP tasks.
In this repo, I have used two variants of Korean BERT:
- Korean BERT from SKT-Brain, with 12 Transformer layers
- Its distilled version, DistilBERT, with 3 Transformer layers
Then, I carried out additional MLM pretraining with the Korean NLI dataset released by KakaoBrain, where each row has the format `sent1 \t sent2 \t label`.
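For illustration only, here is a minimal sketch of how such a tab-separated file could be flattened into individual sentences for MLM pretraining. This is an assumption about what the preprocessing roughly does, not the repo's actual code; the function name and file name are hypothetical.

```python
from pathlib import Path

def load_nli_sentences(tsv_path: str) -> list[str]:
    """Read an NLI-style TSV file (sent1 \t sent2 \t label) and
    return the individual sentences, which is all MLM needs."""
    sentences = []
    for line in Path(tsv_path).read_text(encoding="utf-8").splitlines():
        parts = line.split("\t")
        if len(parts) < 3:                    # skip malformed rows (a header row can be filtered similarly)
            continue
        sent1, sent2 = parts[0].strip(), parts[1].strip()
        sentences.extend([sent1, sent2])      # the label column is not used for MLM
    return sentences

# hypothetical file name under ./data
sents = load_nli_sentences("./data/kor_nli_train.tsv")
```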
- Place your own data under the `./data` directory
- Depending on the format of your data (whether each row contains a single sentence or multiple sentences), you might need to modify `preprocess.py`
- Run `train.py`. The `distill` argument is set to `True` by default, in which case the distilled Korean BERT is pretrained. Change it to `False` to pretrain the 12-layer BERT. (A conceptual sketch of this training step follows the list.)
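The actual training logic lives in this repo's `train.py`; purely as a library-level sketch of what additional MLM pretraining looks like, here is a hedged example using the HuggingFace `transformers` Trainer. The checkpoint name, output paths, and hyperparameters are placeholders, not the values used in this repo.

```python
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "bert-base-multilingual-cased"   # placeholder; swap in a Korean BERT checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# `sentences` is a list of strings, e.g. produced by the preprocessing sketch above
sentences = ["예시 문장입니다.", "도메인 말뭉치의 다른 문장입니다."]
encodings = tokenizer(sentences, truncation=True, max_length=128)

class MLMDataset(torch.utils.data.Dataset):
    def __init__(self, encodings):
        self.encodings = encodings
    def __len__(self):
        return len(self.encodings["input_ids"])
    def __getitem__(self, idx):
        return {k: v[idx] for k, v in self.encodings.items()}

# The collator applies the 15% / 80-10-10 masking scheme described below
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="./mlm-dapt", num_train_epochs=1,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=MLMDataset(encodings), data_collator=collator)
trainer.train()
model.save_pretrained("./mlm-dapt/final")   # continue-pretrained weights for the downstream task
```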
During MLM pretraining, the model sees masked inputs and must recover the original tokens, e.g.:
Input Sequence : The man went to [MASK] store with [MASK] dog
Target Sequence : the his
Randomly, 15% of the input tokens are selected and altered according to the following sub-rules (see the sketch after this list for a reference implementation):
- 80% of the selected tokens are replaced with the `[MASK]` token
- 10% are replaced with a random token (another word from the vocabulary)
- 10% are left unchanged, but still need to be predicted
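As a reference, here is a minimal sketch of this masking scheme, written from the rules above rather than taken from this repo's code.

```python
import random

def mask_tokens(tokens, vocab, mask_token="[MASK]", mlm_prob=0.15):
    """BERT-style masking: of the ~15% selected positions,
    80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Returns (masked_tokens, labels); labels is None where no prediction is made."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mlm_prob:
            labels.append(tok)                       # this position must be predicted
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)            # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(tok)                   # 10%: keep the original token
        else:
            masked.append(tok)
            labels.append(None)                      # not predicted
    return masked, labels

tokens = "the man went to the store with his dog".split()
toy_vocab = ["apple", "river", "school", "run", "blue"]   # toy vocabulary for the random-token case
print(mask_tokens(tokens, toy_vocab))
```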