Source code and data for SMedBERT: A Knowledge-Enhanced Pre-trained Language Model with Structured Semantics for Medical Text Mining
Our code is almost ready for you. Due to the importance of the commercial KG, we have to obtain permission from DXY on whether the privately owned datasets and KGs can be made public. For now, we only release our code so that you can train the SMedBERT model with your own Chinese medical data. If DXY agrees to release the data, we will release 5% of the entities in the KG together with their embeddings trained by a KGE algorithm for you to play around with; until then, you have to use your own KG. Feel free to open an issue if you have any questions.
- Python 3.6
- PyTorch 1.6
- transformers==2.2.1
- numpy
- tqdm
- scikit-learn
- jieba
We release the datasets with "open" licenses in the hub; the other datasets you may have to acquire on your own.
Since the authorization process at DXY is very slow, and keeping people waiting is embarrassing, I summarize the pre-training process as follows so that you can use your own KG to try out the model!
Our framework only needs the entities (with their corresponding aliases and types) and the relations between them in your KG. In fact, the aliases are only used to link spans in the input sentence to entities in the KG, so if a custom entity linker is available for your KG, the aliases are not necessary.
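As an illustration, a minimal alias-based linker could look like the sketch below. The file name `alias2ent.json` and the use of jieba segmentation are my assumptions, not part of the released code; adapt them to your own KG.

```python
import json
import jieba

# Hypothetical alias table: alias string -> canonical entity name in the KG.
with open("alias2ent.json", encoding="utf-8") as f:
    alias2ent = json.load(f)

def link_entities(sentence):
    """Return (span, entity_name) pairs found in the sentence via alias matching."""
    matches = []
    for token in jieba.cut(sentence):
        if token in alias2ent:
            matches.append((token, alias2ent[token]))
    return matches

print(link_entities("患者出现头痛和发热症状。"))
```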
The entity, relation and transfer matrix weights are necessary to use our framework, as you can see in the code. I recommend using DGL-KE to train the embeddings, since it is fast and scales to very large KGs.
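As a sketch of what these weights look like once training is done: DGL-KE dumps the learned embeddings as NumPy arrays, and the transfer (projection) matrices come from a TransR-style model. The file names and shapes below are assumptions for illustration only.

```python
import numpy as np

# Hypothetical file names; DGL-KE saves trained embeddings as .npy arrays.
ent_emb = np.load("kgs/entity_embedding.npy")    # shape: (num_entities, dim)
rel_emb = np.load("kgs/relation_embedding.npy")  # shape: (num_relations, dim)
# Transfer/projection matrices, e.g. from a TransR-style KGE model.
transfer = np.load("kgs/transfer_matrix.npy")    # shape: (num_relations, dim, dim)

print(ent_emb.shape, rel_emb.shape, transfer.shape)
```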
As we mention in the paper, an entity in the KG may have too many neighbours, and we have to decide which of them to use. We perform PageRank on the KG, and the value of each entity (node) is used as its weight, as shown in the code. You need to arrange the values into the JSON format; see the sketch below.
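A minimal sketch of computing the PageRank weights and dumping them to JSON, assuming your KG is available as a tab-separated triple file (networkx and the file names are my choices, not part of the released code):

```python
import json
import networkx as nx

# Hypothetical triple file: one "head\trelation\ttail" per line.
graph = nx.DiGraph()
with open("triples.tsv", encoding="utf-8") as f:
    for line in f:
        head, _, tail = line.rstrip("\n").split("\t")
        graph.add_edge(head, tail)

# PageRank value per entity, used as the neighbour-sampling weight.
pagerank = nx.pagerank(graph)

with open("ent2pagerank.json", "w", encoding="utf-8") as f:
    json.dump(pagerank, f, ensure_ascii=False)
```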
As we often need the neighbours of a linked entity, we build a dict beforehand to avoid unnecessary computation. Two files are needed, 'ent2outRel.pkl' and 'ent2inRel.pkl', for the outgoing and incoming relations respectively. The format should be ent_name -> [(rel_name, ent_name), ..., (rel_name, ent_name)].
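A minimal sketch of building these two pickles from the same hypothetical triple file as above:

```python
import pickle
from collections import defaultdict

ent2outRel = defaultdict(list)  # head -> [(relation, tail), ...]
ent2inRel = defaultdict(list)   # tail -> [(relation, head), ...]

with open("triples.tsv", encoding="utf-8") as f:
    for line in f:
        head, rel, tail = line.rstrip("\n").split("\t")
        ent2outRel[head].append((rel, tail))
        ent2inRel[tail].append((rel, head))

with open("ent2outRel.pkl", "wb") as f:
    pickle.dump(dict(ent2outRel), f)
with open("ent2inRel.pkl", "wb") as f:
    pickle.dump(dict(ent2inRel), f)
```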
As we propose hyper-attention that makes use of entity-type knowledge, we need a dict that provides our model with such information. The format should be ent_name -> type_val, where type_val can be a type name or a type id.
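For example, such a dict can be built from a simple entity-to-type table; the input file name and the pickle output are assumptions for illustration.

```python
import pickle

ent2type = {}
# Hypothetical table: one "entity\ttype" per line.
with open("entity_types.tsv", encoding="utf-8") as f:
    for line in f:
        ent, type_val = line.rstrip("\n").split("\t")
        ent2type[ent] = type_val  # type name; replace with a type id if preferred

with open("ent2type.pkl", "wb") as f:
    pickle.dump(ent2type, f)
```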
As shown in the code, name2id files are needed to provide the mapping between entities/relations and their corresponding resources. The format is straightforward, as you can see.
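A minimal sketch of producing name-to-id maps whose ids index rows of the embedding matrices (the file names and the JSON output format are assumptions):

```python
import json

def build_name2id(names):
    """Assign consecutive ids in the order the names appear."""
    return {name: idx for idx, name in enumerate(names)}

# Hypothetical name lists, one name per line, in the same order as the embedding rows.
with open("entities.txt", encoding="utf-8") as f:
    ent2id = build_name2id(line.strip() for line in f)
with open("relations.txt", encoding="utf-8") as f:
    rel2id = build_name2id(line.strip() for line in f)

with open("ent2id.json", "w", encoding="utf-8") as f:
    json.dump(ent2id, f, ensure_ascii=False)
with open("rel2id.json", "w", encoding="utf-8") as f:
    json.dump(rel2id, f, ensure_ascii=False)
```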
python -m torch.distributed.launch --nproc_per_node=4 run_pretraining_stream.py
Note that since very large files need to be loaded into memory, the program may appear to freeze at first.
- Download the Pre-trained Model (BaiduPan link: https://pan.baidu.com/s/1T0L6uv3JzY6dT3mcX_mghQ passwd: ea6f), and put it into the main folder.
- Download the KG embedding (BaiduPan link: https://pan.baidu.com/s/19V-M70TdndPCR50r5Z2OYQ passwd: 0000), and put it into the /kgs folder.
- Example of how to run pre-training process.
CUDA_VISIBLE_DEVICES=0 ./run_pretraining_stream.py
- Example of how to run NER on CMedQANER dataset.
CUDA_VISIBLE_DEVICES=0 ./run_ner_cmedqa.sh
Note that we currently force the use of a single GPU.
@inproceedings{zhang-etal-2021-smedbert,
title = "{SM}ed{BERT}: A Knowledge-Enhanced Pre-trained Language Model with Structured Semantics for Medical Text Mining",
author = "Zhang, Taolin and
Cai, Zerui and
Wang, Chengyu and
Qiu, Minghui and
Yang, Bite and
He, Xiaofeng",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.457",
doi = "10.18653/v1/2021.acl-long.457",
pages = "5882--5893"
}