- Character-based word segmentation
- Multi-granularity Lattice (character-word)
- Encoded with Bidirectional-GAT
- Pre-training and Fine-tuning methods
- BMES tagging scheme
- B: beginning, M: middle, E: end, and S: single
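As a quick illustration of the BMES scheme, the sketch below (helper name `to_bmes` is hypothetical, not from this repository) converts a word-segmented sentence into per-character tags:

```python
def to_bmes(words):
    """Map each character of a segmented sentence to a BMES tag.

    B: beginning of a multi-character word, M: middle, E: end,
    S: a single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Example: an already-segmented Chinese sentence
words = ["我", "喜欢", "自然语言"]
print(to_bmes(words))  # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
```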
- CTB6 (zh):
- word-f1: 98.1
- oov-recall: 90.6
- BCCWJ (ja):
- word-f1: 99.4
- oov-recall: 92.1
- BEST2010 (th):
- char-bin-f1: 99.1
- word-f1: 97.7
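The word-f1 figures above follow the conventional word-level F1 for segmentation: a predicted word counts as correct only if its exact character span appears in the gold segmentation. A minimal sketch of that definition (this is the standard metric, not necessarily the repository's exact evaluation script):

```python
def word_spans(words):
    """Character-offset spans (start, end) of each word in a segmentation."""
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w)
    return set(spans)


def word_f1(gold, pred):
    """Word-level F1: harmonic mean of span precision and span recall."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["我", "喜欢", "自然语言"]
pred = ["我", "喜欢", "自然", "语言"]
print(round(word_f1(gold, pred), 3))  # 0.571 (2 of 4 predicted words match)
```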
- Seven Chinese datasets (converted into simplified Chinese)
- CTB6 (main)
- SIGHAN2005 (AS, CITYU, MSRA, PKU)
- SIGHAN2008 (SXU)
- CNC
- Five Thai datasets
- BEST2010 (main)
- LST20
- TNHC
- VISTEC
- WS160
- Three Japanese datasets
- BCCWJ (main)
- UD Japanese treebank
- Kyoto University Text Corpus
- Place datasets in the same directory.
- For example, `data/zh/ctb6.train.sl`, `data/zh/as.train.sl`, etc.
- Format each dataset in `sl` (word-segmented sentence line) format.
- In this format, each line contains a word-segmented sentence, with words separated by white spaces.
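Parsing the `sl` format therefore reduces to splitting each non-empty line on white space. A minimal sketch (function name `parse_sl` is our own, not from this repository):

```python
def parse_sl(lines):
    """Parse sl-format lines: each line holds one word-segmented sentence,
    with words separated by white space; blank lines are skipped."""
    return [line.split() for line in lines if line.strip()]


# In practice the lines would come from a file such as data/zh/ctb6.train.sl
sample = ["我 喜欢 自然语言"]
print(parse_sl(sample))  # [['我', '喜欢', '自然语言']]
```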
- zh: https://huggingface.co/yacht/latte-mc-bert-base-chinese-ws
- ja: https://huggingface.co/yacht/latte-mc-bert-base-japanese-ws
- th: https://huggingface.co/yacht/latte-mc-bert-base-thai-ws
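The checkpoints above are hosted on the Hugging Face Hub. A minimal loading sketch, assuming the checkpoints can be loaded through the generic `Auto*` classes of the `transformers` library (the repository's own scripts may wrap the model differently; the language keys are our own labels):

```python
# Hub IDs taken from the links above.
MODEL_IDS = {
    "zh": "yacht/latte-mc-bert-base-chinese-ws",
    "ja": "yacht/latte-mc-bert-base-japanese-ws",
    "th": "yacht/latte-mc-bert-base-thai-ws",
}


def load(lang):
    """Load tokenizer and encoder for one language (downloads on first use)."""
    from transformers import AutoModel, AutoTokenizer  # imported lazily

    model_id = MODEL_IDS[lang]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


# Example (downloads on first use):
# tokenizer, model = load("zh")
```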
- `model/`
  - PyTorch model files
- `pretrained/`
  - Pre-trained model files
  - Ready to be loaded by the `transformers` library
- pip (`requirements.txt`)
  - `pip install -r requirements.txt`
- conda (`environment.yml`)
  - `conda env create -f environment.yml`
- See `scripts/` for examples
- Published in the Journal of Natural Language Processing