- Character-based word segmentation
- Multi-granularity Lattice (character-word)
- Encoded with Bidirectional-GAT
- Pre-training and Fine-tuning methods
- BMES tagging scheme
- B: beginning, M: middle, E: end, and S: single
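As a quick illustration of the BMES scheme, the sketch below (helper name `to_bmes` is hypothetical, not from this repository) converts a word-segmented sentence into per-character tags:

```python
def to_bmes(words):
    """Map each character of a segmented sentence to a BMES tag.

    B: beginning of a multi-character word, M: middle, E: end,
    S: a single-character word.
    """
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


# Example: an already-segmented Chinese sentence
words = ["我", "喜欢", "自然语言"]
print(to_bmes(words))  # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
```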
- CTB6 (zh):
- word-f1: 98.1
- oov-recall: 90.6
- BCCWJ (ja):
- word-f1: 99.4
- oov-recall: 92.1
- BEST2010 (th):
- char-bin-f1: 99.1
- word-f1: 97.7
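The word-f1 figures above follow the conventional word-level F1 for segmentation: a predicted word counts as correct only if its exact character span appears in the gold segmentation. A minimal sketch of that definition (this is the standard metric, not necessarily the repository's exact evaluation script):

```python
def word_spans(words):
    """Character-offset spans (start, end) of each word in a segmentation."""
    spans, pos = [], 0
    for w in words:
        spans.append((pos, pos + len(w)))
        pos += len(w)
    return set(spans)


def word_f1(gold, pred):
    """Word-level F1: harmonic mean of span precision and span recall."""
    g, p = word_spans(gold), word_spans(pred)
    correct = len(g & p)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = ["我", "喜欢", "自然语言"]
pred = ["我", "喜欢", "自然", "语言"]
print(round(word_f1(gold, pred), 3))  # 0.571 (2 of 4 predicted words match)
```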
- Seven Chinese datasets (converted into simplified Chinese)
- CTB6 (main)
- SIGHAN2005 (AS, CITYU, MSRA, PKU)
- SIGHAN2008 (SXU)
- CNC
- Five Thai datasets
- BEST2010 (main)
- LST20
- TNHC
- VISTEC
- WS160
- Three Japanese datasets
- BCCWJ (main)
- UD Japanese treebank
- Kyoto University Text Corpus
- Place datasets in the same directory.
- For example, `data/zh/ctb6.train.sl`, `data/zh/as.train.sl`, etc.
- Format each dataset in `sl` (word-segmented sentence line) format.
- In this format, each line contains a word-segmented sentence, with words separated by white spaces.
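Parsing the `sl` format therefore reduces to splitting each non-empty line on white space. A minimal sketch (function name `parse_sl` is our own, not from this repository):

```python
def parse_sl(lines):
    """Parse sl-format lines: each line holds one word-segmented sentence,
    with words separated by white space; blank lines are skipped."""
    return [line.split() for line in lines if line.strip()]


# In practice the lines would come from a file such as data/zh/ctb6.train.sl
sample = ["我 喜欢 自然语言"]
print(parse_sl(sample))  # [['我', '喜欢', '自然语言']]
```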
- zh: https://huggingface.co/yacht/latte-mc-bert-base-chinese-ws
- ja: https://huggingface.co/yacht/latte-mc-bert-base-japanese-ws
- th: https://huggingface.co/yacht/latte-mc-bert-base-thai-ws
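The checkpoints above are hosted on the Hugging Face Hub. A minimal loading sketch, assuming the checkpoints can be loaded through the generic `Auto*` classes of the `transformers` library (the repository's own scripts may wrap the model differently; the language keys are our own labels):

```python
# Hub IDs taken from the links above.
MODEL_IDS = {
    "zh": "yacht/latte-mc-bert-base-chinese-ws",
    "ja": "yacht/latte-mc-bert-base-japanese-ws",
    "th": "yacht/latte-mc-bert-base-thai-ws",
}


def load(lang):
    """Load tokenizer and encoder for one language (downloads on first use)."""
    from transformers import AutoModel, AutoTokenizer  # imported lazily

    model_id = MODEL_IDS[lang]
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


# Example (downloads on first use):
# tokenizer, model = load("zh")
```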
- `model/`
  - PyTorch model files
- `pretrained/`
  - Pre-trained model files
  - Ready to be loaded by the `transformers` library
- pip (`requirements.txt`)
  - `pip install -r requirements.txt`
- conda (`environment.yml`)
  - `conda env create -f environment.yml`
- See `scripts/` for examples
- Published in the Journal of Natural Language Processing