Multi-criteria Word Segmentation Pre-training with LATTE

latte-ptm-ws


Built with Python, PyTorch, Lightning, PyG, AllenNLP Light, and CUDA. Licensed under the Apache License.

LATTE: Lattice ATTentive Encoding for Character-based Word Segmentation

Architecture

  • Character-based word segmentation
  • Multi-granularity Lattice (character-word)
    • Encoded with Bidirectional-GAT
  • Pre-training and Fine-tuning methods
  • BMES tagging scheme
    • B: beginning, M: middle, E: end, and S: single
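
As an illustration of the BMES scheme listed above, the following minimal sketch converts a word-segmented sentence into one tag per character and back. The function names `words_to_bmes` and `bmes_to_words` are hypothetical and are not taken from this repository; the actual implementation may differ.

```python
def words_to_bmes(words):
    """Map a list of words to one BMES tag per character."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty strings
        if len(word) == 1:
            tags.append("S")  # single-character word
        else:
            # first character B, last character E, everything between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


def bmes_to_words(chars, tags):
    """Rebuild words from characters and their BMES tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends on E or S
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that does not end on E or S
        words.append(current)
    return words


words = ["我", "喜欢", "自然语言"]
tags = words_to_bmes(words)
print(tags)  # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
print(bmes_to_words("".join(words), tags) == words)  # True
```

Because each character receives exactly one tag, segmentation reduces to sequence labeling over characters, and the word boundaries can be recovered from the tags alone.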

Segmentation Performance (including char-bin-f1, word-f1, oov-recall)

  • CTB6 (zh):
    • word-f1: 98.1
    • oov-recall: 90.6
  • BCCWJ (ja):
    • word-f1: 99.4
    • oov-recall: 92.1
  • BEST2010 (th):
    • char-bin-f1: 99.1
    • word-f1: 97.7
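
For readers unfamiliar with these metrics, here is a minimal sketch of word-level F1 and OOV recall. It assumes the conventional definitions (a predicted word is correct only if both of its boundaries match the gold word, and OOV recall is measured over gold words absent from the training vocabulary). The function names are hypothetical, and the repository's own evaluation code may differ in details. The scores listed above are percentages, whereas this sketch returns fractions.

```python
def to_spans(words):
    """Convert a word list into (start, end) character offsets."""
    spans, start = [], 0
    for word in words:
        spans.append((start, start + len(word)))
        start += len(word)
    return spans


def word_f1(gold_sents, pred_sents):
    """F1 over word spans, accumulated across all sentences."""
    n_gold = n_pred = n_correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = set(to_spans(gold)), set(to_spans(pred))
        n_gold += len(g)
        n_pred += len(p)
        n_correct += len(g & p)  # spans with both boundaries correct
    if n_correct == 0:
        return 0.0
    precision = n_correct / n_pred
    recall = n_correct / n_gold
    return 2 * precision * recall / (precision + recall)


def oov_recall(gold_sents, pred_sents, train_vocab):
    """Share of out-of-vocabulary gold words that were segmented correctly."""
    n_oov = n_hit = 0
    for gold, pred in zip(gold_sents, pred_sents):
        p = set(to_spans(pred))
        for word, span in zip(gold, to_spans(gold)):
            if word not in train_vocab:
                n_oov += 1
                n_hit += span in p
    return n_hit / n_oov if n_oov else 0.0


gold = [["我", "喜欢", "自然语言"]]
pred = [["我", "喜欢", "自然", "语言"]]
print(round(word_f1(gold, pred), 4))   # 0.5714
print(oov_recall(gold, pred, {"我"}))  # 0.5
```

In the example, two of the three gold words are recovered (recall 2/3) out of four predicted words (precision 1/2), giving an F1 of 4/7.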

Datasets (test sets were excluded)

  • Seven Chinese datasets (converted into simplified Chinese)
    • CTB6 (main)
    • SIGHAN2005 (AS, CITYU, MSRA, PKU)
    • SIGHAN2008 (SXU)
    • CNC
  • Five Thai datasets
    • BEST2010 (main)
    • LST20
    • TNHC
    • VISTEC
    • WS160
  • Three Japanese datasets
    • BCCWJ (main)
    • UD Japanese treebank
    • Kyoto University Text Corpus

Dataset Notes

  • Place datasets in the same directory.
    • For example, data/zh/ctb6.train.sl, data/zh/as.train.sl, etc.
  • Format each dataset in sl (word-segmented sentence line).
    • In this format, each line contains one word-segmented sentence, with words separated by whitespace.
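
The sl format described above can be read with a few lines of Python. This is a minimal sketch, not the repository's loader: `parse_sl` and `read_sl_file` are hypothetical helper names, and the path in the comment simply reuses the example path given above.

```python
def parse_sl(lines):
    """Turn sl-formatted lines into a list of sentences (each a list of words)."""
    sentences = []
    for line in lines:
        words = line.split()  # split on any run of whitespace
        if words:             # skip blank lines
            sentences.append(words)
    return sentences


def read_sl_file(path):
    """Read an sl file from disk, e.g. data/zh/ctb6.train.sl."""
    with open(path, encoding="utf-8") as f:
        return parse_sl(f)


sample = ["我 喜欢 自然语言\n", "\n", "今天 天气 很 好\n"]
for words in parse_sl(sample):
    print(words)
```

Joining the words of a sentence with an empty string recovers the raw character sequence that a character-based segmenter takes as input.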

Pre-trained Models can be found at

Saved Model Directories

  • model/
    • PyTorch model files
  • pretrained/
    • Pre-trained model files
    • Ready to be loaded with the transformers library

Requirements

  • pip
    • requirements.txt
    • pip install -r requirements.txt
  • conda
    • environment.yml
    • conda env create -f environment.yml

Usage

  • See scripts/ for examples

Citation