
Open-source code for our paper at Findings of EMNLP-2021: "Segmenting Natural Language Sentences via Lexical Unit Analysis".


LeePleased/LUA


An implementation of Lexical Unit Analysis (LUA) for sequence segmentation tasks (e.g., Chinese POS Tagging). Note that this is not an officially supported Tencent product.

Preparation

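Preparation takes two steps. First, convert the chunking data sets into the format shown here and place them in a new folder named "dataset". The folder contains {train, dev, test}.json. Each JSON file is a list of dicts whose values are strings holding Python literals: "sentence" is the token list, and "labeled entities" is a list of (start, end, label) spans with inclusive token indices. See the following NER case: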

[ 
 {
  "sentence": "['Somerset', '83', 'and', '174', '(', 'P.', 'Simmons']",
  "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'O'), (5, 6, 'PER')]"
 },
 {
  "sentence": "['Leicestershire', '22', 'points', ',', 'Somerset', '4', '.']",
  "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'ORG'), (5, 5, 'O'), (6, 6, 'O')]"
 }
]
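A data set that is already BIO-tagged can be converted into this format with a short script. The following is a minimal sketch, not part of the repository: the helper names (bio_to_spans, to_record, check_record) are hypothetical, and it assumes, based only on the example above, that the spans cover the sentence contiguously from left to right and that "O" tokens become single-token spans.

```python
import ast
import json


def bio_to_spans(tags):
    """Convert BIO tags into inclusive (start, end, label) spans.

    Assumption: every "O" token becomes its own single-token span,
    as in the README example.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open span unless this tag continues it (I- with same label).
        if start is not None and not (tag.startswith("I-") and tag[2:] == label):
            spans.append((start, i - 1, label))
            start, label = None, None
        if tag == "O":
            spans.append((i, i, "O"))
        elif start is None:
            # A B- tag (or a stray I- tag) opens a new span.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags) - 1, label))
    return spans


def to_record(tokens, tags):
    """Build one dict; both values are stored as Python-literal strings."""
    return {
        "sentence": str(list(tokens)),
        "labeled entities": str(bio_to_spans(tags)),
    }


def check_record(record):
    """Return True if the spans cover the sentence contiguously, in order."""
    tokens = ast.literal_eval(record["sentence"])
    spans = ast.literal_eval(record["labeled entities"])
    expected = 0
    for start, end, _label in spans:
        if start != expected or end < start:
            return False
        expected = end + 1
    return expected == len(tokens)


if __name__ == "__main__":
    tokens = ["Somerset", "83", "and", "174", "(", "P.", "Simmons"]
    tags = ["B-ORG", "O", "O", "O", "O", "B-PER", "I-PER"]
    record = to_record(tokens, tags)
    assert check_record(record)
    # A list of such records is what each of {train, dev, test}.json holds.
    print(json.dumps([record], indent=1))
```

Running check_record over every entry before training is a cheap way to catch spans whose indices do not line up with the token list.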

Second, prepare a pretrained LM (i.e., BERT) and the evaluation script. Create another directory, "resource", with the following arrangement:

  • resource
    • pretrained_lm
      • model.pt
      • vocab.txt
    • conlleval.pl

For Chinese tasks, "pretrained_lm" is built from bert-base-chinese.
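Before launching a run, it can help to confirm that the "resource" directory matches the arrangement listed above. The following is a small sketch using only the standard library; the function name missing_resources is hypothetical and the script is not part of the repository.

```python
from pathlib import Path

# Files expected under the resource directory, taken from the layout above.
EXPECTED_FILES = [
    "pretrained_lm/model.pt",
    "pretrained_lm/vocab.txt",
    "conlleval.pl",
]


def missing_resources(root="resource"):
    """Return the expected files that are absent under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_resources()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Resource directory looks complete.")
```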

Training and Test

CUDA_VISIBLE_DEVICES=0 python main.py -dd dataset -sd dump -rd resource

Citation

@inproceedings{li-etal-2021-segmenting-natural,
    title = "Segmenting Natural Language Sentences via Lexical Unit Analysis",
    author = "Li, Yangming  and  Liu, Lemao  and  Shi, Shuming",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.18",
    doi = "10.18653/v1/2021.findings-emnlp.18",
    pages = "181--187",
}
