Commit 44a818e

Merge pull request #464 from PyThaiNLP/add-LST20-postag
Add LST20 Part-Of-Speech tagger model
2 parents 2a853fa + 557dd7c commit 44a818e

24 files changed

+450
-263
lines changed

docs/api/tag.rst

Lines changed: 53 additions & 11 deletions
@@ -2,12 +2,12 @@
22

33
pythainlp.tag
44
=====================================
5-
The :class:`pythainlp.tag` contains functions that are used to tag different parts of a text including
6-
Part-of-Speech (POS) tags, and Named Entity Recognition (NER) tag.
5+
The :class:`pythainlp.tag` contains functions that annotate different parts of a text with linguistic and other labels, including
6+
part-of-speech (POS) tags and named entity (NE) tags.
77

8-
For the POS tags, there are two set of tags including `Universal Dependencies (UD) <https://universaldependencies.org/>`_ and ORCHID [#Sornlertlamvanich_2000]_ POS tags.
8+
For POS tags, three sets of tags are available: `Universal POS tags <https://universaldependencies.org/>`_, ORCHID POS tags [#Sornlertlamvanich_2000]_, and LST20 POS tags [#Prachya_2020]_.
99

10-
The following table shows the list of Part-of-Speech (POS) tags according to Universal Dependencies (UD) POS tags:
10+
The following table shows Universal POS tags as used in Universal Dependencies (UD):
1111

1212
============ ========================== =============================
1313
Abbreviation Part-of-Speech tag Examples
@@ -29,7 +29,7 @@ Abbreviation Part-of-Speech tag Examples
2929
VERB Verb เปิด, ให้, ใช้, เผชิญ, อ่าน
3030
============ ========================== =============================
3131

32-
The following table shows the list of Part-of-Speech (POS) tags according to ORCHID POS tags from the paper:
32+
The following table shows POS tags as used in ORCHID:
3333

3434
============ ================================================= =================================
3535
Abbreviation Part-of-Speech tag Examples
@@ -93,7 +93,7 @@ Abbreviation Part-of-Speech tag Examples
9393

9494
The ORCHID corpus uses a different set of POS tags. Thus, we provide a UD POS tag version of the ORCHID corpus.
9595

96-
The following table shows the mapping of Part-of-Speech (POS) tags from ORCHID POS tags to UD POS tags:
96+
The following table shows the mapping of POS tags from ORCHID to UD:
9797

9898
=============== =======================
9999
ORCHID POS tags Corresponding UD POS tag
@@ -161,15 +161,54 @@ PUNCT PUNCT
161161
PUNC PUNCT
162162
=============== =======================
163163

164-
For the NER, we use `Inside-outside-beggining (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag NER for each words.
165-
For instance, given a sentence "บารัค โอบามาเป็นประธานธิปดี", it would be tag the tokens "บารัค", "โอบามา", "เป็น", "ประธานาธิปดี" as "B-PERSON", "I-PERSON", "I-PERSON", "O", and "O" respectively.
164+
Details about LST20 POS tags are available in [#Prachya_2020]_.
166165

167-
The *B-* prefix indicates begining token for a chunk of person name, "บารัค โอบามา" and *I-* prefix indicates the intermediate token. However, the term *O* indicates that a token not belong to any NER chunk.
166+
The following table shows the mapping of POS tags from LST20 to UD:
168167

169-
The following table shows the list of Named Entity Recognition (NER) tags:
168+
+----------------+-------------------------+
169+
| LST20 POS tags | Corresponding UD POS tag |
170+
+================+=========================+
171+
| AJ | ADJ |
172+
+----------------+-------------------------+
173+
| AV | ADV |
174+
+----------------+-------------------------+
175+
| AX | AUX |
176+
+----------------+-------------------------+
177+
| CC | CCONJ |
178+
+----------------+-------------------------+
179+
| CL | NOUN |
180+
+----------------+-------------------------+
181+
| FX | NOUN |
182+
+----------------+-------------------------+
183+
| IJ | INTJ |
184+
+----------------+-------------------------+
185+
| NN | NOUN |
186+
+----------------+-------------------------+
187+
| NU | NUM |
188+
+----------------+-------------------------+
189+
| PA | PART |
190+
+----------------+-------------------------+
191+
| PR | PROPN |
192+
+----------------+-------------------------+
193+
| PS | ADP |
194+
+----------------+-------------------------+
195+
| PU | PUNCT |
196+
+----------------+-------------------------+
197+
| VV | VERB |
198+
+----------------+-------------------------+
199+
| XX | X |
200+
+----------------+-------------------------+
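As a rough illustration, the LST20-to-UD mapping in the table above can be expressed as a plain Python dictionary. The names `LST20_TO_UD` and `lst20_to_ud` are hypothetical and are not part of the PyThaiNLP API, and falling back to `X` for tags missing from the table is an assumption.

```python
# Mapping copied from the LST20-to-UD table above.
LST20_TO_UD = {
    "AJ": "ADJ",
    "AV": "ADV",
    "AX": "AUX",
    "CC": "CCONJ",
    "CL": "NOUN",
    "FX": "NOUN",
    "IJ": "INTJ",
    "NN": "NOUN",
    "NU": "NUM",
    "PA": "PART",
    "PR": "PROPN",
    "PS": "ADP",
    "PU": "PUNCT",
    "VV": "VERB",
    "XX": "X",
}


def lst20_to_ud(tagged_words, default="X"):
    """Convert (word, LST20 tag) pairs to (word, UD tag) pairs.

    Tags that do not appear in the table are mapped to `default`
    (an assumption made for this sketch).
    """
    return [(word, LST20_TO_UD.get(tag, default)) for word, tag in tagged_words]


# Hypothetical input: words that have already been tagged with LST20 tags.
print(lst20_to_ud([("แมว", "NN"), ("กิน", "VV"), ("ปลา", "NN")]))
```

Note that the mapping is many-to-one (`CL`, `FX`, and `NN` all become `NOUN`), so it cannot be inverted to recover LST20 tags from UD tags.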
201+
202+
For named entities, we use the `Inside-outside-beginning (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag each word.
203+
204+
The *B-* prefix indicates the beginning token of a chunk. The *I-* prefix indicates a token inside the chunk. *O* indicates that the token does not belong to any NE chunk.
205+
206+
For instance, given the sentence "บารัค โอบามาเป็นประธานาธิบดี", the tokens "บารัค", "โอบามา", "เป็น", and "ประธานาธิบดี" would be tagged "B-PERSON", "I-PERSON", "O", and "O", respectively.
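The IOB scheme described here can be decoded back into entity chunks with a short loop. The following is a minimal sketch, not PyThaiNLP code: the function name `iob_to_chunks` is hypothetical, and treating a stray *I-* tag (one without a matching preceding *B-* or *I-* of the same type) as outside any chunk is an assumption.

```python
def iob_to_chunks(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, [tokens]) chunks."""
    chunks = []
    current_type = None
    current_tokens = []

    def flush():
        # Store the chunk collected so far, if any.
        if current_tokens:
            chunks.append((current_type, list(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always closes the previous chunk and opens a new one.
            flush()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type continues the open chunk.
            current_tokens.append(token)
        else:
            # "O" or a stray I- tag (assumption): close any open chunk.
            flush()
            current_type = None
            current_tokens = []
    flush()
    return chunks


tokens = ["บารัค", "โอบามา", "เป็น", "ประธานาธิบดี"]
tags = ["B-PERSON", "I-PERSON", "O", "O"]
print(iob_to_chunks(tokens, tags))
```

For the tags above, the two tokens tagged `B-PERSON` and `I-PERSON` end up in a single `PERSON` chunk, and the two `O` tokens are dropped.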
207+
208+
The following table shows named entity (NE) tags as used in PyThaiNLP:
170209

171210
============================ =================================
172-
Named Entity Recognition tag Examples
211+
Named Entity tag Examples
173212
============================ =================================
174213
DATE 2/21/2004, 16 ก.พ., จันทร์
175214
TIME 16.30 น., 5 วัน, 1-3 ปี
@@ -214,3 +253,6 @@ References
214253
.. [#Sornlertlamvanich_2000] Takahashi, Naoto & Isahara, Hitoshi & Sornlertlamvanich, Virach. (2000).
215254
Building a Thai part-of-speech tagged corpus (ORCHID).
216255
Journal of the Acoustical Society of Japan (E). 20. 10.1250/ast.20.189.
256+
.. [#Prachya_2020] Prachya Boonkwan and Vorapon Luantangsrisuk and Sitthaa Phaholphinyo and Kanyanat Kriengket and Dhanon Leenoi and Charun Phrombut and Monthika Boriboon and Krit Kosawat and Thepchai Supnithi. (2020).
257+
The Annotation Guideline of LST20 Corpus.
258+
arXiv:2008.05055

pythainlp/corpus/corpus_license.md

Lines changed: 5 additions & 5 deletions
@@ -22,7 +22,7 @@ thailand_provinces_th.txt | List of Thailand provinces in Thai
2222
tnc_freq.txt | Words and their frequencies, from Thai National Corpus
2323
ttc_freq.txt | Words and their frequencies, from Thai Textbook Corpus
2424
words_th.txt | List of Thai words
25-
words_th_thai2fit_201810.txt | List of Thai words (frozen)
25+
words_th_thai2fit_201810.txt | List of Thai words (frozen for thai2fit)
2626

2727
The following word lists are from **Thai Male and Female Names Corpus**
2828
https://github.com/korkeatw/thai-names-corpus/ by Korkeat Wannapat
@@ -46,14 +46,14 @@ https://creativecommons.org/licenses/by/4.0/
4646

4747
Filename | Description
4848
---------|------------
49-
sentenceseg-crfcut-v2.model | Sentence segmentation model
50-
ud_thai_pud_pt_tagger.pkl | Part-of-speech model
51-
ud_thai_pud_unigram_tagger.json | Part-of-speech model
49+
sentenceseg_crfcut.model | Sentence segmentation model
50+
pos_ud_perceptron.pkl | Part-of-speech tagging model
51+
pos_ud_unigram.json | Part-of-speech tagging model
5252

5353

5454
## Thai WordNet
5555

56-
Thai WordNet (tha-wn.db) is created by Thai Computational Linguistic
56+
Thai WordNet (wordnet_th.db) is created by Thai Computational Linguistic
5757
Laboratory at National Institute of Information and Communications
5858
Technology (NICT), Japan, and released under the following license:
5959

pythainlp/corpus/orchid_pos_th.json

Lines changed: 0 additions & 1 deletion
This file was deleted.
File renamed without changes.

pythainlp/corpus/pos_orchid_unigram.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
File renamed without changes.

pythainlp/corpus/pos_ud_unigram.json

Lines changed: 1 addition & 0 deletions
Large diffs are not rendered by default.
File renamed without changes.

pythainlp/corpus/tnc_freq.txt

Lines changed: 0 additions & 1 deletion
@@ -79584,7 +79584,6 @@ Wishbone 1
7958479584
Rubik 1
7958579585
Petesch 1
7958679586
Consider 1
79587-
assertiveperson 1
7958879587
bait 1
7958979588
บรอง 1
7959079589
Elsevier 1

pythainlp/corpus/ttc_freq.txt

Lines changed: 0 additions & 1 deletion
@@ -19409,7 +19409,6 @@
1940919409
โฮสต์ 1
1941019410
โฮเต็ลๆ 1
1941119411
โฮ้งๆ 1
19412-
โฮ้สฺเต้ส์ 1
1941319412
ใจขุ่น 1
1941419413
ใจป้ำ 1
1941519414
ใจมือ 1
