docs/api/tag.rst
53 additions & 11 deletions
@@ -2,12 +2,12 @@
 
 pythainlp.tag
 =====================================
-The :class:`pythainlp.tag` contains functions that are used to tag different parts of a text including
-Part-of-Speech (POS) tags, and Named Entity Recognition (NER) tag.
+The :mod:`pythainlp.tag` module contains functions that annotate different parts of a text with linguistic and other labels, including
+part-of-speech (POS) tags and named entity (NE) tags.
 
-For the POS tags, there are two set of tags including `Universal Dependencies (UD)<https://universaldependencies.org/>`_ and ORCHID [#Sornlertlamvanich_2000]_POS tags.
+For POS tags, there are three sets of available tags: `Universal POS tags <https://universaldependencies.org/>`_, ORCHID POS tags [#Sornlertlamvanich_2000]_, and LST20 POS tags [#Prachya_2020]_.
 
-The following table shows the list of Part-of-Speech (POS) tags according to Universal Dependencies (UD) POS tags:
+The following table shows Universal POS tags as used in Universal Dependencies (UD):
@@ -93,7 +93,7 @@ Abbreviation Part-of-Speech tag Examples
 
 The ORCHID corpus uses a different set of POS tags, so we provide a UD version of the ORCHID POS tags.
 
-The following table shows the mapping of Part-of-Speech (POS) tags from ORCHID POS tags to UD POS tags:
+The following table shows the mapping of POS tags from ORCHID to UD:
 
 =============== =======================
 ORCHID POS tags Corresponding UD POS tag
@@ -161,15 +161,54 @@ PUNCT PUNCT
 PUNC            PUNCT
 =============== =======================
 
-For the NER, we use `Inside-outside-beggining (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag NER for each words.
-For instance, given a sentence "บารัค โอบามาเป็นประธานธิปดี", it would be tag the tokens "บารัค", "โอบามา", "เป็น", "ประธานาธิปดี" as "B-PERSON", "I-PERSON", "I-PERSON", "O", and "O" respectively.
+Details about LST20 POS tags are available in [#Prachya_2020]_.
 
-The *B-* prefix indicates begining token for a chunk of person name, "บารัค โอบามา" and *I-* prefix indicates the intermediate token. However, the term *O* indicates that a token not belong to any NER chunk.
+The following table shows the mapping of POS tags from LST20 to UD:
 
-The following table shows the list of Named Entity Recognition (NER) tags:
++----------------+--------------------------+
+| LST20 POS tags | Corresponding UD POS tag |
++================+==========================+
+| AJ             | ADJ                      |
++----------------+--------------------------+
+| AV             | ADV                      |
++----------------+--------------------------+
+| AX             | AUX                      |
++----------------+--------------------------+
+| CC             | CCONJ                    |
++----------------+--------------------------+
+| CL             | NOUN                     |
++----------------+--------------------------+
+| FX             | NOUN                     |
++----------------+--------------------------+
+| IJ             | INTJ                     |
++----------------+--------------------------+
+| NN             | NOUN                     |
++----------------+--------------------------+
+| NU             | NUM                      |
++----------------+--------------------------+
+| PA             | PART                     |
++----------------+--------------------------+
+| PR             | PROPN                    |
++----------------+--------------------------+
+| PS             | ADP                      |
++----------------+--------------------------+
+| PU             | PUNCT                    |
++----------------+--------------------------+
+| VV             | VERB                     |
++----------------+--------------------------+
+| XX             | X                        |
++----------------+--------------------------+
+
+For NE tags, we use the `Inside-outside-beginning (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag each word.
+
+The *B-* prefix indicates the beginning token of a chunk. The *I-* prefix indicates a token inside the chunk. *O* indicates that the token does not belong to any NE chunk.
+
+For instance, given the sentence "บารัค โอบามาเป็นประธานาธิบดี", the tokens "บารัค", "โอบามา", "เป็น", "ประธานาธิบดี" would be tagged "B-PERSON", "I-PERSON", "O", and "O" respectively.
+
+The following table shows named entity (NE) tags as used in PyThaiNLP:
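As a reading aid for the LST20-to-UD table added in this hunk, the mapping can be sketched as a plain Python dictionary. This is a minimal sketch and not PyThaiNLP's own implementation: the names `LST20_TO_UD` and `lst20_to_ud` are hypothetical, the sample tokens are illustrative, and falling back to `X` for tags missing from the table is an assumption.

```python
# Hypothetical lookup table transcribed from the LST20-to-UD mapping above.
LST20_TO_UD = {
    "AJ": "ADJ",
    "AV": "ADV",
    "AX": "AUX",
    "CC": "CCONJ",
    "CL": "NOUN",
    "FX": "NOUN",
    "IJ": "INTJ",
    "NN": "NOUN",
    "NU": "NUM",
    "PA": "PART",
    "PR": "PROPN",
    "PS": "ADP",
    "PU": "PUNCT",
    "VV": "VERB",
    "XX": "X",
}


def lst20_to_ud(tagged_tokens):
    """Convert (token, LST20 tag) pairs to (token, UD tag) pairs.

    Assumption: tags that are not in the table are mapped to "X".
    """
    return [(token, LST20_TO_UD.get(tag, "X")) for token, tag in tagged_tokens]


# Illustrative input: three tokens with LST20 tags.
sample = [("แมว", "NN"), ("กิน", "VV"), ("ปลา", "NN")]
converted = lst20_to_ud(sample)
```

Note that the mapping is many-to-one (`CL`, `FX`, and `NN` all become `NOUN`), so it cannot be inverted to recover LST20 tags from UD tags.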
Building a Thai part-of-speech tagged corpus (ORCHID).
 Journal of the Acoustical Society of Japan (E). 20. 10.1250/ast.20.189.
+.. [#Prachya_2020] Prachya Boonkwan and Vorapon Luantangsrisuk and Sitthaa Phaholphinyo and Kanyanat Kriengket and Dhanon Leenoi and Charun Phrombut and Monthika Boriboon and Krit Kosawat and Thepchai Supnithi. (2020).
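The IOB scheme described in the NE paragraphs of the previous hunk can be illustrated by grouping tagged tokens back into chunks. This is a minimal sketch in plain Python, not a PyThaiNLP API: the function name `iob_to_chunks` is hypothetical, and treating an *I-* tag that does not continue an open chunk of the same type as outside any chunk is an assumption.

```python
def iob_to_chunks(tagged_tokens):
    """Group (token, IOB tag) pairs into (entity type, [tokens]) chunks."""
    chunks = []
    current_label = None   # entity type of the chunk being built, if any
    current_tokens = []    # tokens collected for that chunk

    def flush():
        # Close the open chunk (if there is one) and reset the state.
        nonlocal current_label, current_tokens
        if current_label is not None:
            chunks.append((current_label, current_tokens))
        current_label, current_tokens = None, []

    for token, tag in tagged_tokens:
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk.
            flush()
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag of the same type continues the open chunk.
            current_tokens.append(token)
        else:
            # "O", or an I- tag with no matching open chunk (assumption).
            flush()
    flush()
    return chunks


# The example sentence from the documentation, already tokenized and tagged.
example = [
    ("บารัค", "B-PERSON"),
    ("โอบามา", "I-PERSON"),
    ("เป็น", "O"),
    ("ประธานาธิบดี", "O"),
]
chunks = iob_to_chunks(example)
```

Here `chunks` holds a single `PERSON` chunk made of the first two tokens; the two `O` tokens are left out.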