Merged
87 commits
0cbf3ce
Add provinces_all
wannaphong Jul 18, 2020
6cfb947
Update provinces
wannaphong Jul 19, 2020
1fa019e
Update corpus.rst
bact Jul 20, 2020
80e4d8b
Update common.py
wannaphong Jul 23, 2020
362eb2d
Update common.py
wannaphong Jul 23, 2020
88a9488
helper function tone marks and above vowels
p16i Aug 8, 2020
506c1ca
Add pythainlp.util.display_thai_char docs
wannaphong Aug 8, 2020
90d8ed6
fix lint
p16i Aug 8, 2020
7e9852d
Merge branch 'display-thai-char' of github.com:PyThaiNLP/pythainlp in…
p16i Aug 8, 2020
af5c4b9
Fix PyTorch Install for GitHub Actions
wannaphong Aug 8, 2020
e6e01c5
Merge pull request #463 from PyThaiNLP/display-thai-char
p16i Aug 8, 2020
90436b6
Add LST20 postag model
wannaphong Aug 11, 2020
286e5b5
Add lst20_tag_signs and lst20_tag_to_text
wannaphong Aug 12, 2020
60ece5b
Add lst20_ud
wannaphong Aug 12, 2020
008f1c1
Add pos_tag_sents docs
wannaphong Aug 12, 2020
313aa92
Merge pull request #458 from PyThaiNLP/Add-provinces
wannaphong Aug 12, 2020
baba302
Fix type hinting, clean code, remove thailand_provinces_th.txt
bact Aug 13, 2020
8b4beac
Fix PEP8
bact Aug 13, 2020
7e36766
Add docs
wannaphong Aug 14, 2020
9f58ea7
Update tag.rst
wannaphong Aug 14, 2020
d39f7b0
Update tag.rst
wannaphong Aug 14, 2020
67393f5
Update tag.rst
wannaphong Aug 14, 2020
2a853fa
Merge pull request #466 from PyThaiNLP/Add-provinces
bact Aug 16, 2020
46b7416
Update tag.rst
bact Aug 17, 2020
be8f432
Update pos_tag.py
bact Aug 17, 2020
37bcfe4
Update perceptron.py
bact Aug 17, 2020
0841183
Fix typo and format code
bact Aug 17, 2020
c42b1c9
Edit function docstring
bact Aug 17, 2020
ef2e39e
Use list comprehension
bact Aug 17, 2020
7e18f2c
Add test cases
bact Aug 17, 2020
cdf8a78
Use list comprehension
bact Aug 17, 2020
ce7738c
More test cases for corpus.core
bact Aug 18, 2020
cbfe40a
Fix test requests case
bact Aug 18, 2020
9e46931
Test get_corpus_path() with non-existing corpus name
bact Aug 18, 2020
8c25f46
Simplify pos_tag()
bact Aug 18, 2020
61d301a
Refactor, move tagger related functions/constants from the corpus sub…
bact Aug 18, 2020
b480568
Fix tagger filename
bact Aug 18, 2020
9b3b1bd
Clean unigram pos data, minify json, rename corpus filenames
bact Aug 18, 2020
339aa0b
Update model names
bact Aug 18, 2020
10c1e82
Update word lists
bact Aug 18, 2020
e59def9
Add test cases for _ud
bact Aug 19, 2020
3422453
Fix PEP8
bact Aug 19, 2020
39941dc
Refactor
bact Aug 19, 2020
e56aa62
Refactor
bact Aug 19, 2020
557dd7c
Improve docstring
bact Aug 20, 2020
44a818e
Merge pull request #464 from PyThaiNLP/add-LST20-postag
bact Aug 20, 2020
d953aea
Port PerceptronTagger to PyThaiNLP
wannaphong Aug 21, 2020
697f264
Update _tag_perceptron.py
wannaphong Aug 21, 2020
1480d3b
Update _tag_perceptron.py
wannaphong Aug 21, 2020
3a97188
rename perceptron model
wannaphong Aug 21, 2020
73e0bbd
Update perceptron.py
wannaphong Aug 21, 2020
025d48b
PEP8
bact Aug 21, 2020
fc96873
Add some type hintings
bact Aug 21, 2020
423863b
Update _tag_perceptron.py
wannaphong Aug 22, 2020
199077e
Fix PEP8
wannaphong Aug 22, 2020
d5264be
Update _tag_perceptron.py
wannaphong Aug 22, 2020
eea66a4
More type hinting, sort imports
bact Aug 22, 2020
910faab
Add test for PerceptronTagger()
bact Aug 22, 2020
c7e8146
Remove logging
bact Aug 22, 2020
ee0f5e5
Delete unused code
wannaphong Aug 23, 2020
83eee2e
Add test case for train(save_loc=)
bact Aug 23, 2020
aad2507
Add test case load model file that does not exist
bact Aug 23, 2020
9cc4dad
Add MIT License information for _tag_perceptron.py
bact Aug 23, 2020
89d4282
Fix PEP8, add corpus details
bact Aug 23, 2020
f98d18c
Merge pull request #470 from PyThaiNLP/add-perceptron-tagger
bact Aug 25, 2020
9341a8d
Fix typo in README.md
krissdap Aug 26, 2020
0b2927f
Update README.md
bact Aug 26, 2020
7c4bd52
Merge pull request #472 from krissdap/dev
wannaphong Aug 31, 2020
38f8089
Update README.md
bact Sep 10, 2020
579da03
Update README.md
bact Sep 10, 2020
5b8ad72
Update README.md
bact Sep 10, 2020
04ff603
Add family names
bact Sep 12, 2020
d3a7bee
Merge pull request #476 from PyThaiNLP/remotes/origin/add-family-names
bact Sep 12, 2020
2e8933a
Update tag.rst
wannaphong Sep 14, 2020
261f032
Merge pull request #478 from PyThaiNLP/update-postag-docs (build and …
wannaphong Sep 14, 2020
830edab
Fix remove_repeat_vowels() bug that remove spaces between vowel
bact Sep 17, 2020
bdafa57
Merge pull request #481 from PyThaiNLP/hotfix-bug-normalize
bact Sep 17, 2020
bb0c1a4
Bump version: 2.2.3 → 2.2.4-dev0
bact Sep 17, 2020
ced2e3d
Bump version: 2.2.4-dev0 → 2.2.4-beta0
bact Sep 17, 2020
4d00fc4
Bump version: 2.2.4-beta0 → 2.2.4
bact Sep 17, 2020
7dd3cda
Update perceptron.py
wannaphong Sep 17, 2020
ab5a0e7
Merge branch 'dev' of https://github.com/PyThaiNLP/pythainlp into dev
wannaphong Sep 17, 2020
19c68be
Update lst20_tagger version
wannaphong Sep 17, 2020
45e98a3
Update perceptron.py
wannaphong Sep 17, 2020
64bb22c
Merge branch '2.2' into dev
wannaphong Sep 17, 2020
92e97cc
Update PyThaiNLP Version in README
wannaphong Sep 17, 2020
008470a
Merge branch 'dev' of https://github.com/PyThaiNLP/pythainlp into dev
wannaphong Sep 17, 2020
3 changes: 2 additions & 1 deletion .github/workflows/pythainlp-test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ jobs:
run: |
python -m pip install --upgrade pip pytest wheel flake8
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
pip install torch==1.5.0 torchvision==0.6.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install .[full]
pip install deepcut coverage coveralls
- name: Lint with flake8
Expand All @@ -34,4 +35,4 @@ jobs:
- name: Test
run: |
coverage run -m unittest discover
CI_BRANCH=${GITHUB_REF#"ref/heads"} COVERALLS_REPO_TOKEN=${{ secrets.COVERALLS_REPO_TOKEN }} coveralls
CI_BRANCH=${GITHUB_REF#"ref/heads"} COVERALLS_REPO_TOKEN=${{ secrets.COVERALLS_REPO_TOKEN }} coveralls
33 changes: 24 additions & 9 deletions README.md
@@ -1,4 +1,3 @@

<div align="center">
<img src="https://avatars0.githubusercontent.com/u/32934255?s=200&v=4"/>
<h1>PyThaiNLP: Thai Natural Language Processing in Python</h1>
Expand All @@ -24,11 +23,12 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนสำหร

| Version | Description | Status |
|:------:|:--:|:------:|
| [2.2.3](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/330) |
| [2.2.4](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/330) |
| [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 2.3 | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/445) |

Please follow our [PyThaiNLP Facebook page](https://www.facebook.com/pythainlp/) for more updates.


## Getting Started with PyThaiNLP

We provide [PyThaiNLP Get Started Tutorial](https://www.thainlp.org/pythainlp/tutorials/notebooks/pythainlp_get_started.html) for exploring features in PyThaiNLP; We also have tutorials for specific tasks. Please visit [our tutorial page](https://www.thainlp.org/pythainlp/tutorials).
Expand All @@ -37,27 +37,29 @@ Latest document is available at [https://thainlp.org/pythainlp/docs/2.2/](https:

We try to make the package as easy to use as possible; therefore, some additional data (like word lists and language models) may be downloaded automatically at runtime. PyThaiNLP caches additional data under the directory `~/pythainlp-data` by default, but the user can change the location by setting the environment variable `PYTHAINLP_DATA_DIR`. See the corpus catalog at [PyThaiNLP/pythainlp-corpus](https://github.com/PyThaiNLP/pythainlp-corpus).
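
The lookup order described above (environment variable first, then the default directory) can be sketched as follows. This is only an illustration: `resolve_data_dir` is a hypothetical helper written for this example and is not claimed to be PyThaiNLP's own implementation.

```python
import os
from typing import Mapping, Optional


def resolve_data_dir(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the data directory: PYTHAINLP_DATA_DIR if set, else ~/pythainlp-data.

    Hypothetical helper for illustration only; not PyThaiNLP's API.
    """
    # Fall back to the real process environment when no mapping is given.
    env = os.environ if env is None else env
    default_dir = os.path.join(os.path.expanduser("~"), "pythainlp-data")
    return env.get("PYTHAINLP_DATA_DIR", default_dir)


# The variable overrides the default when it is present.
custom = resolve_data_dir({"PYTHAINLP_DATA_DIR": "/tmp/thai-data"})
# Without the variable, the default directory under the home folder is used.
fallback = resolve_data_dir({})
```

Passing the environment as a mapping keeps the sketch easy to test without modifying the real process environment.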


## Capabilities

PyThaiNLP provides standard NLP functions for Thai, for example part-of-speec tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via command-line interface.
PyThaiNLP provides standard NLP functions for Thai, for example part-of-speech tagging, linguistic unit segmentation (syllable, word, or sentence). Some of these functions are also available via command-line interface.

<details>
<summary>List of Features</summary>

- Convenient character and word classes, like Thai consonants (`pythainlp.thai_consonants`), vowels (`pythainlp.thai_vowels`), digits (`pythainlp.thai_digits`), and stop words (`pythainlp.corpus.thai_stopwords`) -- comparable to constants like `string.letters`, `string.digits`, and `string.punctuation`
- Thai linguistic unit segmentation/tokenization, including sentence (`sent_tokenize`), word (`word_tokenize`), and subword segmentations based on Thai Character Cluster (`subword_tokenize`)
- Thai part-of-speech taggers (`pos_tag`)
- Thai part-of-speech tagging (`pos_tag`)
- Thai spelling suggestion and correction (`spell` and `correct`)
- Thai transliteration (`transliterate`)
- Thai soundex (`soundex`) with three engines (`lk82`, `udom83`, `metasound`)
- Thai collation (sort by dictionoary order) (`collate`)
- Thai collation (sort by dictionary order) (`collate`)
- Read out number to Thai words (`bahttext`, `num_to_thaiword`)
- Thai datetime formatting (`thai_strftime`)
- Thai-English keyboard misswitched fix (`eng_to_thai`, `thai_to_eng`)
- Command-line interface for basic functions, like tokenization and pos tagging (run `thainlp` in your shell)
</details>

Please see [our tutorials](https://www.thainlp.org/pythainlp/tutorials) on how to apply these functions to ML problems.
Please see [our tutorials](https://www.thainlp.org/pythainlp/tutorials) on how to apply these functions to machine-learning problems.


## Installation

Expand All @@ -66,7 +68,7 @@ pip install --upgrade pythainlp
```

This will install the latest stable release of PyThaiNLP.
PyThaiNLP uses pip as its package manger and PyPI as its main distribution channel, see [https://pypi.org/project/pythainlp/](https://pypi.org/project/pythainlp/)
PyThaiNLP uses pip as its package manager and PyPI as its main distribution channel, see [https://pypi.org/project/pythainlp/](https://pypi.org/project/pythainlp/)

Install different releases:

Expand Down Expand Up @@ -99,9 +101,9 @@ pip install pythainlp[extra1,extra2,...]
For dependency details, look at `extras` variable in [`setup.py`](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py).


## Command-line
## Command-Line Interface

Some of PyThaiNLP functionalities can be used at command line, using `thainlp`
Some PyThaiNLP functionality can be used at the command line with the `thainlp` command.

For example, displaying a catalog of datasets:
```sh
Expand All @@ -121,6 +123,7 @@ thainlp help
- [Upgrade ThaiNER from 1.7](https://github.com/PyThaiNLP/pythainlp/wiki/Upgrade-ThaiNER-from-PyThaiNLP-1.7-to-PyThaiNLP-2.0)
- Python 2.7 users can use PyThaiNLP 1.6


## Citations

If you use `PyThaiNLP` in your project or publication, please cite the library as follows
Expand Down Expand Up @@ -148,6 +151,7 @@ or BibTeX entry:
- Please do fork and create a pull request :)
- For style guide and other information, including references to algorithms we use, please refer to our [contributing](https://github.com/PyThaiNLP/pythainlp/blob/dev/CONTRIBUTING.md) page.


## Licenses

| | License |
Expand All @@ -157,6 +161,7 @@ or BibTeX entry:
| Language models created by PyThaiNLP | [Creative Commons Attribution 4.0 International Public License (CC-by)](https://creativecommons.org/licenses/by/4.0/) |
| Other corpora and models that may be included with PyThaiNLP | See [Corpus License](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/corpus_license.md) |


## Sponsors

[![VISTEC-depa Thailand Artificial Intelligence Research Institute](https://airesearch.in.th/assets/img/logo/airesearch-logo.svg)](https://airesearch.in.th/)
Expand All @@ -168,3 +173,13 @@ Since 2019, our contributors Korakot Chaovavanich and Lalita Lowphansirikul have
<div align="center">
Made with ❤️ | PyThaiNLP Team 💻 | "We build Thai NLP" 🇹🇭
</div>

------

<div align="center">
<strong>We have only one official repository at https://github.com/PyThaiNLP/pythainlp and another mirror at https://gitlab.com/pythainlp/pythainlp </strong>
</div>

<div align="center">
<strong>Beware of malware if you use code from mirrors other than the official two at GitHub and GitLab.</strong>
</div>
2 changes: 1 addition & 1 deletion README_TH.md
Expand Up @@ -22,7 +22,7 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนสำหร

| รุ่น | คำอธิบาย | สถานะ |
|:------:|:--:|:------:|
| [2.2.3](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/330) |
| [2.2.4](https://github.com/PyThaiNLP/pythainlp/releases) | Stable | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/330) |
| [`dev`](https://github.com/PyThaiNLP/pythainlp/tree/dev) | Release Candidate for 2.3 | [Change Log](https://github.com/PyThaiNLP/pythainlp/issues/445) |

ติดตามพวกเราบน [PyThaiNLP Facebook page](https://www.facebook.com/pythainlp/) เพื่อรับข่าวสารเพิ่มเติม
17 changes: 9 additions & 8 deletions docs/api/corpus.rst
Expand Up @@ -7,20 +7,21 @@ The :class:`pythainlp.corpus` provides access to corpus that comes with PyThaiNL
Modules
-------

.. autofunction:: countries
.. autofunction:: get_corpus
.. autofunction:: get_corpus_db
.. autofunction:: get_corpus_db_detail
.. autofunction:: get_corpus_path
.. autofunction:: download
.. autofunction:: remove
.. autofunction:: pythainlp.corpus.common.countries
.. autofunction:: pythainlp.corpus.common.provinces
.. autofunction:: pythainlp.corpus.common.thai_stopwords
.. autofunction:: pythainlp.corpus.common.thai_words
.. autofunction:: pythainlp.corpus.common.thai_syllables
.. autofunction:: pythainlp.corpus.common.thai_negations
.. autofunction:: pythainlp.corpus.common.thai_female_names
.. autofunction:: pythainlp.corpus.common.thai_male_names
.. autofunction:: provinces
.. autofunction:: thai_stopwords
.. autofunction:: thai_words
.. autofunction:: thai_syllables
.. autofunction:: thai_negations
.. autofunction:: thai_family_names
.. autofunction:: thai_female_names
.. autofunction:: thai_male_names
.. autofunction:: pythainlp.corpus.conceptnet.edges

TNC
64 changes: 53 additions & 11 deletions docs/api/tag.rst
Expand Up @@ -2,12 +2,12 @@

pythainlp.tag
=====================================
The :class:`pythainlp.tag` contains functions that are used to tag different parts of a text including
Part-of-Speech (POS) tags, and Named Entity Recognition (NER) tag.
The :class:`pythainlp.tag` contains functions that add linguistic and other annotations to different parts of a text,
including part-of-speech (POS) tags and named entity (NE) tags.

For the POS tags, there are two set of tags including `Universal Dependencies (UD) <https://universaldependencies.org/>`_ and ORCHID [#Sornlertlamvanich_2000]_ POS tags.
For POS tags, there are three sets of available tags: `Universal POS tags <https://universaldependencies.org/>`_, ORCHID POS tags [#Sornlertlamvanich_2000]_, and LST20 POS tags [#Prachya_2020]_.

The following table shows the list of Part-of-Speech (POS) tags according to Universal Dependencies (UD) POS tags:
The following table shows Universal POS tags as used in Universal Dependencies (UD):

============ ========================== =============================
Abbreviation Part-of-Speech tag Examples
Expand All @@ -29,7 +29,7 @@ Abbreviation Part-of-Speech tag Examples
VERB Verb เปิด, ให้, ใช้, เผชิญ, อ่าน
============ ========================== =============================

The following table shows the list of Part-of-Speech (POS) tags according to ORCHID POS tags from the paper:
The following table shows POS tags as used in ORCHID:

============ ================================================= =================================
Abbreviation Part-of-Speech tag Examples
Expand Down Expand Up @@ -93,7 +93,7 @@ Abbreviation Part-of-Speech tag Examples

ORCHID corpus uses different set of POS tags. Thus, we make UD POS tags version for ORCHID corpus.

The following table shows the mapping of Part-of-Speech (POS) tags from ORCHID POS tags to UD POS tags:
The following table shows the mapping of POS tags from ORCHID to UD:

=============== =======================
ORCHID POS tags Corresponding UD POS tag
Expand Down Expand Up @@ -161,15 +161,54 @@ PUNCT PUNCT
PUNC PUNCT
=============== =======================

For the NER, we use `Inside-outside-beggining (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag NER for each words.
For instance, given a sentence "บารัค โอบามาเป็นประธานธิปดี", it would be tag the tokens "บารัค", "โอบามา", "เป็น", "ประธานาธิปดี" as "B-PERSON", "I-PERSON", "I-PERSON", "O", and "O" respectively.
Details about LST20 POS tags are available in [#Prachya_2020]_.

The *B-* prefix indicates begining token for a chunk of person name, "บารัค โอบามา" and *I-* prefix indicates the intermediate token. However, the term *O* indicates that a token not belong to any NER chunk.
The following table shows the mapping of POS tags from LST20 to UD:

The following table shows the list of Named Entity Recognition (NER) tags:
+----------------+--------------------------+
| LST20 POS tags | Corresponding UD POS tag |
+================+==========================+
| AJ             | ADJ                      |
+----------------+--------------------------+
| AV             | ADV                      |
+----------------+--------------------------+
| AX             | AUX                      |
+----------------+--------------------------+
| CC             | CCONJ                    |
+----------------+--------------------------+
| CL             | NOUN                     |
+----------------+--------------------------+
| FX             | NOUN                     |
+----------------+--------------------------+
| IJ             | INTJ                     |
+----------------+--------------------------+
| NN             | NOUN                     |
+----------------+--------------------------+
| NU             | NUM                      |
+----------------+--------------------------+
| PA             | PART                     |
+----------------+--------------------------+
| PR             | PROPN                    |
+----------------+--------------------------+
| PS             | ADP                      |
+----------------+--------------------------+
| PU             | PUNCT                    |
+----------------+--------------------------+
| VV             | VERB                     |
+----------------+--------------------------+
| XX             | X                        |
+----------------+--------------------------+
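
The mapping table above can be applied to tagged tokens with a plain dictionary lookup. The following is a minimal sketch, not PyThaiNLP's actual code: the names `LST20_TO_UD` and `lst20_to_ud` are hypothetical, the sample input is illustrative, and falling back to `X` for unknown tags is an assumption made for this example.

```python
from typing import Dict, List, Tuple

# LST20 -> UD mapping, copied from the table above.
LST20_TO_UD: Dict[str, str] = {
    "AJ": "ADJ",
    "AV": "ADV",
    "AX": "AUX",
    "CC": "CCONJ",
    "CL": "NOUN",
    "FX": "NOUN",
    "IJ": "INTJ",
    "NN": "NOUN",
    "NU": "NUM",
    "PA": "PART",
    "PR": "PROPN",
    "PS": "ADP",
    "PU": "PUNCT",
    "VV": "VERB",
    "XX": "X",
}


def lst20_to_ud(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Convert (word, LST20 tag) pairs to (word, UD tag) pairs.

    Tags missing from the table map to "X" (an assumption of this sketch).
    """
    return [(word, LST20_TO_UD.get(tag, "X")) for word, tag in tagged]


# Illustrative input: a verb, a numeral, and a tag that is not in the table.
converted = lst20_to_ud([("ใช้", "VV"), ("2", "NU"), ("?", "ZZ")])
```

Because `CL`, `FX`, and `NN` all map to `NOUN`, the conversion loses information and cannot be reversed.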

For NE tags, we use the `Inside-outside-beginning (IOB) <https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)>`_ format to tag each word.

The *B-* prefix indicates the beginning token of a chunk. The *I-* prefix indicates an intermediate token within the chunk. *O* indicates that the token does not belong to any NE chunk.

For instance, given a sentence "บารัค โอบามาเป็นประธานธิปดี", it would tag the tokens "บารัค", "โอบามา", "เป็น", "ประธานาธิปดี" with "B-PERSON", "I-PERSON", "O", and "O" respectively.
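
To make the IOB scheme concrete, the sketch below groups IOB-tagged tokens back into entity chunks, using the tokens and tags from the example sentence. The function `iob_to_chunks` is a hypothetical helper written for this illustration and is not part of PyThaiNLP. Treating a stray *I-* tag (one that does not continue an open chunk of the same type) as outside any chunk is an assumption of this sketch.

```python
from typing import List, Optional, Tuple


def iob_to_chunks(tokens: List[str], tags: List[str]) -> List[Tuple[str, List[str]]]:
    """Group IOB-tagged tokens into (entity_type, tokens) chunks."""
    chunks: List[Tuple[str, List[str]]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new chunk; close any open one first.
            if current_type is not None:
                chunks.append((current_type, current_tokens))
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # An I- tag of the same type extends the open chunk.
            current_tokens.append(token)
        else:
            # "O" (or a stray I- tag) closes the open chunk, if any.
            if current_type is not None:
                chunks.append((current_type, current_tokens))
            current_type = None
            current_tokens = []

    # Flush a chunk that runs to the end of the sentence.
    if current_type is not None:
        chunks.append((current_type, current_tokens))
    return chunks


tokens = ["บารัค", "โอบามา", "เป็น", "ประธานาธิปดี"]
tags = ["B-PERSON", "I-PERSON", "O", "O"]
chunks = iob_to_chunks(tokens, tags)
```

Here `chunks` holds a single `PERSON` chunk made of the first two tokens; the two `O` tokens are left out.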

The following table shows named entity (NE) tags as used in PyThaiNLP:

============================ =================================
Named Entity Recognition tag Examples
Named Entity tag Examples
============================ =================================
DATE 2/21/2004, 16 ก.พ., จันทร์
TIME 16.30 น., 5 วัน, 1-3 ปี
Expand Down Expand Up @@ -214,3 +253,6 @@ References
.. [#Sornlertlamvanich_2000] Virach Sornlertlamvanich, Naoto Takahashi and Hitoshi Isahara. (2000).
Building a Thai Part-Of-Speech Tagged Corpus (ORCHID).
The Journal of the Acoustical Society of Japan (E), Vol.20, No.3, pp 189-198, May 1999.
.. [#Prachya_2020] Prachya Boonkwan and Vorapon Luantangsrisuk and Sitthaa Phaholphinyo and Kanyanat Kriengket and Dhanon Leenoi and Charun Phrombut and Monthika Boriboon and Krit Kosawat and Thepchai Supnithi. (2020).
The Annotation Guideline of LST20 Corpus.
arXiv:2008.05055
1 change: 1 addition & 0 deletions docs/api/util.rst
Expand Up @@ -12,6 +12,7 @@ Modules
.. autofunction:: collate
.. autofunction:: dict_trie
.. autofunction:: digit_to_text
.. autofunction:: display_thai_char
.. autofunction:: eng_to_thai
.. autofunction:: find_keyword
.. autofunction:: countthai
2 changes: 1 addition & 1 deletion pythainlp/__init__.py
@@ -1,5 +1,5 @@
# -*- coding: utf-8 -*-
__version__ = "2.2.3"
__version__ = "2.2.4"

thai_consonants = "กขฃคฅฆงจฉชซฌญฎฏฐฑฒณดตถทธนบปผฝพฟภมยรลวศษสหฬอฮ" # 44 chars

2 changes: 2 additions & 0 deletions pythainlp/corpus/__init__.py
Expand Up @@ -18,6 +18,7 @@
"get_corpus_path",
"provinces",
"remove",
"thai_family_names",
"thai_female_names",
"thai_male_names",
"thai_negations",
Expand Down Expand Up @@ -86,6 +87,7 @@ def corpus_db_path() -> str:
from pythainlp.corpus.common import (
countries,
provinces,
thai_family_names,
thai_female_names,
thai_male_names,
thai_negations,