PyThaiNLP: Thai Natural Language Processing in Python

PyThaiNLP is a Python package for text processing and linguistic analysis, similar to NLTK, with a focus on the Thai language.

PyThaiNLP is a Python library for natural language processing, with a focus on Thai. See the ภาษาไทย section below for details.

News

We are conducting a 2-minute survey to learn more about your experience with the library and what you expect it to be able to do. Take part in the survey: https://forms.gle/aLdSHnvkNuK5CFyt9

The latest stable release is 2.2.2. See the 2.2 change log.
For the latest development, see the dev branch and the ongoing 2.3 development change log.

Using PyThaiNLP:

PyThaiNLP Get Started
More tutorials at https://www.thainlp.org/pythainlp/tutorials/
See full documentation at https://thainlp.org/pythainlp/docs/2.2/
Some additional data (like word lists and language models) may be downloaded automatically at runtime. By default it is kept under the directory ~/pythainlp-data. See the corpus catalog at https://github.com/PyThaiNLP/pythainlp-corpus.
The data location can be changed with the PYTHAINLP_DATA_DIR environment variable.
For PyThaiNLP tokenization performance and measurement methods, see the tokenization benchmark.
📫 Follow our PyThaiNLP Facebook page
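
The data directory rule described above (use PYTHAINLP_DATA_DIR if it is set, otherwise ~/pythainlp-data) can be sketched with the standard library alone. This is only an illustration of the documented behavior, not PyThaiNLP's own code; the helper name `resolve_data_dir` is hypothetical.

```python
import os
from pathlib import Path


def resolve_data_dir(environ=os.environ) -> Path:
    """Hypothetical helper (not part of PyThaiNLP) mirroring the documented rule.

    If PYTHAINLP_DATA_DIR is set and non-empty, use it; otherwise fall back
    to the default ~/pythainlp-data directory.
    """
    custom = environ.get("PYTHAINLP_DATA_DIR")
    if custom:
        return Path(custom).expanduser()
    return Path.home() / "pythainlp-data"


# Show where data would be expected to live in the current environment.
print(resolve_data_dir())
```

To redirect the data for a single script, one option is to set `os.environ["PYTHAINLP_DATA_DIR"]` at the top of the script, before PyThaiNLP is imported; setting it that early is an assumption made here to stay on the safe side.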

Capabilities

Convenient character and word classes, like Thai consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords) -- comparable to constants like string.ascii_letters, string.digits, and string.punctuation
Thai linguistic unit segmentation/tokenization, including sentence (sent_tokenize), word (word_tokenize), and subword segmentation based on Thai Character Clusters (subword_tokenize)
Thai part-of-speech taggers (pos_tag)
Thai spelling suggestion and correction (spell and correct)
Thai transliteration (transliterate)
Thai soundex (soundex) with three engines (lk82, udom83, metasound)
Thai collation, sorting by dictionary order (collate)
Reading out numbers as Thai words (bahttext, num_to_thaiword)
Thai datetime formatting (thai_strftime)
Fix for text typed with the wrong Thai-English keyboard layout (eng_to_thai, thai_to_eng)
Command-line interface for basic functions, like tokenization and POS tagging (run thainlp in your shell)
and much more; see examples in the tutorials.
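
As a small illustration of the word segmentation entry in the list above, the sketch below calls word_tokenize from pythainlp.tokenize with its default settings. The wrapper name `tokenize_thai` and the sample sentence are made up for this example, and the wrapper returns None when PyThaiNLP is not installed so that the snippet still runs.

```python
def tokenize_thai(text):
    """Hypothetical wrapper: tokenize Thai text, or return None if PyThaiNLP is missing."""
    try:
        # word_tokenize is the word segmentation function listed above.
        from pythainlp.tokenize import word_tokenize
    except ImportError:
        return None
    return word_tokenize(text)


sample = "ผมรักภาษาไทย"  # sample sentence chosen for this sketch
tokens = tokenize_thai(sample)

if tokens is None:
    print("PyThaiNLP is not installed; run: pip install pythainlp")
else:
    print(tokens)  # a list of word strings
```

The other functions in the list follow the same call-a-function-on-a-string pattern; see the tutorials for their exact arguments.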

Installation

PyThaiNLP uses PyPI as its main distribution channel; see https://pypi.org/project/pythainlp/

Stable release

pip install pythainlp

Development pre-release

pip install --upgrade --pre pythainlp

Fresh from dev branch

pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip

Install options

For some functionalities, like named-entity recognition, extra packages may be needed. Install them with these install options:

pip install pythainlp[extra1,extra2,...]

where `extras` can be

attacut (to support attacut, a fast and accurate tokenizer)
benchmarks (for word tokenization benchmarking)
icu (for ICU, International Components for Unicode, support in transliteration and tokenization)
ipa (for IPA, International Phonetic Alphabet, support in transliteration)
ml (to support ULMFiT models for classification)
thai2fit (for Thai word vector)
thai2rom (for machine-learnt romanization)
wordnet (for Thai WordNet API)
full (install everything)

For dependency details, see the extras variable in setup.py.
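
Because the extras are optional, a script may want to check whether a dependency is present before relying on it. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical, and "attacut" is assumed to be the import name that the attacut extra provides.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "attacut" as an import name is an assumption; adjust to the package you need.
for name in ("pythainlp", "attacut"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a module is reported as missing, installing the matching extra (for example pip install pythainlp[attacut]) should provide it.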

Command-line

Some PyThaiNLP functionality can be used from the command line with the thainlp command.

For example, displaying a catalog of datasets:

thainlp data catalog

Showing usage help:

thainlp help
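
The same commands can also be driven from a Python script. The sketch below is one possible wrapper, not something PyThaiNLP provides: the helper name `run_thainlp` is hypothetical, and it returns None when the thainlp executable is not on PATH instead of failing.

```python
import shutil
import subprocess


def run_thainlp(*args):
    """Run the thainlp CLI with the given arguments and return its stdout.

    Returns None if the thainlp executable cannot be found on PATH.
    """
    exe = shutil.which("thainlp")
    if exe is None:
        return None
    result = subprocess.run(
        [exe, *args],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        encoding="utf-8",
        errors="replace",  # avoid decode errors on consoles with other encodings
    )
    return result.stdout


output = run_thainlp("help")
print(output if output is not None else "thainlp not found on PATH")
```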

Python 2 Users

PyThaiNLP 2 supports Python 3.6+. Some functions may work with older versions of Python 3, but they are not well tested and will not be supported. See the 1.7 -> 2.0 change log.
- Upgrading from 1.7
- Upgrading ThaiNER from 1.7
Python 2.7 users can use PyThaiNLP 1.6.

Citations

If you use PyThaiNLP in your project or publication, please cite the library as follows:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354

or use this BibTeX entry:

@misc{pythainlp,
    author       = {Wannaphong Phatthiyaphaibun and Korakot Chaovavanich and Charin Polpanumas and Arthit Suriyawongkul and Lalita Lowphansirikul and Pattarawat Chormai},
    title        = {{PyThaiNLP: Thai Natural Language Processing in Python}},
    month        = Jun,
    year         = 2016,
    doi          = {10.5281/zenodo.3519354},
    publisher    = {Zenodo},
    url          = {http://doi.org/10.5281/zenodo.3519354}
}

Contribute to PyThaiNLP

Please do fork and create a pull request :)
For style guide and other information, including references to algorithms we use, please refer to our contributing page.

Licenses

PyThaiNLP source code and notebooks are released under the Apache Software License 2.0.
All corpora, datasets, and documentation created by the PyThaiNLP project are released under the Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0).
All language models created by the PyThaiNLP project are released under the Creative Commons Attribution 4.0 International Public License (CC-by).
For more information about corpora and models created by the PyThaiNLP project, see PyThaiNLP Corpus.
For other corpora and models that may be included with the PyThaiNLP distribution, please consult the Corpus License.

ภาษาไทย (Thai)

PyThaiNLP is a Python library for natural language processing, with a focus on Thai. It is distributed free of charge (forever) for Thai people and everyone around the world!

Because the world keeps moving forward through sharing.

News

Hello! The PyThaiNLP development team would like to hear from PyThaiNLP users and from anyone working on Thai language processing, so that we can improve the library and develop new features that better match your needs. You can answer the survey at https://forms.gle/aLdSHnvkNuK5CFyt9 (it takes about 2-5 minutes).

The latest stable release is 2.2.2.
PyThaiNLP 2 supports Python 3.6 and above.
- Python 2.7 users can still use PyThaiNLP 1.6.
📫 Follow the news on the PyThaiNLP Facebook page

Using PyThaiNLP:

Get started with PyThaiNLP
More tutorials, in notebook form: https://www.thainlp.org/pythainlp/tutorials/
Full documentation: https://thainlp.org/pythainlp/docs/2.2/
While running, PyThaiNLP may download additional data, such as language models and word lists. By default, this data is stored in the directory ~/pythainlp-data.
This location can be customized by setting the operating system environment variable PYTHAINLP_DATA_DIR.

Capabilities

Convenient Thai character and word constants, such as consonants (pythainlp.thai_consonants), vowels (pythainlp.thai_vowels), Thai digits (pythainlp.thai_digits), and stop words (pythainlp.corpus.thai_stopwords) -- similar to constants like string.ascii_letters, string.digits, and string.punctuation
Thai linguistic unit segmentation, including sentence segmentation (sent_tokenize), word segmentation (word_tokenize), and subword segmentation using Thai Character Clusters (subword_tokenize)
Thai part-of-speech tagging (pos_tag)
Thai spelling suggestion and correction (spell, correct)
Transliteration of Thai into Latin script and phonetic symbols (transliterate)
Thai soundex (soundex) with 3 methods (lk82, udom83, metasound)
Sorting words in Thai dictionary order (collate)
Reading numbers out as Thai text (bahttext, num_to_thaiword)
Thai date and time formatting (thai_strftime)
Fixing text typed with the wrong keyboard language (eng_to_thai, thai_to_eng)
and more; see examples in the tutorials.

Installation

Stable release

pip install pythainlp

Pre-release

pip install --upgrade --pre pythainlp

Development version (dev branch)

pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip

Installing additional capabilities

Some capabilities, such as named-entity recognition, require extra packages. Install them by passing options to pip install:

pip install pythainlp[extra1,extra2,...]

where extras can be

attacut (a tokenizer that is more accurate than newmm when measured on the BEST dataset)
benchmarks (tools for measuring tokenizer accuracy)
icu (for transliteration to phonetic symbols and tokenization with ICU)
ipa (for transliteration to the International Phonetic Alphabet (IPA))
ml (to support ULMFiT models)
thai2fit (for word vectors)
thai2rom (for romanization)
wordnet (for the Thai WordNet API)
full (install everything)

For details of the extra packages, see the extras variable in setup.py.

Command-line usage

Some PyThaiNLP capabilities can be used from the command line by running thainlp in the shell.

For example, to list the datasets available for installation:

thainlp data catalog

To see the available commands:

thainlp help

Citations

If you use PyThaiNLP in your project or research, you can cite it as follows:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, & Pattarawat Chormai. (2016, Jun 27). PyThaiNLP: Thai Natural Language Processing in Python. Zenodo. http://doi.org/10.5281/zenodo.3519354

or use this BibTeX entry:

@misc{pythainlp,
    author       = {Wannaphong Phatthiyaphaibun and Korakot Chaovavanich and Charin Polpanumas and Arthit Suriyawongkul and Lalita Lowphansirikul and Pattarawat Chormai},
    title        = {{PyThaiNLP: Thai Natural Language Processing in Python}},
    month        = Jun,
    year         = 2016,
    doi          = {10.5281/zenodo.3519354},
    publisher    = {Zenodo},
    url          = {http://doi.org/10.5281/zenodo.3519354}
}

Support and contribute

Everyone can contribute to this project by forking it and sending a pull request back.
Read the contributing guide.

Licenses

PyThaiNLP source code and notebooks (IPython Notebook) are released under the Apache Software License 2.0.
Language resources, word lists, and documentation created by the PyThaiNLP project are released under the Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0).
Language models created by the PyThaiNLP project are released under the Creative Commons Attribution 4.0 International Public License (CC-by).
For more information about language resources and language models created by the PyThaiNLP project, see PyThaiNLP Corpus.
Word lists, language models, and other data distributed with the PyThaiNLP package may use other licenses; please see the Corpus License document.
The logo was designed by คุณ วรุตม์ พสุธาดล, chosen through a contest held in the Facebook group 1 2

Sponsors

Since 2019 (B.E. 2562), contributions to PyThaiNLP by Korakot Chaovavanich and Lalita Lowphansirikul have been supported by the VISTEC-depa Thailand Artificial Intelligence Research Institute.

Made with ❤️
The PyThaiNLP team
"We build Thai NLP"
