- Corpora, datasets, and documentation created by the PyThaiNLP project are released under the Creative Commons Zero 1.0 Universal Public Domain Dedication License (CC0).
- Language models created by the PyThaiNLP project are released under the Creative Commons Attribution 4.0 International Public License (CC BY).
- For more information about the corpora that PyThaiNLP uses, see https://github.com/PyThaiNLP/pythainlp-corpus/.
The following word lists are created by the PyThaiNLP project and released under the Creative Commons Zero 1.0 Universal Public Domain Dedication License https://creativecommons.org/publicdomain/zero/1.0/
Filename | Description |
---|---|
countries_th.txt | List of countries in Thai |
etcc.txt | List of Enhanced Thai Character Clusters |
negations_th.txt | Negation word list |
stopwords_th.txt | Stop word list |
syllables_th.txt | List of Thai syllables |
thailand_provinces_th.csv | List of Thailand provinces in Thai |
tnc_freq.txt | Words and their frequencies, from Thai National Corpus |
ttc_freq.txt | Words and their frequencies, from Thai Textbook Corpus |
words_th.txt | List of Thai words |
words_th_thai2fit_201810.txt | List of Thai words (frozen for thai2fit) |
The following word lists are from the Thai Male and Female Names Corpus https://github.com/korkeatw/thai-names-corpus/ by Korkeat Wannapat and are released under their original license, the Creative Commons Attribution-ShareAlike 4.0 International Public License https://creativecommons.org/licenses/by-sa/4.0/
Filename | Description |
---|---|
family_names_th.txt | List of family names in Thailand |
person_names_female_th.txt | List of female names in Thailand |
person_names_male_th.txt | List of male names in Thailand |
The following language models are created by the PyThaiNLP project and released under the Creative Commons Attribution 4.0 International Public License https://creativecommons.org/licenses/by/4.0/
Filename | Description |
---|---|
pos_orchid_perceptron.json | Part-of-speech tagging model, trained from ORCHID data, using perceptron |
pos_orchid_unigram.json | Part-of-speech tagging model, trained from ORCHID data, using unigram |
pos_ud_perceptron-v0.2.json | Part-of-speech tagging model, trained from Parallel Universal Dependencies treebank, using perceptron |
pos_ud_unigram-v0.2.json | Part-of-speech tagging model, trained from Parallel Universal Dependencies treebank, using unigram |
sentenceseg_crfcut.model | Sentence segmentation model, trained from TED subtitles, using CRF |
tdtb-pt_tagger.json | Part-of-speech tagging model, trained from The Thai Discourse Treebank, using perceptron |
tdtb-unigram_tagger.json | Part-of-speech tagging model, trained from The Thai Discourse Treebank, using unigram |
pos_tud_perceptron.json | Part-of-speech tagging model, trained from Thai Universal Dependency Treebank data, using perceptron |
pos_tud_unigram.json | Part-of-speech tagging model, trained from Thai Universal Dependency Treebank data, using unigram |
A Thai word list from the ICU (International Components for Unicode) project (icubrk_th.txt) is copyrighted by Unicode, Inc. and others, and released under the Unicode License Agreement - Data Files and Software (2016) http://www.unicode.org/copyright.html
Original data: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/dictionaries/thaidict.txt
Thai WordNet (wordnet_th.db) is created by the Thai Computational Linguistic Laboratory at the National Institute of Information and Communications Technology (NICT), Japan, and released under the following license:
Copyright: 2011 NICT
Thai WordNet
This software and database is being provided to you, the LICENSEE, by
the National Institute of Information and Communications Technology
under the following license. By obtaining, using and/or copying this
software and database, you agree that you have read, understood, and
will comply with these terms and conditions:
Permission to use, copy, modify and distribute this software and
database and its documentation for any purpose and without fee or
royalty is hereby granted, provided that you agree to comply with
the following copyright notice and statements, including the
disclaimer, and that the same appear on ALL copies of the software,
database and documentation, including modifications that you make
for internal use or for distribution.
Thai WordNet Copyright 2011 by the National Institute of
Information and Communications Technology (NICT). All rights
reserved.
THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND NICT MAKES NO
REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE,
BUT NOT LIMITATION, NICT MAKES NO REPRESENTATIONS OR WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE
ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.
The name of the National Institute of Information and Communications
Technology may not be used in advertising or publicity pertaining to
distribution of the software and/or database. Title to copyright in
this software, database and any associated documentation shall at all
times remain with National Institute of Information and Communications
Technology and LICENSEE agrees to preserve same.
For more information about Thai WordNet, see S. Thoongsup et al., ‘Thai WordNet construction’, in Proceedings of the 7th Workshop on Asian Language Resources, Suntec, Singapore, Aug. 2009, pp. 139–144. https://www.aclweb.org/anthology/W09-3420.pdf
The Thai Wikipedia titles corpus (wikipedia_titles.txt) was prepared by konbraphat51 using a Thai Wikipedia dump from 21 November 2023 and is released under its original license, the Creative Commons Attribution-ShareAlike 4.0 International Public License https://creativecommons.org/licenses/by-sa/4.0/
Original data: https://dumps.wikimedia.org/thwiki/latest/thwiki-latest-all-titles.gz
Preparation code: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/
A corpus of Thai words registered in the Volubilis dictionary (volubilis.txt) was prepared by konbraphat51 using data from Volubilis 23.1 (Mar. 2023) by Francis Bastien and is released under its original license, the Creative Commons Attribution-ShareAlike 4.0 International Public License https://creativecommons.org/licenses/by-sa/4.0/
Original data: https://belisan-volubilis.blogspot.com/
Preparation code: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/