Corpus License

Dictionaries and Word Lists

The following word lists were created by the PyThaiNLP project and are released under the Creative Commons Zero 1.0 Universal Public Domain Dedication (https://creativecommons.org/publicdomain/zero/1.0/).

| Filename | Description |
| --- | --- |
| countries_th.txt | List of countries in Thai |
| etcc.txt | List of Enhanced Thai Character Clusters |
| negations_th.txt | Negation word list |
| stopwords_th.txt | Stop word list |
| syllables_th.txt | List of Thai syllables |
| thailand_provinces_th.csv | List of Thailand provinces in Thai |
| tnc_freq.txt | Words and their frequencies, from Thai National Corpus |
| ttc_freq.txt | Words and their frequencies, from Thai Textbook Corpus |
| words_th.txt | List of Thai words |
| words_th_thai2fit_201810.txt | List of Thai words (frozen for thai2fit) |

The following word lists are from the Thai Male and Female Names Corpus (https://github.com/korkeatw/thai-names-corpus/) by Korkeat Wannapat and are released under their original license, the Creative Commons Attribution-ShareAlike 4.0 International Public License (https://creativecommons.org/licenses/by-sa/4.0/).

| Filename | Description |
| --- | --- |
| family_names_th.txt | List of family names in Thailand |
| person_names_female_th.txt | List of female names in Thailand |
| person_names_male_th.txt | List of male names in Thailand |

Models

The following language models were created by the PyThaiNLP project and are released under the Creative Commons Attribution 4.0 International Public License (https://creativecommons.org/licenses/by/4.0/).

| Filename | Description |
| --- | --- |
| pos_orchid_perceptron.json | Part-of-speech tagging model, trained on ORCHID data, using perceptron |
| pos_orchid_unigram.json | Part-of-speech tagging model, trained on ORCHID data, using unigram |
| pos_ud_perceptron-v0.2.json | Part-of-speech tagging model, trained on the Parallel Universal Dependencies treebank, using perceptron |
| pos_ud_unigram-v0.2.json | Part-of-speech tagging model, trained on the Parallel Universal Dependencies treebank, using unigram |
| sentenceseg_crfcut.model | Sentence segmentation model, trained on TED subtitles, using CRF |
| tdtb-pt_tagger.json | Part-of-speech tagging model, trained on the Thai Discourse Treebank, using perceptron |
| tdtb-unigram_tagger.json | Part-of-speech tagging model, trained on the Thai Discourse Treebank, using unigram |
| pos_tud_perceptron.json | Part-of-speech tagging model, trained on Thai Universal Dependency Treebank data, using perceptron |
| pos_tud_unigram.json | Part-of-speech tagging model, trained on Thai Universal Dependency Treebank data, using unigram |

Thai Dictionary for ICU BreakIterator

A Thai word list from the ICU (International Components for Unicode) project (icubrk_th.txt) is copyrighted by Unicode, Inc. and others, and is released under the Unicode License Agreement - Data Files and Software (2016) (http://www.unicode.org/copyright.html).

Original data: https://github.com/unicode-org/icu/blob/main/icu4c/source/data/brkitr/dictionaries/thaidict.txt

Thai WordNet

Thai WordNet (wordnet_th.db) was created by the Thai Computational Linguistic Laboratory at the National Institute of Information and Communications Technology (NICT), Japan, and is released under the following license:

Copyright: 2011 NICT

Thai WordNet

This software and database is being provided to you, the LICENSEE, by
the National Institute of Information and Communications Technology
under the following license.  By obtaining, using and/or copying this
software and database, you agree that you have read, understood, and
will comply with these terms and conditions:

  Permission to use, copy, modify and distribute this software and
  database and its documentation for any purpose and without fee or
  royalty is hereby granted, provided that you agree to comply with
  the following copyright notice and statements, including the
  disclaimer, and that the same appear on ALL copies of the software,
  database and documentation, including modifications that you make
  for internal use or for distribution.

Thai WordNet Copyright 2011 by the National Institute of
Information and Communications Technology (NICT).  All rights
reserved.

THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND NICT MAKES NO
REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED.  BY WAY OF EXAMPLE,
BUT NOT LIMITATION, NICT MAKES NO REPRESENTATIONS OR WARRANTIES OF
MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE
OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE
ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS.

The name of the National Institute of Information and Communications
Technology may not be used in advertising or publicity pertaining to
distribution of the software and/or database.  Title to copyright in
this software, database and any associated documentation shall at all
times remain with National Institute of Information and Communications
Technology and LICENSEE agrees to preserve same.

For more information about Thai WordNet, see S. Thoongsup et al., ‘Thai WordNet construction’, in Proceedings of the 7th Workshop on Asian Language Resources, Suntec, Singapore, Aug. 2009, pp. 139–144. https://www.aclweb.org/anthology/W09-3420.pdf

Thai Wikipedia Titles

The Thai Wikipedia titles corpus (wikipedia_titles.txt) was prepared by konbraphat51 from a Thai Wikipedia dump dated 21 November 2023 and is released under its original license, the Creative Commons Attribution-ShareAlike 4.0 International Public License (https://creativecommons.org/licenses/by-sa/4.0/).

Original data: https://dumps.wikimedia.org/thwiki/latest/thwiki-latest-all-titles.gz

Preparation code: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/

Volubilis

A corpus of Thai words registered in the Volubilis dictionary (volubilis.txt) was prepared by konbraphat51 using data from Volubilis 23.1 (Mar. 2023) by Francis Bastien and is released under its original license, the Creative Commons Attribution-ShareAlike 4.0 International Public License (https://creativecommons.org/licenses/by-sa/4.0/).

Original data: https://belisan-volubilis.blogspot.com/

Preparation code: https://github.com/konbraphat51/Thai_Dictionary_Cleaner/