Schedule
- First Beta release: 5 February 2024
- Production release: 10 February 2024
See 5.0 Milestone.
What is new?
License information
- Use SPDX license identifiers in the header of source code files #876
Deprecation and other API changes
- Change default NER to thainer-v2 5e97e7c
- Move pythainlp.util.is_native_thai to pythainlp.morpheme.is_native_thai 524759a
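Code that must run on both 4.x and 5.0 could resolve the moved function at import time. The following is a minimal sketch using only the standard library; import_first is a hypothetical helper written for this note, not part of PyThaiNLP, and it simply yields None when neither location can be imported.

```python
from importlib import import_module


def import_first(candidates):
    """Return the first attribute found among (module_name, attr_name) pairs.

    Hypothetical helper for illustration; returns None if nothing matches.
    """
    for module_name, attr_name in candidates:
        try:
            module = import_module(module_name)
        except ImportError:
            # Module (or its parent package) is not available; try the next one.
            continue
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    return None


# Prefer the 5.0 location, then fall back to the pre-5.0 location.
is_native_thai = import_first(
    [
        ("pythainlp.morpheme", "is_native_thai"),
        ("pythainlp.util", "is_native_thai"),
    ]
)
```

Callers should check that is_native_thai is not None before using it, since the sketch does not raise when PyThaiNLP is missing.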
Dependency
- Add tzdata as a dependency on Windows by @BLKSerene in #841
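The note above does not say why tzdata is needed. One likely reason, stated here as an assumption, is that Python's zoneinfo module has no system time zone database to read on Windows and relies on the tzdata package instead. The sketch below shows the lookup that fails without it, with a fixed-offset fallback added purely for illustration.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo, ZoneInfoNotFoundError


def bangkok_now():
    """Return the current time in Thailand (UTC+7, no daylight saving)."""
    try:
        # Needs a time zone database: the system one, or the tzdata package.
        tz = ZoneInfo("Asia/Bangkok")
    except ZoneInfoNotFoundError:
        # Illustrative fallback when no database is available.
        tz = timezone(timedelta(hours=7), "ICT")
    return datetime.now(tz)


print(bangkok_now().isoformat())
```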
New API
- Add pythainlp.coref for Thai coreference resolution #802
- Add wtpsplit to sentence segmentation and paragraph segmentation #804, and add paragraph_threshold to the paragraph_tokenize() function #806
- Add word approximation to pythainlp.soundex.sound #809 by @wannaphong
- Add pythainlp.wsd for Thai word sense disambiguation #818 by @wannaphong
- Add pythainlp.chat, and add WangChanGLM to pythainlp.generate #819 by @wannaphong
- Add pythainlp.cls, a param-free classification model #821 by @c4n
- Add pythainlp.el for entity linking #822 by @wannaphong
- Add pythainlp.ancient by @wannaphong in #833
- Add pythainlp.util.rhyme by @wannaphong in #849
- Add remove_trailing_repeat_consonants by @konbraphat51 in #862
- Add pythainlp.util.to_idn by @wannaphong in #875
- Add pythainlp.corpus.find_synonyms by @wannaphong in #890
- Add pythainlp.util.morse by @wannaphong in #891
- Add pythainlp.morpheme by @wannaphong in #896
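As an illustration of what converting a Thai domain name to its ASCII (IDN) form involves, here is a minimal sketch built on Python's built-in "idna" codec (IDNA 2003). The function name to_idn_sketch is hypothetical; this is not the implementation or signature of pythainlp.util.to_idn, which these notes do not describe.

```python
def to_idn_sketch(domain: str) -> str:
    """Convert a Unicode domain name to its ASCII-compatible (Punycode) form.

    Hypothetical stand-in for illustration, based only on the standard library.
    """
    # The "idna" codec encodes each dot-separated label separately and
    # prefixes non-ASCII labels with "xn--".
    return domain.encode("idna").decode("ascii")


ascii_name = to_idn_sketch("ไทย.com")
print(ascii_name)

# Decoding with the same codec recovers the Unicode form.
print(ascii_name.encode("ascii").decode("idna"))
```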
Improve
- Update code comments and clean up code by @BLKSerene in #845
- Improve the documentation by fixing typos and adding necessary details, explanations of the code, and missing details about models and examples by @Saharshjain78 in #850
- Fix tests of khavee functions by @BLKSerene in #854
- Update Git Actions versions by @bact in #878
- Fix ruff args in workflow by @bact in #880
- Revise ruff args in workflow by @bact in #881
- Fix coref return type and add fallback by @bact in #883
- Fix wrong/incompatible types, code readability by @bact in #884
- Bump protobuf from 3.20 to 3.20.2 in #885
- Add license info to /tests and README_TH.md by @bact in #886
- phayathaibert, khavee, parse: Code clean up by @bact in #889
- ruff: docstring-code-format = true by @bact in #892
Tokenizer
- Add wtpsplit engine to sentence_tokenize #804
- New paragraph_tokenize function to split Thai text into paragraphs #804
- Add paragraph_threshold to the paragraph_tokenize() function #806 by @pavaris-pm
- Add 🪿 Han-solo by @wannaphong in #830
- Fix newmm to better handle non-Thai characters in tokens #856 by @konbraphat51
- Fix incorrect passing of flags to re.split by @hauntsaninja in #832
- Add syllable_tokenize by @wannaphong in #834
- Add wanchanberta_thai_grammarly by @wannaphong in #836
- Add extra segmentation style for the paragraph_tokenize function by @pavaris-pm in #844
- Improve newmm tokenizer: change regular expression of "non-thai-characters" by @konbraphat51 in #856
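The re.split fix in #832 concerns a common pitfall in the standard library: the third positional parameter of re.split is maxsplit, not flags, so a call such as re.split(pattern, text, re.IGNORECASE) silently limits the number of splits instead of changing how the pattern matches. The sketch below reproduces the pitfall in isolation; the pattern and text are made up for illustration and are not taken from the PyThaiNLP code.

```python
import re

text = "oneXtwoxthreeXfour"

# Pitfall: a positional third argument binds to maxsplit. It is written as a
# keyword here to show what the positional call amounts to:
# re.IGNORECASE has the integer value 2, so this means maxsplit=2 with no flags.
buggy = re.split("x", text, maxsplit=re.IGNORECASE)

# Fix: pass flags by keyword so the match is actually case-insensitive.
fixed = re.split("x", text, flags=re.IGNORECASE)

print(buggy)  # ['oneXtwo', 'threeXfour']
print(fixed)  # ['one', 'two', 'three', 'four']
```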
Tag
- Add function for POS tagging with transformers by @MpolaarbearM in #857
- Update pos_tag_transformers function by @pavaris-pm in #865
- Add PhayaThaiBERT engine with new features by @pavaris-pm in #873
Chat
Translate
- Add small100 to pythainlp.translate #815 by @wannaphong
Transliterate
- Fix duplicate keys in ISO 11940 and IPA-RTGS phoneme mapping #851 #852 by @BLKSerene and @bact
- Fix duplicate key in IPA to RTGS phoneme mapping by @BLKSerene in #852
Corpus
- Add pythainlp.corpus.thai_orst_words(), a Thai word list from the Royal Society of Thailand (ORST) #810 by @wannaphong
- Add pythainlp.corpus.thai_wikipedia_titles(), a Thai word list (nouns and noun phrases) from Thai Wikipedia titles #869 by @konbraphat51
- Add pythainlp.corpus.thai_volubilis_words(), a Thai word list from the Volubilis dictionary #870 by @konbraphat51
- Add pythainlp.corpus.thai_icu_words(), a Thai word list from the ICU BreakIterator dictionary #879 by @pavaris-pm
- Rename Volubilis/Wikipedia corpus function names for consistency / Fix types by @bact in #882
Util
- Add pythainlp.util.encoding #813 by @wannaphong
- Add pythainlp.util.spell_words #817 by @wannaphong
- Add pythainlp.util.remove_trailing_repeat_consonants() #862 by @konbraphat51
New Contributors
- @pavaris-pm made their first contribution in #806
- @hauntsaninja made their first contribution in #832
- @Saharshjain78 made their first contribution in #850
- @konbraphat51 made their first contribution in #856
- @MpolaarbearM made their first contribution in #857
Full Changelog: v4.0.2...v5.0.0
Contributors
Thanks to all the contributors. (Image made with contributors-img)
If you want to contribute to PyThaiNLP, you can read Contributing to PyThaiNLP.