- 2.1 released on 10 Dec 2019
- 2.1.1 released on 19 Dec 2019
- 2.1.2 released on 31 Dec 2019
- 2.1.3 released on 10 Jan 2020
- 2.1.4 released on 7 Feb 2020
Corpus
- Add Thai female, male names corpus (Add Thai female, male names corpus #217 [Corpus] Include Thai person names directly into the package #297) - thanks @korkeatw @c4n @bact
- Thai male, female names corpus https://github.com/korkeatw/thai-names-corpus
- General Election 2019 candidate names https://github.com/codeforthailand/dataset-election-62-candidates/tree/master/data
- Add `PYTHAINLP_DATA_DIR` environment variable to set the location of downloaded data (default is `~/pythainlp-data`) (add option of setting data dir with an enviromental variable #238, Added docs on PYTHAINLP_DATA_DIR environ variable #294) - thanks @dhpollack @abhabongse
- Remove race condition when creating the data directory (Remove racing condition in making pythainlp data directory #278) - thanks @abhabongse
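
The two data-directory items above can be illustrated with a minimal sketch. It assumes only what the notes state: the variable name `PYTHAINLP_DATA_DIR` and the default `~/pythainlp-data`. The helper `resolve_data_dir` is hypothetical and is not PyThaiNLP's own code; it shows one common way to avoid a check-then-create race when making a directory.

```python
import os
from pathlib import Path

# Variable name and default location as given in the release notes.
ENV_VAR = "PYTHAINLP_DATA_DIR"
DEFAULT_DIR = "~/pythainlp-data"


def resolve_data_dir() -> Path:
    """Hypothetical helper: choose the data directory and make sure it exists."""
    raw = os.environ.get(ENV_VAR) or DEFAULT_DIR
    path = Path(raw).expanduser()
    # exist_ok=True avoids the "check, then create" race: if another process
    # creates the directory in between, this call still succeeds.
    path.mkdir(parents=True, exist_ok=True)
    return path
```

In a shell, the variable would typically be set before starting Python, for example `export PYTHAINLP_DATA_DIR=/tmp/thai-data`.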
Localization
- Add `pythainlp.util.thai_time`, which spells out time in Thai words (Add pythainlp.util.thai_time #303) - thanks @wannaphong @abhabongse @bact
- Fix `bahttext` bug for a value of one million (bahttext not working for 1,000,000 #350) - thanks @wannaphong
Tokenizer
- `pythainlp.tokenize.Tokenizer` is now immediately available after `import pythainlp` (79432c2) - thanks @korakot
- Add `ssg`, a CRF syllable segmenter (Questions on the implementation of syllable_tokenize #229, Alternative syllable tokenizer #237, Add ssg #242) - thanks @wannaphong @ponrawee @heytitle
- Add `AttaCut`, a fast and accurate tokenizer, available through `engine="attacut"` in `pythainlp.tokenize.word_tokenize()` (Integrate AttaCut to PyThaiNLP #258, add attacut to pythainlp/tokenize #261) - thanks @heytitle @bkktimber
- Tokenization benchmark (Add tokenization-benchmark to PyThaiNLP #248, Tokenization benchmark miscalculate word-level metrics #268, Fix tokenization benchmark issue #269) - thanks @wannaphong @heytitle
- New engine `newmm-safe` for `pythainlp.tokenize.word_tokenize()`: a `newmm` engine with an additional mechanism to avoid a possibly exponentially long wait on long text with many ambiguous breaking points ("newmm-safe" option -- fix newmm issue, take too long time for long text with lots of ambiguity breaking points #302) - thanks @bact
- Fix `newmm` engine, to help avoid possible long wait (Add graph size limit in _onecut() to avoid long wait for ambiguous text #333) (available in 2.1.1, backport from 2.2) - thanks @bact
- Fix `longest` engine, last character is now consumed (Longest Match segment fails when the entire input text is a full word. #357) (available in 2.1.4) - thanks @bact
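
To make the `longest` fix concrete, here is a minimal sketch of greedy longest-matching segmentation over a word list. It is an illustration of the general technique only, not PyThaiNLP's implementation; the function name `longest_match` and the tiny word list are made up for the example. The second call covers the case from #357, where the whole input is a single dictionary word and the final character must still be consumed.

```python
from typing import Iterable, List


def longest_match(text: str, words: Iterable[str]) -> List[str]:
    """Greedy longest-matching segmentation over a word list (illustrative only)."""
    vocab = set(words)
    max_len = max(1, max((len(w) for w in vocab), default=1))
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first. The slice end is clamped to
        # len(text), so a word ending at the final character is still found.
        end = min(len(text), i + max_len)
        while end > i + 1 and text[i:end] not in vocab:
            end -= 1
        # Falls back to a single character when nothing in vocab matches.
        tokens.append(text[i:end])
        i = end
    return tokens


print(longest_match("แมวกินปลา", ["แมว", "กิน", "ปลา"]))  # ['แมว', 'กิน', 'ปลา']
print(longest_match("แมว", ["แมว"]))  # ['แมว']
```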
Spellchecker
- Skip spell checking for numeric strings (numeric string gives "ใน" as output in some length of string. #276, Fix "ใน" correction when pass numeric type into correct function in spell module #288) - thanks @nawaphonOHM @Peradon
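
As a rough illustration of the idea, a guard like the one below returns numeric input unchanged instead of passing it to a corrector. Both function names are hypothetical, and the `corrector` argument stands in for whatever spell-correction function is in use; this is not the code from #288.

```python
from typing import Callable


def is_numeric_string(word: str) -> bool:
    """True for digit strings such as '123', '1,000' or '3.14' (Thai digits included)."""
    stripped = word.replace(",", "").replace(".", "", 1)
    return stripped.isdecimal()


def correct_unless_numeric(word: str, corrector: Callable[[str], str]) -> str:
    """Return numeric strings unchanged; send everything else to `corrector`."""
    if is_numeric_string(word):
        return word
    return corrector(word)
```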
Named-Entity Tagger
- Add html-like tag in output (NER: Add html-like tag in output #262 ThaiNER : The output of the html-like is incorrect. #346) - thanks @wannaphong
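
The notes do not show the exact output format, so the following is only a sketch of what "html-like" tagging can look like. It assumes tokens labelled with a BIO scheme (`B-DATE`, `I-DATE`, `O`); the function name `to_html_like` and the sample tags are illustrative assumptions, not PyThaiNLP's API.

```python
from typing import List, Tuple


def to_html_like(tagged: List[Tuple[str, str]]) -> str:
    """Wrap each B-/I- entity span in <TYPE>...</TYPE>; 'O' tokens pass through."""
    out: List[str] = []
    open_type = ""  # entity type of the span currently open, "" if none
    for word, tag in tagged:
        prefix, _, ent_type = tag.partition("-")
        continues = prefix == "I" and ent_type == open_type
        if open_type and not continues:
            # The current span ends before this token.
            out.append(f"</{open_type}>")
            open_type = ""
        if prefix == "B" or (prefix == "I" and not continues):
            # Start a new span (a stray I- tag is treated like B-).
            out.append(f"<{ent_type}>")
            open_type = ent_type
        out.append(word)
    if open_type:
        out.append(f"</{open_type}>")
    return "".join(out)


sample = [("วันที่", "O"), (" ", "O"), ("15", "B-DATE"), (" ", "I-DATE"),
          ("ก.ย.", "I-DATE"), (" ", "I-DATE"), ("61", "I-DATE")]
print(to_html_like(sample))  # วันที่ <DATE>15 ก.ย. 61</DATE>
```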
Dependency cleaning
Removing and updating many dependencies - thanks @c4n @artificiala @cstorm125 @korakot @bact @wannaphong
Remove:
- `keras`, `tensorflow` (Port Thai2Rom from Keras to PyTorch #202, Thai2Rom on PyTorch (seq2seq no attention mechanism) #235, pytorch seq2seq implementation for Thai romanization #246) - Thai romanization is now implemented in PyTorch
- `fastai` (Remove `fastai` from the dependencies #252) - removed; the `pythainlp.ulmfit` preprocessing-related code is replaced with a self-implemented one
- `marisa-trie` (Change from `marisa-trie` to a Trie implementation written in python #277) - removed and replaced with a native Trie implementation
- `deepcut` (Remove deepcut, keras, tensorflow from dependencies #283) - removed; the word tokenizer still supports `engine="deepcut"`, but users need to install the dependencies (`deepcut`, `keras`, `tensorflow`) themselves
Update:
- `artagger` (Use artagger from main repo, use tensorflow < 2 #281) - updated to use the one from the main repo (previously depended on a fork)
- Include only direct dependencies in `setup.py` (Include only direct dependency in setup.py #275)
- Push the version requirement for dependencies to the lowest possible (Minimum possible version requirement #292)
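
For readers unfamiliar with the data structure behind the `marisa-trie` replacement, the sketch below shows a minimal pure-Python trie with membership and prefix lookup, the operations a dictionary-based tokenizer typically needs. The class and method names are assumptions for illustration and may differ from the Trie added in #277.

```python
from typing import Dict, Iterable, List


class _Node:
    __slots__ = ("children", "is_end")

    def __init__(self) -> None:
        self.children: Dict[str, "_Node"] = {}
        self.is_end = False


class Trie:
    """Minimal dictionary trie supporting membership and prefix lookup."""

    def __init__(self, words: Iterable[str] = ()) -> None:
        self._root = _Node()
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        node = self._root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = _Node()
            node = node.children[ch]
        node.is_end = True

    def __contains__(self, word: str) -> bool:
        node = self._root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

    def prefixes(self, text: str) -> List[str]:
        """Return every stored word that is a prefix of `text`, shortest first."""
        found: List[str] = []
        node = self._root
        for i, ch in enumerate(text):
            node = node.children.get(ch)
            if node is None:
                break
            if node.is_end:
                found.append(text[: i + 1])
        return found


trie = Trie(["กิน", "กินข้าว", "ข้าว"])
print(trie.prefixes("กินข้าวเย็น"))  # ['กิน', 'กินข้าว']
```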
Documentation
- Docstring and type annotation fixes (pythainlp.spell: Fix type annotations, docstring spellings, etc. #279) - thanks @abhabongse
- Updated tutorial notebooks and moved to https://github.com/PyThaiNLP/tutorials (Remove tutorial notebooks from the PyThaiNLP/pythainlp repository (#270) #282) - thanks @artificiala @cstorm125
- Citation fix ([WIP] Update documents #284) - thanks @heytitle
- Docstring code and output example style changed to make the code easier to copy and paste (Improve document #293) - thanks @heytitle
Others
- Fix normalization, to include case THANTHAKHAT and SARA U, SARA UU (Include case THANTHAKHAT and SARA U, UU too #244) - thanks @korakot @ekapolc
- Better command-line interface (Better CLI #251, Update command_line.rst #271) - thanks @heytitle @wannaphong
- Improve code readability (Improve readability of some thai characters #287 Remove magic number #290 Redefine range loop #291) - thanks @boomsquared @Peradon
- Refactor:
- Refactor the test files (Issue #224: Refactor the test file #234) - thanks @artificiala
- Optimize keyboard layout switching translation code & digits translation (Optimize keyboard layout switching translation code & digits translation #280) - thanks @abhabongse
- Refactor util package as well as improve performance (Refactor util package as well as improve performance #295) - thanks @Peradon