Schedule
- First development release: 1 May 2020 - 2.2.0-dev0
- Beta release: 15 June 2020 - 2.2.0-beta1
- Production release: 24 June 2020 - 2.2.0
- Bug fix: 27 June 2020 - 2.2.1
- Bug fix: 10 July 2020 - 2.2.2
- Bug fix: 2 Aug 2020 - 2.2.3
- Bug fix: 2 Aug 2020 - 2.2.4
- Bug fix: 17 Sep 2020 - 2.2.4
- Bug fix: 16 Nov 2020 - 2.2.5
- Bug fix: 13 Dec 2020 - 2.2.6
See 2.2 Milestone.
Tokenization
- Add graph size limit in `_onecut()` to avoid long wait for ambiguous text #333: Add a graph size limit in newmm's `_onecut()` to avoid long waits on ambiguous text (also backported to 2.1.1)
- Longest Match segment fails when the entire input text is a full word. #357: Fix the longest engine so that the last character is now consumed -- thanks @ciaranbyrne for the report
- Add CRFCut sentence segmentation #337, Add crfcut v2 model #380: Add CRFCut sentence segmentation -- thanks @opalchonlapat for the improved model
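
To illustrate the boundary condition behind #357, here is a minimal sketch of greedy longest-matching segmentation. It is not PyThaiNLP's implementation: the function name `longest_match` and the tiny word set are hypothetical. The point is only that the candidate range must include the end of the text, so an input that is exactly one dictionary word comes back as a single token.

```python
def longest_match(text, words):
    """Greedy longest-matching segmentation against a set of known words.

    Illustrative sketch only; names and behavior are assumptions,
    not PyThaiNLP's actual code.
    """
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit.
        # The upper bound may equal len(text), so the last character is
        # always reachable.
        for end in range(min(len(text), i + max_len), i, -1):
            if text[i:end] in words:
                break
        else:
            # No dictionary word starts here: emit one character as a token.
            end = i + 1
        tokens.append(text[i:end])
        i = end
    return tokens


words = {"แมว", "กิน", "ปลา"}
print(longest_match("แมวกินปลา", words))
print(longest_match("แมว", words))  # the whole input is a single word
```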
Transliteration
- Add Thai G2P #377: Add Thai Grapheme-to-Phoneme (Thai G2P) deep learning sequence-to-sequence model
Normalization
- Generalized reorder rules in text normalization #372, Add a function to remove zero-width characters #373, Add a more advanced normalize function #374: Add more normalization functions, such as removing zero-width characters and removing duplicate spaces
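
As a rough idea of what two of these normalization steps do, the sketch below strips zero-width characters and collapses repeated spaces using only the standard library. The function names (`strip_zero_width`, `collapse_spaces`, `normalize_sketch`) and the exact set of zero-width code points are assumptions made for illustration; the library's own functions may differ in name and coverage.

```python
import re

# Assumed set: zero-width space, non-joiner, joiner, and byte-order mark.
_ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\ufeff]")
_MULTI_SPACE = re.compile(r" {2,}")


def strip_zero_width(text):
    """Remove invisible zero-width characters."""
    return _ZERO_WIDTH.sub("", text)


def collapse_spaces(text):
    """Replace runs of spaces with one space and trim both ends."""
    return _MULTI_SPACE.sub(" ", text).strip()


def normalize_sketch(text):
    """Apply both steps; zero-width removal goes first so that spaces
    separated only by an invisible character are merged too."""
    return collapse_spaces(strip_zero_width(text))


print(repr(normalize_sketch("ไทย\u200b  ภาษา ")))
```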
Utilities
- Add thai_day2datetime and thai_time2time #334: Add `thaiword_to_date()` and `thaiword_to_time()`
- countthai not cover all cases #398: Fix `countthai()` to handle the case where the text has only numbers and symbols -- thanks @opalthailand for the report
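
The edge case in #398 can be pictured with a small sketch of a Thai-character percentage function. This is a hypothetical reimplementation (`thai_percent` is not a library name), and returning `0.0` when nothing is left to count is an assumption about reasonable behavior rather than a statement of what `countthai()` returns.

```python
import string

# Characters left out of the count (an assumed default for this sketch).
_IGNORED = string.whitespace + string.digits + string.punctuation


def is_thai_char(ch):
    """True if ch falls in the Unicode Thai block (U+0E00 to U+0E7F)."""
    return "\u0e00" <= ch <= "\u0e7f"


def thai_percent(text, ignore_chars=_IGNORED):
    """Percentage of counted characters that are Thai."""
    counted = [ch for ch in text if ch not in ignore_chars]
    if not counted:
        # Only digits, symbols, or whitespace: avoid dividing by zero.
        # Returning 0.0 here is this sketch's assumption.
        return 0.0
    thai = sum(1 for ch in counted if is_thai_char(ch))
    return 100.0 * thai / len(counted)


print(thai_percent("ไทยabc"))
print(thai_percent("123 !?"))
```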
Command line
- CLI commands should be verbs/actions #342: Update command and sub-command syntax -- see the command line docs
Deprecation and other API changes
- Delete deprecated code #376: Remove deprecated functions:
  - `thaicheck()` -- use `is_native_thai()` instead
  - `deletetone()` -- use `remove_tonemark()` instead
  - `dict_word_tokenize()` -- use `word_tokenize(custom_dict=)` instead
- Move non-init code out of `__init__.py` files #379, Reduce Trie `__init__` complexity #381: Trie API change:
  - Deprecate `pythainlp.tokenize.Trie`; use `pythainlp.util.Trie` instead
  - Add `add()` to the `Trie` class, so more words can be added after a `Trie` object is instantiated
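
To show why an `add()` method is useful, here is a minimal character trie that accepts new words after construction. It is a standalone sketch, not the `pythainlp.util.Trie` class: the class name `TrieSketch`, the `prefixes()` helper, and the internal layout are all assumptions chosen for illustration.

```python
_END = ""  # sentinel key marking the end of a stored word


class TrieSketch:
    """Minimal character trie (illustrative; not PyThaiNLP's Trie)."""

    def __init__(self, words=()):
        self._root = {}
        self._words = set()
        for word in words:
            self.add(word)

    def add(self, word):
        """Insert a word; can be called at any time after construction."""
        word = word.strip()
        if not word:
            return
        self._words.add(word)
        node = self._root
        for ch in word:
            node = node.setdefault(ch, {})
        node[_END] = True

    def prefixes(self, text):
        """Return every stored word that is a prefix of text, shortest first."""
        found = []
        node = self._root
        for i, ch in enumerate(text):
            node = node.get(ch)
            if node is None:
                break
            if _END in node:
                found.append(text[: i + 1])
        return found

    def __contains__(self, word):
        return word in self._words


trie = TrieSketch(["แมว"])
trie.add("แมวน้ำ")  # added after instantiation
print(trie.prefixes("แมวน้ำว่าย"))
```

A dictionary-based tokenizer can call something like `prefixes()` at each position to list candidate words, which is why being able to extend the trie in place avoids rebuilding it for every custom word.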
Dependencies
- [DEP] NumPy required in base install #353: Remove `numpy` and `pandas` requirements from base install -- thanks @mmaybeno and @PNNutkung for the report
- [DEP] Review for the removal of NLTK from requirements #400: Remove the `nltk` requirement from base install. The WordNet API needs NLTK; to use it, run `pip install pythainlp[wordnet]`
- [DEP] Choose one crfsuite module #404, Port Thai NER from sklearn-crfsuite to python-crfsuite #407: Remove the `sklearn-crfsuite` requirement. Port Thai NER from `sklearn-crfsuite` to `python-crfsuite`
- [DEP] Remove dill dependency #405: Remove the `dill` requirement
- [DEP] Provide tqdm fallback for progress bar #406: Remove the `tqdm` requirement. `tqdm` is used to show a progress bar when it is installed; otherwise, a fallback progress report is used
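
The optional-dependency pattern described for `tqdm` can be sketched as follows. This is a generic illustration, not the code used in PyThaiNLP: the helper names `fallback_progress` and `progress` and the message format are hypothetical.

```python
try:
    from tqdm import tqdm  # optional dependency
except ImportError:
    tqdm = None


def fallback_progress(items, total=None, every=1, report=print):
    """Yield items, reporting a plain-text count every `every` items."""
    for count, item in enumerate(items, start=1):
        yield item
        if count % every == 0:
            suffix = f"/{total}" if total else ""
            report(f"processed {count}{suffix}")


def progress(items, total=None):
    """Use tqdm's progress bar when available, else the plain fallback."""
    if tqdm is not None:
        return tqdm(items, total=total)
    return fallback_progress(items, total=total)


for _ in progress(range(3), total=3):
    pass  # real work would go here
```

Keeping the import inside a `try` block means the package installs and runs without `tqdm`, while users who have it get the nicer progress bar automatically.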
Others
- Make `__init__.py` small, having only necessary stuffs #378, Reduce redundant processing #388, Deprecation warning due to invalid escape sequences #394: Code improvement: move non-init code out of `__init__.py` files, etc.
- Refactor NorvigSpellChecker class #396, Add dict and other iterables support for custom_dict input #438: Refactor the `NorvigSpellChecker` class and support more types for `custom_dict` in the constructor
- Rewrite Unigram pos_tag #401: Remove dependency: the Unigram POS tagger no longer needs the `NLTK` module
- Properly check if download() is needed in get_corpus_path() #414: Refactor `get_corpus_path()` and update its logic to guarantee that the corpus will be downloaded as intended
- Add more word to words_th.txt #434: Add more words to `words_th.txt` (used in dictionary-based word segmentation)