Description
- Since there will be some API changes, this will be released as PyThaiNLP 2.0 (PyThaiNLP 1.8 -> PyThaiNLP 2.0 #154)
- Thai Text Classification Benchmarks https://github.com/PyThaiNLP/classification-benchmarks
New evaluation corpora
- Prachatai newspaper, from https://prachatai.com - with news category tags
- Wisesight-Sentiment Corpus, Thai Social Media Sentiment Dataset
- truevoice-intent, Intent Dataset from Customer Service Phone Calls Transcribed by TrueVoice's Mari
New features
- thai2fit 0.3 (formerly thai2vec) (started with commit 0a8a60d)
- Use fastai 1.0.22
- Pretrained model and inference will now use the same frozen set of words, for better accuracy
- pythainlp.transliterate.transliterate: grapheme to phoneme (Add pythainlp.g2p.ipa #139)
- New NorvigSpellChecker class - can be initialized with a custom dictionary (I'd like to add words to pythainlp.spell #119, Update Peter Norvig's spell checker to suggest words based on probability #137)
- pythainlp.util.thai_strftime for date and time formatting (uses standard datetime.strftime directives) (Utility functions: rearrange package locations + add thai_strftime() date and time formatter #160)
- Installation options for extra dependency packages (Manage extra requires + merge g2p and romanization to one transliterate module #153, artagger installation workaround on Windows test (AppVeyor) + add unit test #157)
  - Run pip install pythainlp for minimum dependencies, just enough to run the core functions of PyThaiNLP. Run pip install pythainlp[full] to install every package required for extended functions (like the machine-learned named entity recognizer, which relies on keras).
- pythainlp.util.thaicheck - Thai check (Add check thai word #171)
- Add Orchid to Universal Dependencies 166b671
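As an illustration of what a "Thai check" can mean, here is a minimal sketch that tests whether a string consists only of characters from the Unicode Thai block (U+0E00 to U+0E7F). The function names `is_thai_char` and `looks_thai` are hypothetical, and this is not necessarily how pythainlp.util.thaicheck is implemented or what it returns; please refer to the API doc for the actual behavior.

```python
# Sketch only: is_thai_char and looks_thai are hypothetical names,
# not part of the PyThaiNLP API.
THAI_BLOCK_START = 0x0E00  # first code point of the Unicode Thai block
THAI_BLOCK_END = 0x0E7F    # last code point of the Unicode Thai block


def is_thai_char(ch: str) -> bool:
    """Return True if a single character falls inside the Unicode Thai block."""
    return THAI_BLOCK_START <= ord(ch) <= THAI_BLOCK_END


def looks_thai(text: str) -> bool:
    """Return True if text has at least one non-space character and
    every non-space character is in the Thai block."""
    chars = [ch for ch in text if not ch.isspace()]
    return bool(chars) and all(is_thai_char(ch) for ch in chars)
```

A stricter or looser rule (for example, allowing digits and punctuation) may be more useful depending on the application.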
Bug fixes
- Fix metasound soundex to work as described in the Snae & Brückner (2009) paper. (Fix MetaSound + Adjust tokenizer selector + More documentation + clean code #135)
- Fix Peter Norvig's spell checker probability of candidate words (Spell-Correct: Probability of all corrected words are the same #90)
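To make the spell checker items more concrete, the following is a rough sketch of a Peter Norvig-style checker that is built from a custom word-frequency dictionary and ranks candidates by their relative frequency, so that candidates do not all get the same probability. The class name `SimpleSpellChecker`, its method names, and the Latin `LETTERS` alphabet are assumptions made for this illustration; this is not the NorvigSpellChecker source code, and a Thai checker would generate edits from Thai characters instead.

```python
from collections import Counter

# Assumption for this sketch: a small Latin alphabet for generating edits.
LETTERS = "abcdefghijklmnopqrstuvwxyz"


class SimpleSpellChecker:
    """Norvig-style checker built from a custom word-frequency dictionary."""

    def __init__(self, custom_dict):
        # custom_dict maps each word to its observed count.
        self.freq = Counter(custom_dict)
        self.total = sum(self.freq.values())

    def prob(self, word):
        """Relative frequency of word; 0.0 for unknown words."""
        return self.freq[word] / self.total if self.total else 0.0

    def known(self, words):
        """Keep only the words that appear in the dictionary."""
        return {w for w in words if w in self.freq}

    def edits1(self, word):
        """All strings that are one edit away from word."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
        inserts = [a + c + b for a, b in splits for c in LETTERS]
        return set(deletes + transposes + replaces + inserts)

    def candidates(self, word):
        """Known word itself, else known one-edit neighbours, else the input."""
        return self.known([word]) or self.known(self.edits1(word)) or {word}

    def spell(self, word):
        """Candidates sorted from most to least probable (ties broken by word)."""
        return sorted(self.candidates(word), key=lambda w: (-self.prob(w), w))

    def correct(self, word):
        """The single most probable candidate."""
        return self.spell(word)[0]


checker = SimpleSpellChecker({"cat": 10, "car": 3, "cart": 1})
print(checker.spell("caz"))
```

Ranking by `prob` is the point of the sketch: with per-word frequencies, more common dictionary words are suggested first.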
Other improvements and optimizations
- Upgrade ULMFiT-related code to fastai 1.0 (Unable to follow the ThaiTokenizer document #136)
- Frequently used regular expressions are now precompiled [should be faster, need benchmark here] (Precompile frequently-used pattern/regexes (PAT_ENG, PAT_TCC, PAT_TWOCHARS) #124, Code cleaning + small optimization #133, Merge different soundex systems to one pythainlp.soundex module #138)
- Consolidate documentation files (Merging duplicated or closely related small documentation files together #128, Removed similar or confusing documentation files #129)
- Remove Python 2 compatibility code (deprecated in 1.7 - Deprecate Python 2 support #107) (Clean code, remove Python 2 compatibility code #134)
- Refactoring: reduce redundant and unused code, merge common code (Refactor tokenize code #125, Clean code (/corpus, /tools, /ulmfit) #132, Consistent naming and consolidate similar codes #146, Simplify bahttext() code #148, Number converters: convert more than one digit at a time #149)
- Remove temporary files, experiment files, and obsolete files (.gitignore to ignore Jupyter notebook checkpoints and macOS generated files #126, Remove obsoleted, unused, and experimental codes #140, Remove unused codes and temp files, update docs #143)
- More consistent indentation in source code
- Handling None, empty value, errors, and unexpected cases:
  - Check for None and empty values and return an appropriate value when necessary (rank and soundex handles empty or None case #151, etc.)
  - Raise ImportError if there is an import error, instead of calling sys.exit()
  - Functions like tokenize, summarize, etc. will always return something even if the specified engine is not found (they fall back to the default engine) (summarize: Small variable rename and handle engine not found case #131)
- More and improved examples (Move test folder out of main lib #122, Update examples and remove Jupyter notebook checkpoints #127)
- Improved test coverage with more test cases (Minor bug fixes + add test cases + update readme #147, More test cases - reached 80% coverage #156)
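The error-handling conventions in this list (a precompiled pattern, an appropriate return for None or empty input, ImportError instead of sys.exit(), and falling back to the default engine) could look roughly like the sketch below. All names here, including the engine names, `DEFAULT_ENGINE`, and the `hypothetical_ml_tokenizer` package, are made up for illustration and are not PyThaiNLP's actual code.

```python
import re

# Compiled once at import time instead of on every call.
_WORD_PATTERN = re.compile(r"\S+")

DEFAULT_ENGINE = "whitespace"  # hypothetical default engine name


def _whitespace_engine(text):
    """Split on runs of whitespace using the precompiled pattern."""
    return _WORD_PATTERN.findall(text)


def _char_engine(text):
    """Return every non-space character as its own token."""
    return [ch for ch in text if not ch.isspace()]


def _ml_engine(text):
    """Stand-in for an engine that needs an optional dependency."""
    try:
        import hypothetical_ml_tokenizer  # hypothetical package, not a real one
    except ImportError as exc:
        # Raise instead of calling sys.exit(), so the caller decides what to do.
        raise ImportError(
            "Engine 'ml' needs an optional package that is not installed."
        ) from exc
    return hypothetical_ml_tokenizer.tokenize(text)


_ENGINES = {
    "whitespace": _whitespace_engine,
    "char": _char_engine,
    "ml": _ml_engine,
}


def tokenize(text, engine=DEFAULT_ENGINE):
    """Tokenize text; unknown engine names fall back to the default engine."""
    if not text:  # covers both None and the empty string
        return []
    tokenizer = _ENGINES.get(engine, _ENGINES[DEFAULT_ENGINE])
    return tokenizer(text)
```

With this shape, `tokenize(None)` returns an empty list, an unknown engine name silently uses the default engine, and a missing optional package surfaces as an ImportError that the calling program can catch.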
Name changes in API
- Rearrangement of utility functions. Most of them, like rank, find_keyword, collate, and functions related to date and time, are now in the pythainlp.util module. (Utility functions: rearrange package locations + add thai_strftime() date and time formatter #160)
- Some class and function names have changed from 1.7 to align with PEP8 (Style Guide for Python Code), to be more explicit about what they do, or to be more consistent with other related classes/functions. For example:
  - thainer and thai2rom classes are now ThaiNameTagger and ThaiTransliterator (CapWords for class names)
  - pythainlp.soundex.LK82, pythainlp.soundex.Udom83, and pythainlp.MetaSound functions are now pythainlp.soundex.lk82, pythainlp.soundex.udom83, and pythainlp.soundex.metasound (lowercase for function names; metasound also moved to the soundex module)
  - collation, correction, and romanization functions are now collate, correct, and romanize, in a verb (action) form and in line with the tokenize and summarize functions.
  - pythainlp.corpus.alphabets, pythainlp.corpus.tone, etc. constants are now pythainlp.thai_consonants, pythainlp.thai_tonemarks, etc.
    - They are also now str instead of set.
    - This is to follow the example of string.ascii_letters, etc. A str also iterates a little faster in the one-character-per-member use cases that these constants are usually used for.
- These changes will break your code if it directly invokes those classes or functions. In general, the change is only at the level of the class or function name; the arguments passed to the class or function should not change. Please refer to the API doc.
- Internally, corpus files have also been renamed (Naming convention for consistency / how to name files #141), but this should not have any effect on the API.
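For code that used the old set constants, the move to str mostly matters for set operations. The sketch below uses a local stand-in constant, `THAI_TONEMARKS`, whose value (the four Thai tone marks, U+0E48 to U+0E4B) is an assumption for illustration; check the API doc for the real contents of pythainlp.thai_tonemarks. The helper function names are hypothetical as well.

```python
import string

# Local stand-in for a constant such as pythainlp.thai_tonemarks.
# Assumed value: the four Thai tone marks, U+0E48 to U+0E4B.
THAI_TONEMARKS = "\u0e48\u0e49\u0e4a\u0e4b"

# Same style as the standard library: string.ascii_letters is a plain str.
assert isinstance(string.ascii_letters, str)


def count_tonemarks(text: str) -> int:
    """Count tone marks; single-character membership works the same on str and set."""
    return sum(1 for ch in text if ch in THAI_TONEMARKS)


def strip_tonemarks(text: str) -> str:
    """Remove every tone mark from text."""
    return "".join(ch for ch in text if ch not in THAI_TONEMARKS)


def tonemarks_used(text: str) -> set:
    """Set operations now need an explicit set() around the str constant."""
    return set(THAI_TONEMARKS) & set(text)
```

One difference to keep in mind: `in` on a str is a substring test, so an empty string or a multi-character sequence can match a str constant where it would not match a set of single characters. Testing one character at a time, as above, avoids this.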