List PyThaiNLP 2.0

- Since there will be some API changes, this will be released as PyThaiNLP 2.0 (#154)
- Thai Text Classification Benchmarks https://github.com/PyThaiNLP/classification-benchmarks

## New evaluation corpus
- [Prachatai newspaper](https://github.com/PyThaiNLP/prachathai-67k), from https://prachatai.com - with news category tags
- [Wisesight-Sentiment Corpus](https://github.com/PyThaiNLP/wisesight-sentiment), Thai Social Media Sentiment Dataset
- [truevoice-intent](https://github.com/PyThaiNLP/truevoice-intent), Intent Dataset from Customer Service Phone Calls Transcribed by TrueVoice's Mari

## New features
- thai2fit 0.3 (formerly thai2vec) (started with commit 0a8a60d)
  - Use fastai 1.0.22
  - Pretrained model and inference will now use the same frozen set of words, for better accuracy
- ```pythainlp.transliterate.transliterate``` grapheme to phoneme (#139)
- New [```NorvigSpellChecker```](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/spell/pn.py) class - can be initialized with custom dictionary (#119, #137)
- ```pythainlp.util.thai_strftime``` for date and time formatting (use standard ```datetime.strftime``` directives) (#160)
- Installation options for extra dependency packages (#153, #157)
  - Run ```pip install pythainlp``` for the minimum dependencies, just enough to run the core functions of PyThaiNLP. Run ```pip install pythainlp[full]``` to install every package required for the extended functions (such as the machine-learned named-entity recognizer, which relies on keras).
- ```pythainlp.util.thaicheck``` - Thai check (#171)
- Add Orchid to Universal Dependencies https://github.com/PyThaiNLP/pythainlp/commit/166b6719166b0e163331309d0a48733f1d375364
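
To illustrate the idea behind a Norvig-style spell checker that is initialized with a custom dictionary, here is a minimal sketch using only the standard library. It is not the code in `pythainlp/spell/pn.py`: the class name `TinySpellChecker`, its constructor arguments, and the sample word frequencies are all hypothetical, and a Latin alphabet is used only to keep the example short.

```python
from collections import Counter


class TinySpellChecker:
    """Hypothetical, simplified Norvig-style checker (not PyThaiNLP's class)."""

    def __init__(self, word_freqs, alphabet):
        # word_freqs: custom dictionary mapping word -> frequency count
        self.freqs = Counter(word_freqs)
        self.total = sum(self.freqs.values()) or 1  # avoid division by zero
        self.alphabet = alphabet

    def prob(self, word):
        # Probability of a word = its count divided by the total count
        return self.freqs[word] / self.total

    def _edits1(self, word):
        # All strings that are one edit (delete, transpose, replace, insert) away
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [a + b[1:] for a, b in splits if b]
        transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
        replaces = [a + c + b[1:] for a, b in splits if b for c in self.alphabet]
        inserts = [a + c + b for a, b in splits for c in self.alphabet]
        return set(deletes + transposes + replaces + inserts)

    def known(self, words):
        # Keep only the words that appear in the custom dictionary
        return {w for w in words if w in self.freqs}

    def correct(self, word):
        # Prefer the word itself, then known words one edit away, else give up
        candidates = self.known([word]) or self.known(self._edits1(word)) or {word}
        return max(candidates, key=self.prob)


checker = TinySpellChecker(
    {"spelling": 10, "spewing": 2, "selling": 3},  # made-up frequencies
    "abcdefghijklmnopqrstuvwxyz",
)
suggestion = checker.correct("speling")
```

With the made-up frequencies above, both "spelling" and "spewing" are one edit away from "speling", and the more frequent one is chosen.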

## Bug fixes
- Fix [```metasound```](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/soundex/metasound.py) soundex to work as described in the Snae & Brückner (2009) paper. (#135)
- Fix Peter Norvig's spell checker probability of candidate words (#90)

## Other improvements and optimizations
- (Upgrade ULMFiT-related code to fastai 1.0) (#136)
- Frequently used regular expressions are now precompiled [should be faster, need benchmark here] (#124, #133, #138)
- Consolidate documentation files (#128, #129)
- Remove Python 2 compatibility code (deprecated in 1.7 - #107) (#134)
- Refactoring: reduce redundant and unused code, merged common code (#125, #132, #146, #148, #149)
- Remove temporary files, experiment files, and obsoleted files (#126, #140, #143)
- More consistent indentations in source code
- Handling None, empty value, errors, and unexpected cases:
  - Check for None and empty values and make appropriate return when necessary (#151, etc.)
  - Raise ```ImportError``` when an import fails, instead of calling ```sys.exit()```
  - Functions like ```tokenize```, ```summarize```, etc. will always return a result, even if the specified engine is not found (they fall back to the default engine) (#131)
- More and improved examples (#122, #127)
- Improved test coverage with more test cases (#147, #156)
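
The error-handling behaviour described in this list (empty-input checks, raising `ImportError` rather than exiting, and falling back to a default engine) can be sketched roughly as follows. This is an assumption-laden illustration, not PyThaiNLP's code: `load_optional`, the engine registry, and the two toy engines are hypothetical names invented for the example.

```python
import importlib


def load_optional(module_name):
    """Import an optional dependency; raise ImportError instead of exiting."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Let the caller decide what to do, rather than calling sys.exit()
        raise ImportError(
            f"{module_name} is required for this feature; "
            "try: pip install pythainlp[full]"
        ) from exc


# Toy engines standing in for real tokenizers (hypothetical)
def _whitespace_engine(text):
    return text.split()


def _char_engine(text):
    return list(text)


_ENGINES = {"whitespace": _whitespace_engine, "char": _char_engine}
_DEFAULT_ENGINE = "whitespace"


def tokenize(text, engine=_DEFAULT_ENGINE):
    # None or empty input: return an empty list instead of failing
    if not text:
        return []
    # Unknown engine name: fall back to the default engine
    func = _ENGINES.get(engine, _ENGINES[_DEFAULT_ENGINE])
    return func(text)
```

A caller can then rely on `tokenize` always returning a list, and can catch `ImportError` if an optional package is missing.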

## Name changes in API
- Rearrangement of utility functions. Most of them, like ```rank```, ```find_keyword```, ```collate```, and functions related to date and time, are now in ```pythainlp.util``` module. (#160)
- Some class and function names have changed from 1.7 to align them with [PEP8 (Style Guide for Python Code)](https://www.python.org/dev/peps/pep-0008/#names-to-avoid), to make them more explicit about what they do, or to make them more consistent with other related classes/functions. For example:
  - ```thainer``` and ```thai2rom``` classes are now ```ThaiNameTagger``` and ```ThaiTransliterator``` (CapWords for class name)
  - ```pythainlp.soundex.LK82```, ```pythainlp.soundex.Udom83```, and ```pythainlp.MetaSound``` functions are now ```pythainlp.soundex.lk82```, ```pythainlp.soundex.udom83```, and ```pythainlp.soundex.metasound``` (lowercase for function names; metasound also moves to the soundex module)
  - ```collation```, ```correction```, and ```romanization``` functions are now [```collate```](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/collation/__init__.py), [```correct```](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/spell/pn.py), and [```romanize```](https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/romanization/__init__.py) -- in a verb (action) form, and in line with the ```tokenize``` and ```summarize``` functions.
- ```pythainlp.corpus.alphabets```, ```pythainlp.corpus.tone```, etc. constants are now ```pythainlp.thai_consonants```, ```pythainlp.thai_tonemarks```, etc.
  - They are also now ```str``` instead of ```set```.
  - This follows the example of ```string.ascii_letters```, etc. A ```str``` also iterates slightly faster in the one-character-per-member use cases that these constants are usually used for.
- These changes will break your code if it directly invokes those classes/functions. In general, the change should be only at the level of the class or function name; there should be no change to the arguments passed to the class or function. Please refer to the API doc.
- Internally, there are also name changes of corpus files (#141), but this should not have any effect on the API.
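
As a rough illustration of character constants that are plain ```str``` values in the style of ```string.ascii_letters```, the sketch below builds local stand-ins from Unicode code points rather than importing the real constants. The values here are assumptions made for the example and may differ from what ```pythainlp.thai_consonants``` and ```pythainlp.thai_tonemarks``` actually contain; ```count_chars``` is a hypothetical helper.

```python
import string

# Stand-in constants (assumed values, built from the Thai Unicode block):
# consonants U+0E01..U+0E2E, skipping U+0E24 and U+0E26
thai_consonants = "".join(
    chr(c) for c in range(0x0E01, 0x0E2F) if c not in (0x0E24, 0x0E26)
)
# tone marks U+0E48..U+0E4B
thai_tonemarks = "".join(chr(c) for c in range(0x0E48, 0x0E4C))


def count_chars(text, charset):
    """Count characters of text that are members of charset (a str)."""
    return sum(1 for ch in text if ch in charset)


# Same usage pattern as the standard library's str constants
n_ascii = count_chars("PyThaiNLP 2.0", string.ascii_letters)
n_consonants = count_chars("ภาษาไทย", thai_consonants)

# Code that relied on set operations can wrap the constant explicitly
consonant_set = set(thai_consonants)
```

Membership tests with `in` work the same way on a `str` as on a `set` for single characters, so most calling code should not need to change.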
