Merge pull request #157 from bact/dev

bact · web-flow · commit 8ab13d2e1634 · 2018-11-13T20:32:10.000+07:00
Add artagger installation workaround for Windows tests (AppVeyor) and add unit tests
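
For reviewers, the consolidation of the royin normalization patterns into a single compiled alternation can be tried in isolation with the sketch below. It is a standalone illustration, not the module itself: the pattern is copied from the `_RE_NORMALIZE` added in `pythainlp/transliterate/royin.py`, while the public name `normalize` and the sample inputs are assumptions made for this example.

```python
import re

# Pattern copied from the diff: karan clusters first, then a class of
# single signs (paiyannoi, maiyamok, tone marks, thanthakhat, others).
_RE_NORMALIZE = re.compile(
    r"จน์|มณ์|ณฑ์|ทร์|ตร์|[ก-ฮ]์|[ก-ฮ][ะ-ู]์"
    r"|[\u0e2f\u0e46\u0e48\u0e49\u0e4a\u0e4b\u0e4c\u0e4d\u0e4e\u0e4f\u0e5a\u0e5b]"
)


def normalize(text):
    """Strip silent characters and tone marks in a single substitution pass."""
    # Hypothetical stand-in for the private _normalize() in royin.py.
    return _RE_NORMALIZE.sub("", text)
```

Because the scan proceeds left to right, a cluster such as `ทร์` is matched at its first consonant and removed whole, before the single-sign class can match its thanthakhat. A thanthakhat that follows a non-Thai character, as in the updated `test_romanize` input, loses only the sign itself.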
diff --git a/.travis.yml b/.travis.yml
@@ -13,7 +13,7 @@ before_install:
 # command to install dependencies, e.g. pip install -r requirements.txt --use-mirrors
 install:
   - pip install -r requirements.txt
-  - pip install .[icu,ipa,ner,thai2vec]
+  - pip install .[artagger,icu,ipa,ner,thai2vec]
   - pip install coveralls
 
 os:
diff --git a/README.md b/README.md
@@ -51,19 +51,21 @@ $ pip install pythainlp[extra1,extra2,...]
 ```
 
 where ```extras``` can be
-  - ```artagger``` (to support artagger part-of-speech tagger)
+  - ```artagger``` (to support artagger part-of-speech tagger)*
   - ```deepcut``` (to support deepcut machine-learnt tokenizer)
   - ```icu``` (for ICU support in transliteration and tokenization)
   - ```ipa``` (for International Phonetic Alphabet support in transliteration)
-  - ```ml``` (to support ULMFit models, like one for sentiment analyser)
+  - ```ml``` (to support ULMFiT models, like one for sentiment analyser)
   - ```ner``` (for named-entity recognizer)
   - ```thai2rom``` (for machine-learnt romanization)
   - ```thai2vec``` (for Thai word vector)
   - ```full``` (install everything)
 
-see ```extras``` and ```extras_require``` in [```setup.py```](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py) for details.
+* Note: the standard ```artagger``` package from PyPI will not work on Windows; please ```pip install https://github.com/wannaphongcom/artagger/tarball/master#egg=artagger``` instead.
 
-Development release:
+** see ```extras``` and ```extras_require``` in [```setup.py```](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py) for package details.
+
+### Development release:
 
 ```sh
 $ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
@@ -94,7 +96,7 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนเพื่
 
 > เพราะโลกขับเคลื่อนต่อไปด้วยการแบ่งปัน
 
-รองรับ Python 3.4 ขึ้นไป
+รองรับ Python 3.6 ขึ้นไป
 
 ตั้งแต่รุ่น 1.7 PyThaiNLP จะเลิกสนับสนุน Python 2 (บางฟังก์ชันอาจยังทำงานได้ แต่จะไม่ได้รับการสนับสนุน) และตั้งแต่รุ่น 1.8 จะยุติการรองรับ Python 2 ทั้งหมด
 ผู้ใช้ Python 2 ยังสามารถใช้ PyThaiNLP 1.6 ได้
@@ -117,19 +119,39 @@ PyThaiNLP เป็นไลบารีภาษาไพทอนเพื่
 
 ## ติดตั้ง
 
-รุ่นเสถียร
+### รุ่นเสถียร
 
 ```sh
 $ pip install pythainlp
 ```
 
-รุ่นกำลังพัฒนา
+สำหรับความสามารถเพิ่มเติมบางอย่าง เช่น word vector จำเป็นต้องติดตั้งแพคเกจสนับสนุนเพิ่มเติม ติดตั้งแพคเกจเหล่านั้นได้ ด้วยการระบุออปชันเหล่านี้ตอน pip install:
 
 ```sh
-$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
+$ pip install pythainlp[extra1,extra2,...]
 ```
 
-หมายเหตุ: เนื่องจาก ulmfit sentiment analyser ต้องใช้ PyTorch จึงต้อง ```pip install torch``` เพื่อติดตั้ง PyTorch ก่อน มอดูลที่อาศัยการเรียนรู้ของเครื่องอื่นๆ อาจจำเป็นต้องติดตั้ง gensim และ keras ก่อนเช่นกัน
+โดยที่ ```extras``` คือ
+  - ```artagger``` (สำหรับตัวติดป้ายกำกับชนิดคำ artagger)*
+  - ```deepcut``` (สำหรับตัวตัดคำ deepcut)
+  - ```icu``` (สำหรับการถอดตัวสะกดเป็นสัทอักษรและการตัดคำด้วย ICU)
+  - ```ipa``` (สำหรับการถอดตัวสะกดเป็นสัทอักษรสากล (IPA))
+  - ```ml``` (สำหรับการรองรับโมเดล ULMFiT ซึ่งใช้ในฟังก์ชันเช่นการวิเคราะห์อารมณ์)
+  - ```ner``` (สำหรับการติดป้ายชื่อเฉพาะ (named-entity))
+  - ```thai2rom``` (สำหรับการถอดตัวสะกดเป็นอักษรละติน)
+  - ```thai2vec``` (สำหรับ word vector)
+  - ```full``` (ติดตั้งทุกอย่าง)
+
+* หมายเหตุ: แพคเกจ ```artagger``` มาตรฐานจาก PyPI อาจมีปัญหาการถอดรหัสข้อความบน Windows กรุณาติดตั้ง artagger รุ่นแก้ไขด้วยคำสั่ง ```pip install https://github.com/wannaphongcom/artagger/tarball/master#egg=artagger``` แทน ก่อนจะติดตั้ง PyThaiNLP
+
+** นักพัฒนาสามารถดู ```extras``` และ ```extras_require``` ใน [```setup.py```](https://github.com/PyThaiNLP/pythainlp/blob/dev/setup.py) สำหรับรายละเอียดแพคเกจของเสริม
+
+
+### รุ่นกำลังพัฒนา
+
+```sh
+$ pip install https://github.com/PyThaiNLP/pythainlp/archive/dev.zip
+```
 
 ## เอกสารการใช้งาน
 
diff --git a/appveyor.yml b/appveyor.yml
@@ -5,22 +5,23 @@ environment:
     - PYTHON: "C:/Python36"
       PYTHON_VERSION: "3.6"
       PYTHON_ARCH: "32"
-      PYICU_WHEEL: "https://get.openlp.org/win-sdk/PyICU-1.9.7-cp36-cp36m-win32.whl"
+      PYICU_PKG: "https://get.openlp.org/win-sdk/PyICU-1.9.7-cp36-cp36m-win32.whl"
+      ARTAGGER_PKG: "https://github.com/wannaphongcom/artagger/tarball/master#egg=artagger"
 
     #- PYTHON: "C:/Python36-x64"
     #  PYTHON_VERSION: "3.6"
     #  PYTHON_ARCH: "64"
-    #  PYICU_WHEEL: ""
+    #  PYICU_PKG: ""
 
     #- PYTHON: "C:/Python37"
     #  PYTHON_VERSION: "3.7"
     #  PYTHON_ARCH: "32"
-    #  PYICU_WHEEL: ""
+    #  PYICU_PKG: ""
 
     #- PYTHON: "C:/Python37-x64"
     #  PYTHON_VERSION: "3.7"
     #  PYTHON_ARCH: "64"
-    #  PYICU_WHEEL: ""
+    #  PYICU_PKG: ""
 
 init:
   - "ECHO %PYTHON% %PYTHON_VERSION% %PYTHON_ARCH%"  
@@ -31,8 +32,9 @@ install:
   - "set PYTHONIOENCODING=utf-8"
   # - "set ICU_VERSION=62"
   - "%PYTHON%/python.exe -m pip install --upgrade pip"
-  - "%PYTHON%/python.exe -m pip install %PYICU_WHEEL%"
-  - "%PYTHON%/python.exe -m pip install -e .[icu,ipa,ner,thai2vec]"
+  - "%PYTHON%/python.exe -m pip install %PYICU_PKG%"
+  - "%PYTHON%/python.exe -m pip install %ARTAGGER_PKG%"
+  - "%PYTHON%/python.exe -m pip install -e .[artagger,icu,ipa,ner,thai2vec]"
 
 test_script:
   - "%PYTHON%/python.exe -m pip --version"
diff --git a/pythainlp/tag/__init__.py b/pythainlp/tag/__init__.py
@@ -3,14 +3,12 @@
 Part-Of-Speech tagger
 """
 
-_ARTAGGER_URL = "https://github.com/wannaphongcom/artagger/archive/master.zip"
-
 
 def pos_tag(words, engine="unigram", corpus="orchid"):
     """
     Part of Speech tagging function.
 
-    :param list words: takes in a list of tokenized words (put differently, a list of strings)
+    :param list words: a list of tokenized words
     :param str engine:
         * unigram - unigram tagger (default)
         * perceptron - perceptron tagger
diff --git a/pythainlp/tag/perceptron.py b/pythainlp/tag/perceptron.py
@@ -29,6 +29,7 @@ def tag(words, corpus="pud"):
     if not words:
         return []
 
+    # perceptron tagger cannot handle empty string
     words = [word.strip() for word in words if word.strip()]
 
     if corpus == "orchid":
diff --git a/pythainlp/transliterate/royin.py b/pythainlp/transliterate/royin.py
@@ -110,22 +110,16 @@
 }
 
 _RE_CONSONANT = re.compile(r"[ก-ฮ]")
-_RE_KARAN = re.compile(r"จน์|มณ์|ณฑ์|ทร์|ตร์|[ก-ฮ]์|[ก-ฮ][ะ-ู]์")
-_RE_KARAN2 = re.compile(r"\w" + r"์")
-_RE_YAMOK_PAIYANNOI = re.compile(r"[ๆฯ]")
-_RE_TONE = re.compile(r"[่-๋]")
+_RE_NORMALIZE = re.compile(
+    r"จน์|มณ์|ณฑ์|ทร์|ตร์|[ก-ฮ]์|[ก-ฮ][ะ-ู]์"
+    # yamok, paiyannoi, thanthakhat, yamakkan, tonemarks, other signs
+    + r"|[\u0e2f\u0e46\u0e48\u0e49\u0e4a\u0e4b\u0e4c\u0e4d\u0e4e\u0e4f\u0e5a\u0e5b]"
+)
 
 
 def _normalize(text):
     """ตัดอักษรที่ไม่ออกเสียง (การันต์ ไปยาลน้อย ไม้ยมก*) และวรรณยุกต์ทิ้ง"""
-    text = _RE_KARAN.sub("", text)
-    text = _RE_YAMOK_PAIYANNOI.sub("", text)
-    text = _RE_TONE.sub("", text)
-    if re.search(_RE_KARAN2, text):
-        karans = re.findall(_RE_KARAN2, text)
-        for karan in karans:
-            text = re.sub(karan, "", text)
-    return text
+    return _RE_NORMALIZE.sub("", text)
 
 
 def _replace_vowels(word):
@@ -173,11 +167,14 @@ def romanize(word):
         return ""
 
     word2 = _replace_vowels(_normalize(word))
-    res = re.findall(_RE_CONSONANT, word2)
+    res = _RE_CONSONANT.findall(word2)
+
     # 2-character word, all consonants
     if len(word2) == 2 and len(res) == 2:
         word2 = list(word2)
         word2.insert(1, "o")
         word2 = "".join(word2)
+
     word2 = _replace_consonants(word2, res)
+
     return word2
diff --git a/pythainlp/ulmfit/utils.py b/pythainlp/ulmfit/utils.py
@@ -197,11 +197,11 @@ def merge_wgts(em_sz, wgts, itos_pre, itos_cls):
 # feature extractor
 def document_vector(ss, m, stoi, tok_engine="newmm"):
     """
-    :meth: `document_vector` get document vector using pretrained ULMFit model
+    :meth: `document_vector` get document vector using pretrained ULMFiT model
     :param str ss: sentence to extract embeddings
     :param m: pyTorch model
     :param dict stoi: string-to-integer dict e.g. {'_unk_':0, '_pad_':1,'first_word':2,'second_word':3,...}
-    :param str tok_engine: tokenization engine (recommend using `newmm` if you are using pretrained ULMFit model)
+    :param str tok_engine: tokenization engine (recommend using `newmm` if you are using pretrained ULMFiT model)
     :return: `numpy.array` of document vector sized 300
     """
     s = word_tokenize(ss)
diff --git a/pythainlp/word_vector/thai2vec.py b/pythainlp/word_vector/thai2vec.py
@@ -69,7 +69,7 @@ def about():
     return """
     thai2vec
     State-of-the-Art Language Modeling, Text Feature Extraction and Text Classification in Thai Language.
-    Created as part of pyThaiNLP with ULMFit implementation from fast.ai
+    Created as part of PyThaiNLP with ULMFiT implementation from fast.ai
 
     Development: Charin Polpanumas
     GitHub: https://github.com/cstorm125/thai2vec
diff --git a/setup.cfg b/setup.cfg
@@ -7,8 +7,8 @@ tag = True
 description-file = README.md
 
 [bumpversion:file:setup.py]
-search = version='{current_version}'
-replace = version='{new_version}'
+search = version = '{current_version}'
+replace = version = '{new_version}'
 
 [bumpversion:file:pythainlp/__init__.py]
 search = __version__ = '{current_version}'
diff --git a/tests/__init__.py b/tests/__init__.py
@@ -314,15 +314,14 @@ def test_pos_tag(self):
         self.assertEqual(pos_tag(None), [])
         self.assertEqual(pos_tag([]), [])
 
-        self.assertIsNotNone(pos_tag(tokens, engine="unigram", corpus="orchid"))
-        self.assertIsNotNone(pos_tag(tokens, engine="unigram", corpus="pud"))
-        self.assertIsNotNone(pos_tag([""], engine="unigram", corpus="pud"))
-
         self.assertEqual(unigram.tag(None, corpus="pud"), [])
         self.assertEqual(unigram.tag([], corpus="pud"), [])
         self.assertEqual(unigram.tag(None, corpus="orchid"), [])
         self.assertEqual(unigram.tag([], corpus="orchid"), [])
 
+        self.assertIsNotNone(pos_tag(tokens, engine="unigram", corpus="orchid"))
+        self.assertIsNotNone(pos_tag(tokens, engine="unigram", corpus="pud"))
+        self.assertIsNotNone(pos_tag([""], engine="unigram", corpus="pud"))
         self.assertEqual(
             pos_tag(word_tokenize("คุณกำลังประชุม"), engine="unigram"),
             [("คุณ", "PPRS"), ("กำลัง", "XVBM"), ("ประชุม", "VACT")],
@@ -335,8 +334,13 @@ def test_pos_tag(self):
         self.assertEqual(perceptron.tag(None, corpus="orchid"), [])
         self.assertEqual(perceptron.tag([], corpus="orchid"), [])
 
-        # self.assertIsNotNone(pos_tag(tokens, engine="artagger", corpus="orchid"))
-        # self.assertIsNotNone(pos_tag(tokens, engine="artagger", corpus="pud"))
+        self.assertIsNotNone(pos_tag(None, engine="artagger"))
+        self.assertIsNotNone(pos_tag([], engine="artagger"))
+        self.assertIsNotNone(pos_tag(tokens, engine="artagger"))
+        self.assertEqual(
+            pos_tag(word_tokenize("คุณกำลังประชุม"), engine="artagger"),
+            [("คุณ", "PPRS"), ("กำลัง", "XVBM"), ("ประชุม", "VACT")],
+        )
 
         self.assertEqual(pos_tag_sents(None), [])
         self.assertEqual(pos_tag_sents([]), [])
@@ -502,7 +506,7 @@ def test_romanize(self):
         self.assertIsNotNone(romanize("ทีปกร", engine="royin"))
         self.assertIsNotNone(romanize("กรม", engine="royin"))
         self.assertIsNotNone(romanize("ธรรพ์", engine="royin"))
-        self.assertIsNotNone(romanize("กฏa์", engine="royin"))
+        self.assertIsNotNone(romanize("กฏa์1์ ์", engine="royin"))
         # self.assertIsNotNone(romanize("บัว", engine="thai2rom"))
 
     def test_transliterate(self):