Merged
Commits
80 commits
975f91d
Add Translate code
wannaphong Jun 24, 2020
5299b7f
Update core.py
wannaphong Jun 24, 2020
40bd2c8
Update core.py
wannaphong Jun 24, 2020
a0423b6
Update code
wannaphong Jun 24, 2020
fbd1497
Update core.py
wannaphong Jun 24, 2020
dba2a3d
Update core.py
wannaphong Jun 24, 2020
a523430
Update core.py
wannaphong Jun 24, 2020
fdf3f99
Update PEP8
wannaphong Jun 24, 2020
88dade6
Update PEP8
wannaphong Jun 24, 2020
f32420c
Update PEP8
wannaphong Jun 24, 2020
c7d9e76
Update core.py
wannaphong Jun 24, 2020
99ddace
Add requirements
wannaphong Jun 26, 2020
de1cd18
Code formatting
bact Jun 26, 2020
e235e8c
Add translate test
wannaphong Jun 27, 2020
fefa0d3
Update code
wannaphong Jun 27, 2020
e5c22d4
Update model name
wannaphong Jun 27, 2020
e991393
Update core.py
wannaphong Jun 27, 2020
d6ba102
Update path
wannaphong Jun 27, 2020
5707be4
Fixed path bug
wannaphong Jun 27, 2020
a5aaa88
move translate code to core.py
wannaphong Jun 27, 2020
330699b
Fixed test
wannaphong Jun 27, 2020
e798d33
Update pythainlp-test.yml
wannaphong Jun 27, 2020
9b05d6d
del old file
wannaphong Jun 27, 2020
a83077f
Fix PEP8
wannaphong Jun 28, 2020
5ca50c8
Add pythainlp.translate docs
wannaphong Jul 18, 2020
3500451
Update core.py
wannaphong Jul 18, 2020
029ced7
Update core.py
wannaphong Aug 1, 2020
afcf490
Update translate.rst
wannaphong Aug 1, 2020
0d50f94
Update core.py
wannaphong Aug 1, 2020
6389e6b
Update core.py
wannaphong Aug 1, 2020
f7edd66
Update core.py
wannaphong Aug 1, 2020
7b7f821
Fix PEP8
wannaphong Aug 1, 2020
7d409c7
Merge branch 'dev' into add-translate
wannaphong Aug 22, 2020
e7a39b5
Update core.py
wannaphong Aug 22, 2020
caca842
Update core.py
wannaphong Aug 22, 2020
9bd9c06
Update __init__.py
wannaphong Aug 22, 2020
083d416
Add en2th bpe2bep
wannaphong Aug 23, 2020
0599138
Update core.py
wannaphong Aug 23, 2020
6a233fc
Update core.py
wannaphong Aug 23, 2020
0262961
Update core.py
wannaphong Aug 23, 2020
75a173e
Update code
wannaphong Aug 23, 2020
312b22d
Update core.py
wannaphong Aug 23, 2020
a3ca284
Update core.py
wannaphong Aug 23, 2020
c115e4a
Update core.py
wannaphong Aug 23, 2020
1a9c400
Update core.py
wannaphong Aug 23, 2020
02e5186
Add version number to setup,py
bact Aug 23, 2020
960756c
Missing semicolon
bact Sep 7, 2020
3ff13e7
Update core.py
wannaphong Sep 20, 2020
89723e3
fixed th-en
wannaphong Sep 23, 2020
2847924
Update core.py
wannaphong Sep 23, 2020
c20d76d
Update core.py
wannaphong Sep 23, 2020
6a6ccac
Update core.py
wannaphong Sep 23, 2020
16009f5
Remove duplicated test cases
bact Oct 17, 2020
726e42b
Refactor core.py
bact Oct 31, 2020
2f2668f
Update core.py
wannaphong Dec 27, 2020
b229bce
fixed en2th translate
wannaphong Dec 27, 2020
30080d2
Update core.py
wannaphong Dec 27, 2020
e42827a
Update core.py
wannaphong Dec 27, 2020
22ab572
Update core.py
wannaphong Dec 27, 2020
1d82bee
Update core.py
wannaphong Dec 27, 2020
1392b83
Update core.py
wannaphong Dec 27, 2020
e94f8ac
Delete pythainlp-test.yml
wannaphong Dec 27, 2020
7e8d6c9
Update to fairseq>=0.10.0
wannaphong Dec 27, 2020
4efcc25
Update AppVeyor environment to Visual Studio 2019
bact Dec 28, 2020
b834671
Merge branch 'dev' into add-translate
bact Dec 28, 2020
a4eda02
Update appveyor.yml
bact Dec 28, 2020
b362e99
Update fairseq version
bact Dec 28, 2020
8b8cd21
Update appveyor.yml
bact Dec 28, 2020
f7b5947
Update appveyor.yml
wannaphong Dec 28, 2020
6e5da9d
Update appveyor.yml
wannaphong Dec 28, 2020
7202005
Update appveyor.yml
wannaphong Dec 28, 2020
829226f
Update appveyor.yml
wannaphong Dec 28, 2020
701dc76
Update appveyor.yml
wannaphong Dec 28, 2020
ad3f55a
Update appveyor.yml
wannaphong Dec 28, 2020
5213834
Update appveyor.yml
wannaphong Dec 28, 2020
9085187
Update PyTorch
wannaphong Dec 28, 2020
e5d9791
Update appveyor.yml
wannaphong Dec 28, 2020
d0d2119
Update appveyor.yml
wannaphong Dec 28, 2020
4671eae
Close python 3.8 appveyor
wannaphong Dec 28, 2020
346fabc
Remove unused imports
bact Dec 28, 2020
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -34,7 +34,7 @@ jobs:
pip install pytest coverage coveralls
if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
pip install "h5py>=2.10.0,<3" "tensorflow>=2.3.1,<3"
- pip install torch==1.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
+ pip install torch==1.7.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
pip install deepcut
pip install .[full]
- name: Test
85 changes: 44 additions & 41 deletions appveyor.yml
@@ -9,7 +9,7 @@
# environment configuration #
#---------------------------------#

image: Visual Studio 2017
image: Visual Studio 2019

# scripts that are called at very beginning, before repo cloning
init:
@@ -32,17 +32,17 @@ init:
- "ECHO Python %PYTHON_VERSION% (%PYTHON_ARCH%bit) from %PYTHON%"
- ECHO %PYTHONIOENCODING%
- ECHO %ICU_VERSION%
# - ECHO "Installed SDKs:"
# - ps: "ls C:/Python*"
# - ps: "ls \"C:/Program Files (x86)/Microsoft SDKs/Windows\""
- ECHO "Installed SDKs:"
- ps: "ls C:/Python*"
- ps: "ls \"C:/Program Files (x86)/Microsoft SDKs/Windows\""

# fetch repository as zip archive
# https://www.appveyor.com/docs/how-to/repository-shallow-clone/
shallow_clone: true

environment:
global:
APPVEYOR_SAVE_CACHE_ON_ERROR: true
APPVEYOR_SAVE_CACHE_ON_ERROR: false
APPVEYOR_SKIP_FINALIZE_ON_EXIT: true
CMD_IN_ENV: "cmd /E:ON /V:ON /C .\\appveyor\\run_with_env.cmd"
PYTHONIOENCODING: "utf-8"
@@ -56,7 +56,7 @@ environment:
# PYTHON_ARCH: "32"
# PYICU_PKG: "https://www.dropbox.com/s/pahorbq29y9cura/PyICU-2.3.1-cp36-cp36m-win32.whl?dl=1"

- PYTHON: "C:/Python36-x64"
- PYTHON: "C:\\Miniconda36-x64"
PYTHON_VERSION: "3.6"
PYTHON_ARCH: "64"
PYICU_PKG: "https://www.dropbox.com/s/7t0rrxwckqbgivi/PyICU-2.3.1-cp36-cp36m-win_amd64.whl?dl=1"
@@ -66,39 +66,42 @@ environment:
# PYTHON_ARCH: "32"
# PYICU_PKG: "https://www.dropbox.com/s/3xwdnwhdcu619x4/PyICU-2.3.1-cp37-cp37m-win32.whl?dl=1"

- PYTHON: "C:/Python37-x64"
PYTHON_VERSION: "3.7"
PYTHON_ARCH: "64"
PYICU_PKG: "https://www.dropbox.com/s/le5dckc3231opqt/PyICU-2.3.1-cp37-cp37m-win_amd64.whl?dl=1"
# - PYTHON: "C:/Python37-x64"
# PYTHON_VERSION: "3.7"
# PYTHON_ARCH: "64"
# PYICU_PKG: "https://www.dropbox.com/s/le5dckc3231opqt/PyICU-2.3.1-cp37-cp37m-win_amd64.whl?dl=1"

# - PYTHON: "C:/Python38-x64"
# - PYTHON: "C:\\Miniconda38-x64"
# PYTHON_VERSION: "3.8"
# PYTHON_ARCH: "64"
# PYICU_PKG: "https://www.dropbox.com/s/o6p2sj5z50iim1e/PyICU-2.3.1-cp38-cp38-win_amd64.whl?dl=0"
# PYICU_PKG: "https://www.dropbox.com/s/o6p2sj5z50iim1e/PyICU-2.3.1-cp38-cp38-win_amd64.whl?dl=1"

matrix:
fast_finish: true

cache:
- "%LOCALAPPDATA%/pip/Cache"
- "%APPDATA%/nltk_data"
#cache:
# - "%LOCALAPPDATA%/pip/Cache"
# - "%APPDATA%/nltk_data"
# - "%LOCALAPPDATA%/pythainlp-data"

install:
- chcp 65001
- "SET PATH=%PYTHON%;%PYTHON%\\Scripts;%PATH%"
# - '"C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\vcvarsall.bat" %PLATFORM%'
- '"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat" %PLATFORM%'
# - '"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" %PLATFORM%'
# - '"C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Auxiliary\Build\vcvarsall.bat" %PLATFORM%'
- '"C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Auxiliary\Build\vcvarsall.bat" %PLATFORM%'
- ps: if (-not(Test-Path($env:PYTHON))) { & appveyor\install.ps1 }
- SET PATH=%PYTHON%;%PYTHON%/Scripts;%PATH%
# - ECHO %PATH%
- ECHO %PATH%
- python --version
- python -m pip install --disable-pip-version-check --user --upgrade pip setuptools
- pip --version
- pip install -U "h5py>=2.10.0,<3" "tensorflow>=2.3.1,<3" deepcut
- pip install torch==1.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
- pip install %PYICU_PKG%
- pip install -e .[full]
- python -m pip --version
- python -m pip install pyyaml
- python -m pip install -U "h5py>=2.10.0,<3" "tensorflow>=2.3.1,<3" deepcut
- python -m pip install %PYICU_PKG%
- conda install -y -c conda-forge fairseq
- conda remove --force -y pytorch
- python -m pip install torch==1.7.1+cpu -f https://download.pytorch.org/whl/torch_stable.html
- python -m pip install -e .[full]

#---------------------------------#
# build configuration #
@@ -121,20 +124,20 @@ test_script:
# global handlers #
#---------------------------------#

on_success:
# Remove old or huge cache files to hopefully not exceed the 1GB cache limit.
#
# If the cache limit is reached, the cache will not be updated (of not even
# created in the first run). So this is a trade of between keeping the cache
# current and having a cache at all.
# NB: This is done only `on_success` since the cache in uploaded only on
# success anyway.
# Note: Cygwin is not available on Visual Studio 2019, can try Msys2.
- "ECHO Remove old or huge cache"
- C:\cygwin\bin\find "%LOCALAPPDATA%/pip" -type f -mtime +360 -delete
- C:\cygwin\bin\find "%LOCALAPPDATA%/pip" -type f -size +50M -delete
- C:\cygwin\bin\find "%LOCALAPPDATA%/pip" -empty -delete
# Show size of cache
- C:\cygwin\bin\du -hs "%LOCALAPPDATA%/pip/Cache"
- C:\cygwin\bin\du -hs "%APPDATA%/nltk_data"
- C:\cygwin\bin\du -hs "%LOCALAPPDATA%/pythainlp-data"
#on_success:
# # Remove old or huge cache files to hopefully not exceed the 1GB cache limit.
# #
# # If the cache limit is reached, the cache will not be updated (of not even
# # created in the first run). So this is a trade of between keeping the cache
# # current and having a cache at all.
# # NB: This is done only `on_success` since the cache in uploaded only on
# # success anyway.
# # Note: Cygwin is not available on Visual Studio 2019, can try Msys2.
# - "ECHO Remove old or huge cache"
# - C:\cygwin\bin\find "%LOCALAPPDATA%/pip" -type f -mtime +360 -delete
# - C:\cygwin\bin\find "%LOCALAPPDATA%/pip" -type f -size +50M -delete
# - C:\cygwin\bin\find "%LOCALAPPDATA%/pip" -empty -delete
# # Show size of cache
# - C:\cygwin\bin\du -hs "%LOCALAPPDATA%/pip/Cache"
# - C:\cygwin\bin\du -hs "%APPDATA%/nltk_data"
# - C:\cygwin\bin\du -hs "%LOCALAPPDATA%/pythainlp-data"
10 changes: 10 additions & 0 deletions docs/api/translate.rst
@@ -0,0 +1,10 @@
.. currentmodule:: pythainlp.translate

pythainlp.translate
===================
The :mod:`pythainlp.translate` module provides language translation.

Modules
-------

.. autofunction:: translate
2 changes: 1 addition & 1 deletion pythainlp/corpus/core.py
@@ -316,7 +316,7 @@ def download(
local_db = TinyDB(corpus_db_path())
query = Query()

- corpus = corpus_db[name.lower()]
+ corpus = corpus_db[name]
print("Corpus:", name)
if version is None:
for v in corpus["versions"]:
11 changes: 11 additions & 0 deletions pythainlp/translate/__init__.py
@@ -0,0 +1,11 @@
# -*- coding: utf-8 -*-
"""
Language translation.
"""

__all__ = [
    "translate",
    "download_model_all"
]

from pythainlp.translate.core import translate, download_model_all
127 changes: 127 additions & 0 deletions pythainlp/translate/core.py
@@ -0,0 +1,127 @@
# -*- coding: utf-8 -*-
import os
import tarfile
from collections import defaultdict

from pythainlp.corpus import download, get_corpus_path
from pythainlp.tools import get_full_data_path, get_pythainlp_data_path

from fairseq.models.transformer import TransformerModel
from sacremoses import MosesTokenizer

_en_tokenizer = MosesTokenizer("en")

_model = None
_model_name = None

# SCB_1M-MT_OPUS+TBASE_en-th_moses-spm_130000-16000_v1.0.tar.gz
_EN_TH_FILE_NAME = (
    "SCB_1M-MT_OPUS+TBASE_en-th_moses-spm_130000-16000_v1.0"
)
# SCB_1M-MT_OPUS+TBASE_th-en_spm-spm_32000-joined_v1.0.tar.gz
_TH_EN_FILE_NAME = "SCB_1M-MT_OPUS+TBASE_th-en_spm-spm_32000-joined_v1.0"


def _download_install(name):
    if get_corpus_path(name) is None:
        download(name, force=True, version="1.0")
        tar = tarfile.open(get_corpus_path(name), "r:gz")
        tar.extractall()
        tar.close()
    if not os.path.exists(get_full_data_path(name)):
        os.mkdir(get_full_data_path(name))
        with tarfile.open(get_corpus_path(name)) as tar:
            tar.extractall(path=get_full_data_path(name))


def download_model_all() -> None:
    """
    Download both translation models (th-en and en-th).
    """
    _download_install("scb_1m_th-en_spm")
    _download_install("scb_1m_en-th_moses")


def _get_translate_path(model: str, *path: str) -> str:
    return os.path.join(get_full_data_path(model), *path)


def _scb_en_th_model_init():
    global _model, _model_name

    if _model_name != "scb_1m_en-th_moses":
        del _model
        _model_name = "scb_1m_en-th_moses"
        _download_install(_model_name)
        _model = TransformerModel.from_pretrained(
            model_name_or_path=_get_translate_path(
                _model_name, _EN_TH_FILE_NAME, "models",
            ),
            checkpoint_file="checkpoint.pt",
            data_name_or_path=_get_translate_path(
                _model_name, _EN_TH_FILE_NAME, "vocab",
            ),
        )


def _scb_en_th_translate(text: str) -> str:
    global _model, _model_name

    _scb_en_th_model_init()

    tokens = " ".join(_en_tokenizer.tokenize(text))
    translated = _model.translate(tokens)
    return translated.replace(' ', '').replace('▁', ' ').strip()


def _scb_th_en_model_init():
    global _model, _model_name

    if _model_name != "scb_1m_th-en_spm":
        del _model
        _model_name = "scb_1m_th-en_spm"
        _download_install(_model_name)
        _model = TransformerModel.from_pretrained(
            model_name_or_path=_get_translate_path(
                _model_name, _TH_EN_FILE_NAME, "models",
            ),
            checkpoint_file="checkpoint.pt",
            data_name_or_path=_get_translate_path(
                _model_name, _TH_EN_FILE_NAME, "vocab",
            ),
            bpe="sentencepiece",
            sentencepiece_model=_get_translate_path(
                _model_name, _TH_EN_FILE_NAME, "bpe", "spm.th.model",
            ),
        )


def _scb_th_en_translate(text: str) -> str:
    global _model, _model_name

    _scb_th_en_model_init()

    return _model.translate(text)


def translate(text: str, source: str, target: str) -> str:
    """
    Translate text between Thai and English.

    :param str text: input text in source language
    :param str source: source language ("en" or "th")
    :param str target: target language ("en" or "th")

    :return: translated text in target language
    :rtype: str
    """
    translated = None

    if source == "th" and target == "en":
        translated = _scb_th_en_translate(text)
    elif source == "en" and target == "th":
        translated = _scb_en_th_translate(text)
    else:
        # Raise (not return) so callers never receive an exception object
        # in place of a translated string.
        raise ValueError("The combination of the arguments isn't allowed.")

    return translated
23 changes: 13 additions & 10 deletions setup.py
@@ -11,7 +11,8 @@

PyThaiNLP is a Python library for Thai natural language processing.
The library provides functions like word tokenization, part-of-speech tagging,
- transliteration, soundex generation, and spell checking.
+ transliteration, soundex generation, spell checking, and
+ date and time parsing/formatting.

# Install

@@ -29,13 +30,6 @@

Some functionalities, like named-entity recognition, required extra packages.
See https://github.com/PyThaiNLP/pythainlp for installation options.
-
-
- Made with ❤️
-
- PyThaiNLP Team
-
- "We build Thai NLP"
"""

requirements = [
@@ -46,24 +40,33 @@

extras = {
"attacut": ["attacut>=1.0.6"],
- "benchmarks": ["numpy>=1.16.1", "pandas>=0.24", "PyYAML>=5.3.1"],
+ "benchmarks": ["PyYAML>=5.3.1", "numpy>=1.16.1", "pandas>=0.24"],
"icu": ["pyicu>=2.3"],
"ipa": ["epitran>=1.1"],
"ml": ["numpy>=1.16", "torch>=1.0.0"],
"ssg": ["ssg>=0.0.6"],
"thai2fit": ["emoji>=0.5.1", "gensim>=3.2.0", "numpy>=1.16.1"],
- "thai2rom": ["torch>=1.0.0", "numpy>=1.16.1"],
+ "thai2rom": ["numpy>=1.16.1", "torch>=1.0.0"],
+ "translate": [
+ "fairseq>=0.10.0",
+ "sacremoses>=0.0.41",
+ "sentencepiece>=0.1.91",
+ "torch>=1.0.0",
+ ],
"wordnet": ["nltk>=3.3.*"],
"full": [
"PyYAML>=5.3.1",
"attacut>=1.0.4",
"emoji>=0.5.1",
"epitran>=1.1",
+ "fairseq>=0.10.0",
"gensim>=3.2.0",
"nltk>=3.3.*",
"numpy>=1.16.1",
"pandas>=0.24",
"pyicu>=2.3",
+ "sacremoses>=0.0.41",
+ "sentencepiece>=0.1.91",
"ssg>=0.0.6",
"torch>=1.0.0",
],
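
Because `fairseq`, `sacremoses`, and `sentencepiece` are installed only with the `translate` (or `full`) extra, importing `pythainlp.translate` without them fails at the top-level imports in `core.py`. One possible way to turn that into a more helpful message is sketched below. `require_optional` is a hypothetical helper and is not part of PyThaiNLP; the only names taken from the diff are the package name and the `translate` extra.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or explain which extra provides it.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install pythainlp[{extra}]"
        ) from exc


if __name__ == "__main__":
    # A module that is always available imports normally.
    print(require_optional("json", "translate").__name__)
    # A missing module produces the install hint.
    try:
        require_optional("module_that_does_not_exist_xyz", "translate")
    except ImportError as err:
        print(err)
```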
23 changes: 23 additions & 0 deletions tests/test_translate.py
@@ -0,0 +1,23 @@
# -*- coding: utf-8 -*-

import unittest

from pythainlp.translate import translate


class TestTranslatePackage(unittest.TestCase):
    def test_translate(self):
        self.assertIsNotNone(
            translate(
                "แมวกินปลา",
                source="th",
                target="en"
            )
        )
        self.assertIsNotNone(
            translate(
                "the cat eats fish.",
                source="en",
                target="th"
            )
        )
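
The test above only checks that the result is not `None`, which any returned object satisfies, including an exception instance. A stricter shape is sketched here under the assumption that unsupported language pairs are meant to raise `ValueError`. `translate_stub` is a hypothetical stand-in so the example runs without downloading models; it is not the real `pythainlp.translate.translate`.

```python
import unittest


def translate_stub(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for the real translate(); needs no model."""
    if (source, target) in {("th", "en"), ("en", "th")}:
        return f"[{source}->{target}] {text}"
    raise ValueError(f"Unsupported language pair: {source!r} -> {target!r}")


class TestTranslateSketch(unittest.TestCase):
    def test_supported_pairs_return_str(self):
        # Check the type, not just "is not None".
        self.assertIsInstance(translate_stub("แมวกินปลา", "th", "en"), str)
        self.assertIsInstance(
            translate_stub("the cat eats fish.", "en", "th"), str
        )

    def test_unsupported_pair_raises(self):
        with self.assertRaises(ValueError):
            translate_stub("bonjour", "fr", "en")


if __name__ == "__main__":
    unittest.main()
```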