Add pythainlp.translate #439
Merged
80 commits
975f91d
Add Translate code
wannaphong 5299b7f
Update core.py
wannaphong 40bd2c8
Update core.py
wannaphong a0423b6
Update code
wannaphong fbd1497
Update core.py
wannaphong dba2a3d
Update core.py
wannaphong a523430
Update core.py
wannaphong fdf3f99
Update PEP8
wannaphong 88dade6
Update PEP8
wannaphong f32420c
Update PEP8
wannaphong c7d9e76
Update core.py
wannaphong 99ddace
Add requirements
wannaphong de1cd18
Code formatting
bact e235e8c
Add translate test
wannaphong fefa0d3
Update code
wannaphong e5c22d4
Update model name
wannaphong e991393
Update core.py
wannaphong d6ba102
Update path
wannaphong 5707be4
Fixed path bug
wannaphong a5aaa88
move translate code to core.py
wannaphong 330699b
Fixed test
wannaphong e798d33
Update pythainlp-test.yml
wannaphong 9b05d6d
del old file
wannaphong a83077f
Fix PEP8
wannaphong 5ca50c8
Add pythainlp.translate docs
wannaphong 3500451
Update core.py
wannaphong 029ced7
Update core.py
wannaphong afcf490
Update translate.rst
wannaphong 0d50f94
Update core.py
wannaphong 6389e6b
Update core.py
wannaphong f7edd66
Update core.py
wannaphong 7b7f821
Fix PEP8
wannaphong 7d409c7
Merge branch 'dev' into add-translate
wannaphong e7a39b5
Update core.py
wannaphong caca842
Update core.py
wannaphong 9bd9c06
Update __init__.py
wannaphong 083d416
Add en2th bpe2bep
wannaphong 0599138
Update core.py
wannaphong 6a233fc
Update core.py
wannaphong 0262961
Update core.py
wannaphong 75a173e
Update code
wannaphong 312b22d
Update core.py
wannaphong a3ca284
Update core.py
wannaphong c115e4a
Update core.py
wannaphong 1a9c400
Update core.py
wannaphong 02e5186
Add version number to setup,py
bact 960756c
Missing semicolon
bact 3ff13e7
Update core.py
wannaphong 89723e3
fixed th-en
wannaphong 2847924
Update core.py
wannaphong c20d76d
Update core.py
wannaphong 6a6ccac
Update core.py
wannaphong 16009f5
Remove duplicated test cases
bact 726e42b
Refactor core.py
bact 2f2668f
Update core.py
wannaphong b229bce
fixed en2th translate
wannaphong 30080d2
Update core.py
wannaphong e42827a
Update core.py
wannaphong 22ab572
Update core.py
wannaphong 1d82bee
Update core.py
wannaphong 1392b83
Update core.py
wannaphong e94f8ac
Delete pythainlp-test.yml
wannaphong 7e8d6c9
Update to fairseq>=0.10.0
wannaphong 4efcc25
Update AppVeyor environment to Visual Studio 2019
bact b834671
Merge branch 'dev' into add-translate
bact a4eda02
Update appveyor.yml
bact b362e99
Update fairseq version
bact 8b8cd21
Update appveyor.yml
bact f7b5947
Update appveyor.yml
wannaphong 6e5da9d
Update appveyor.yml
wannaphong 7202005
Update appveyor.yml
wannaphong 829226f
Update appveyor.yml
wannaphong 701dc76
Update appveyor.yml
wannaphong ad3f55a
Update appveyor.yml
wannaphong 5213834
Update appveyor.yml
wannaphong 9085187
Update PyTorch
wannaphong e5d9791
Update appveyor.yml
wannaphong d0d2119
Update appveyor.yml
wannaphong 4671eae
Close python 3.8 appveyor
wannaphong 346fabc
Remove unused imports
bact
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| .. currentmodule:: pythainlp.translate | ||
| | ||
| pythainlp.translate | ||
| =================== | ||
| The :mod:`pythainlp.translate` module provides language translation. | ||
| | ||
| Modules | ||
| ------- | ||
| | ||
| .. autofunction:: translate |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| # -*- coding: utf-8 -*- | ||
| """ | ||
| Language translation. | ||
| """ | ||
| | ||
| __all__ = [ | ||
|     "translate", | ||
|     "download_model_all" | ||
| ] | ||
| | ||
| from pythainlp.translate.core import translate, download_model_all |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,127 @@ | ||
| # -*- coding: utf-8 -*- | ||
| import os | ||
| import tarfile | ||
| from collections import defaultdict | ||
| | ||
| from pythainlp.corpus import download, get_corpus_path | ||
| from pythainlp.tools import get_full_data_path, get_pythainlp_data_path | ||
| | ||
| from fairseq.models.transformer import TransformerModel | ||
| from sacremoses import MosesTokenizer | ||
| | ||
| _en_tokenizer = MosesTokenizer("en") | ||
| | ||
| _model = None | ||
| _model_name = None | ||
| | ||
| # SCB_1M-MT_OPUS+TBASE_en-th_moses-spm_130000-16000_v1.0.tar.gz | ||
| _EN_TH_FILE_NAME = ( | ||
|     "SCB_1M-MT_OPUS+TBASE_en-th_moses-spm_130000-16000_v1.0" | ||
| ) | ||
| # SCB_1M-MT_OPUS+TBASE_th-en_spm-spm_32000-joined_v1.0.tar.gz | ||
| _TH_EN_FILE_NAME = "SCB_1M-MT_OPUS+TBASE_th-en_spm-spm_32000-joined_v1.0" | ||
| | ||
| | ||
| def _download_install(name): | ||
|     if get_corpus_path(name) is None: | ||
|         download(name, force=True, version="1.0") | ||
|         tar = tarfile.open(get_corpus_path(name), "r:gz") | ||
|         tar.extractall() | ||
|         tar.close() | ||
|     if not os.path.exists(get_full_data_path(name)): | ||
|         os.mkdir(get_full_data_path(name)) | ||
|         with tarfile.open(get_corpus_path(name)) as tar: | ||
|             tar.extractall(path=get_full_data_path(name)) | ||
| | ||
| | ||
| def download_model_all() -> None: | ||
|     """ | ||
|     Download both translation models (th-en and en-th). | ||
|     """ | ||
|     _download_install("scb_1m_th-en_spm") | ||
|     _download_install("scb_1m_en-th_moses") | ||
| | ||
| | ||
| def _get_translate_path(model: str, *path: str) -> str: | ||
|     return os.path.join(get_full_data_path(model), *path) | ||
| | ||
| | ||
| def _scb_en_th_model_init(): | ||
|     global _model, _model_name | ||
| | ||
|     if _model_name != "scb_1m_en-th_moses": | ||
|         del _model | ||
|         _model_name = "scb_1m_en-th_moses" | ||
|         _download_install(_model_name) | ||
|         _model = TransformerModel.from_pretrained( | ||
|             model_name_or_path=_get_translate_path( | ||
|                 _model_name, _EN_TH_FILE_NAME, "models", | ||
|             ), | ||
|             checkpoint_file="checkpoint.pt", | ||
|             data_name_or_path=_get_translate_path( | ||
|                 _model_name, _EN_TH_FILE_NAME, "vocab", | ||
|             ), | ||
|         ) | ||
| | ||
| | ||
| def _scb_en_th_translate(text: str) -> str: | ||
|     global _model, _model_name | ||
| | ||
|     _scb_en_th_model_init() | ||
| | ||
|     tokens = " ".join(_en_tokenizer.tokenize(text)) | ||
|     translated = _model.translate(tokens) | ||
|     return translated.replace(' ', '').replace('▁', ' ').strip() | ||
| | ||
| | ||
| def _scb_th_en_model_init(): | ||
|     global _model, _model_name | ||
| | ||
|     if _model_name != "scb_1m_th-en_spm": | ||
|         del _model | ||
|         _model_name = "scb_1m_th-en_spm" | ||
|         _download_install(_model_name) | ||
|         _model = TransformerModel.from_pretrained( | ||
|             model_name_or_path=_get_translate_path( | ||
|                 _model_name, _TH_EN_FILE_NAME, "models", | ||
|             ), | ||
|             checkpoint_file="checkpoint.pt", | ||
|             data_name_or_path=_get_translate_path( | ||
|                 _model_name, _TH_EN_FILE_NAME, "vocab", | ||
|             ), | ||
|             bpe="sentencepiece", | ||
|             sentencepiece_model=_get_translate_path( | ||
|                 _model_name, _TH_EN_FILE_NAME, "bpe", "spm.th.model", | ||
|             ), | ||
|         ) | ||
| | ||
| | ||
| def _scb_th_en_translate(text: str) -> str: | ||
|     global _model, _model_name | ||
| | ||
|     _scb_th_en_model_init() | ||
| | ||
|     return _model.translate(text) | ||
| | ||
| | ||
| def translate(text: str, source: str, target: str) -> str: | ||
|     """ | ||
|     Translate text between Thai and English | ||
| | ||
|     :param str text: input text in source language | ||
|     :param str source: source language ("en" or "th") | ||
|     :param str target: target language ("en" or "th") | ||
| | ||
|     :return: translated text in target language | ||
|     :rtype: str | ||
|     """ | ||
|     translated = None | ||
| | ||
|     if source == "th" and target == "en": | ||
|         translated = _scb_th_en_translate(text) | ||
|     elif source == "en" and target == "th": | ||
|         translated = _scb_en_th_translate(text) | ||
|     else: | ||
|         raise ValueError("The combination of the arguments isn't allowed.") | ||
| | ||
|     return translated |
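
The control flow in this file can be hard to follow in diff form, so here is a minimal, self-contained sketch of the same pattern: a single module-level model slot that is reloaded only when the translation direction changes, plus the string cleanup applied to the en-th output. Everything named here (`_FAKE_MODELS`, `stats`, `_get_model`, `_join_pieces`, `translate_sketch`) is hypothetical and exists only for illustration; the fake models stand in for `TransformerModel.from_pretrained`, and the sketch assumes an unsupported language pair should raise `ValueError`.

```python
from typing import Callable, Dict, Optional

# Hypothetical stand-ins for the fairseq models: each "model" is simply a
# function from input text to raw model output.
_FAKE_MODELS: Dict[str, Callable[[str], str]] = {
    # SentencePiece-style output: pieces separated by spaces, with U+2581
    # marking the start of each word.
    "en-th": lambda text: "\u2581แมว กิน \u2581ปลา",
    "th-en": lambda text: "The cat eats fish.",
}

stats = {"loads": 0}  # counts simulated model loads, for illustration only

_model: Optional[Callable[[str], str]] = None
_model_name: Optional[str] = None


def _get_model(name: str) -> Callable[[str], str]:
    """Return the model for `name`, loading it only when the name changes."""
    global _model, _model_name
    if _model is None or _model_name != name:
        _model = _FAKE_MODELS[name]  # the diff loads a fairseq checkpoint here
        _model_name = name
        stats["loads"] += 1
    return _model


def _join_pieces(raw: str) -> str:
    """Remove the spaces between pieces, then turn U+2581 markers into spaces."""
    return raw.replace(" ", "").replace("\u2581", " ").strip()


def translate_sketch(text: str, source: str, target: str) -> str:
    """Dispatch on the language pair; reject anything other than en/th."""
    if source == "en" and target == "th":
        return _join_pieces(_get_model("en-th")(text))
    if source == "th" and target == "en":
        return _get_model("th-en")(text)
    raise ValueError(f"Unsupported language pair: {source!r} -> {target!r}")
```

Because only one model is held at a time, alternating between directions triggers a reload on every call, while repeated calls in the same direction reuse the loaded model. That trades latency on direction switches for lower memory use.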
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,23 @@ | ||
| # -*- coding: utf-8 -*- | ||
| | ||
| import unittest | ||
| | ||
| from pythainlp.translate import translate | ||
| | ||
| | ||
| class TestTranslatePackage(unittest.TestCase): | ||
|     def test_translate(self): | ||
|         self.assertIsNotNone( | ||
|             translate( | ||
|                 "แมวกินปลา", | ||
|                 source="th", | ||
|                 target="en" | ||
|             ) | ||
|         ) | ||
|         self.assertIsNotNone( | ||
|             translate( | ||
|                 "the cat eats fish.", | ||
|                 source="en", | ||
|                 target="th" | ||
|             ) | ||
|         ) |
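
The test in this diff only exercises the two supported directions, and `assertIsNotNone` would also pass if an exception object were returned instead of raised. A possible complement is sketched below, assuming `translate` is meant to raise `ValueError` for unsupported pairs. To keep the example runnable without downloading models, it tests a hypothetical stand-in, `fake_translate`, rather than `pythainlp.translate.translate`.

```python
import unittest


def fake_translate(text: str, source: str, target: str) -> str:
    """Hypothetical stand-in for translate(); performs no real translation."""
    if {source, target} != {"en", "th"}:
        raise ValueError("The combination of the arguments isn't allowed.")
    return text  # a real model would return the translated text here


class TestTranslateErrors(unittest.TestCase):
    def test_supported_pair_returns_str(self):
        # Stricter than assertIsNotNone: an exception object is not a str.
        result = fake_translate("แมวกินปลา", source="th", target="en")
        self.assertIsInstance(result, str)

    def test_unsupported_pair_raises(self):
        with self.assertRaises(ValueError):
            fake_translate("hello", source="en", target="fr")

    def test_same_language_raises(self):
        with self.assertRaises(ValueError):
            fake_translate("hello", source="en", target="en")
```

In a real test module, `fake_translate` would be replaced by the imported `translate` function, at the cost of needing the model files to be available during the test run.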