Merged
12 changes: 10 additions & 2 deletions pythainlp/tokenize/__init__.py
@@ -122,17 +122,25 @@ def sent_tokenize(text: str, engine: str = "whitespace+newline") -> List[str]:
 def subword_tokenize(text: str, engine: str = "tcc") -> List[str]:
     """
     :param str text: text to be tokenized
-    :param str engine: choosing 'tcc' uses the Thai Character Cluster rule to segment words into the smallest unique units.
+    :param str engine: subword tokenizer
+    :Parameters for engine:
+        * tcc (default) - Thai Character Cluster (Theeramunkong et al. 2000)
+        * etcc - Enhanced Thai Character Cluster (Inrut et al. 2001) [In development]
     :return: a list of tokenized strings.
     """
     if not text:
         return ""

     from .tcc import tcc
+    from .etcc import etcc

+    if engine == "tcc":
+        return tcc(text)
+    elif engine == "etcc":
+        return etcc(text).split("/")
+    #default
     return tcc(text)


 def syllable_tokenize(text: str) -> List[str]:
     """
     :param str text: input string to be tokenized
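The engine dispatch in this hunk can be sketched in isolation as below. This is a minimal illustration, not the PyThaiNLP implementation: `_tcc_stub` and `_etcc_stub` are hypothetical stand-ins for `tcc` and `etcc`, and the only property taken from the diff is that `etcc` returns a "/"-delimited string while `tcc` returns a list. The sketch also returns `[]` for empty input, which would match the `List[str]` annotation; the code in the diff returns `""` in that case.

```python
from typing import Callable, Dict, List


def _tcc_stub(text: str) -> List[str]:
    # Hypothetical stand-in for pythainlp.tokenize.tcc.tcc:
    # simply emits one token per character.
    return list(text)


def _etcc_stub(text: str) -> str:
    # Hypothetical stand-in for pythainlp.tokenize.etcc.etcc:
    # returns tokens joined by "/", as the .split("/") in the diff implies.
    return "/".join(text)


def subword_tokenize_sketch(text: str, engine: str = "tcc") -> List[str]:
    """Dispatch to a subword engine, falling back to TCC for unknown names."""
    if not text:
        # Assumption: an empty list is the natural empty result for List[str].
        return []

    engines: Dict[str, Callable[[str], List[str]]] = {
        "tcc": _tcc_stub,
        "etcc": lambda t: _etcc_stub(t).split("/"),
    }
    # Unknown engine names fall back to the default, as in the diff.
    return engines.get(engine, _tcc_stub)(text)
```

A lookup table keeps the fallback in one place, so adding an engine would not need another `elif` branch.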
2 changes: 2 additions & 0 deletions pythainlp/tokenize/etcc.py
@@ -3,6 +3,8 @@
 ETCC program in Python
 Developed by Mr. Wannaphong Phatthiyaphaibun
 19 June 2017 (B.E. 2560)
+Reference: Inrut, Jeeragone, Patiroop Yuanghirun, Sarayut Paludkong, Supot Nitsuwat, and Para Limmaneepraserth. "Thai word segmentation using combination of forward and backward longest matching techniques." In International Symposium on Communications and Information Technology (ISCIT), pp. 37-40. 2001.
+

 Usage
 etcc(word)
5 changes: 3 additions & 2 deletions pythainlp/tokenize/tcc.py
@@ -1,8 +1,9 @@
 # -*- coding: utf-8 -*-
 """
 Separate Thai text into Thai Character Cluster (TCC).
-Based on "Character cluster based Thai information retrieval" (Theeramunkong et al. 2002)
-http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.2548
+Based on "Character cluster based Thai information retrieval" (Theeramunkong et al. 2000)
+https://dl.acm.org/citation.cfm?id=355225
+http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.59.2548

 Credits:
 - TCC: Jakkrit TeCho
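For readers unfamiliar with character clusters, the underlying idea is that some Thai characters cannot stand apart from the one they attach to. The sketch below is a deliberately simplified illustration of that idea, not the TCC rule set from the cited paper and not the code in `tcc.py`: it only attaches Unicode nonspacing marks (category `Mn`, which covers Thai above and below vowels and tone marks) to the preceding character. The function name `naive_clusters` is hypothetical.

```python
import unicodedata
from typing import List


def naive_clusters(text: str) -> List[str]:
    """Group each character with the nonspacing marks that follow it.

    Simplified illustration only; real TCC rules also cover leading
    vowels, final consonants and other patterns.
    """
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) == "Mn":
            # A nonspacing mark cannot be separated from its base character.
            clusters[-1] += ch
        else:
            # Any other character starts a new cluster.
            clusters.append(ch)
    return clusters
```

For example, `naive_clusters("กิน")` keeps the vowel mark with its consonant and leaves the final consonant on its own.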