PyThaiNLP: Thai Natural Language Processing in Python

pythainlp.org | Tutorials | License info | Model cards | Adopters | เอกสารภาษาไทย

Designed to be a Thai-focused counterpart to NLTK, PyThaiNLP provides standard tools for linguistic analysis under an Apache-2.0 license, with its data and models covered by CC0-1.0 and CC-BY-4.0.

pip install pythainlp

Version   Python version   Changes   Documentation
5.2.0     3.7+             Log       pythainlp.org/docs
dev       3.9+             Log       pythainlp.org/dev-docs
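
As a quick check against the version requirements listed above, the following sketch reports which PyThaiNLP release is installed and whether the running interpreter meets a given minimum. It uses only the Python standard library; the helper names installed_version and meets_minimum are illustrative and not part of PyThaiNLP.

```python
import sys
from importlib import metadata


def installed_version(package="pythainlp"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def meets_minimum(minimum=(3, 9)):
    """Check the running interpreter against a (major, minor) minimum."""
    return sys.version_info[:2] >= tuple(minimum)


print("pythainlp:", installed_version() or "not installed")
print("Python 3.9+:", meets_minimum((3, 9)))
```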

Features

  • Linguistic units: Sentence, word, and subword segmentation (sent_tokenize, word_tokenize, subword_tokenize).

  • Tagging: Part-of-speech tagging (pos_tag).

  • Transliteration: Romanization (transliterate) and IPA conversion.

  • Correction: Spelling suggestion and correction (spell, correct).

  • Utilities: Soundex, collation, number-to-text (bahttext), datetime formatting (thai_strftime), and keyboard layout correction.

  • Data: Built-in Thai character sets, word lists, and stop words.

  • CLI: Command-line interface via thainlp.

    thainlp data catalog  # List datasets
    thainlp help          # Show usage
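
A few of the functions named above can be combined as in the following minimal sketch. It assumes pythainlp is installed, that word_tokenize and subword_tokenize are importable from pythainlp.tokenize, and that bahttext is importable from pythainlp.util. The wrapper basic_analysis is a hypothetical helper for illustration only. Default engines are used, so the exact tokens may differ between releases.

```python
def basic_analysis(text, amount):
    """Tokenize Thai text and spell out a baht amount (illustrative helper)."""
    # Imported inside the function so that merely defining the helper does not
    # require PyThaiNLP to be installed.
    from pythainlp.tokenize import subword_tokenize, word_tokenize
    from pythainlp.util import bahttext

    return {
        "words": word_tokenize(text),        # word segmentation
        "subwords": subword_tokenize(text),  # subword segmentation
        "baht_text": bahttext(amount),       # number-to-text
    }
```

Calling basic_analysis("ฉันรักภาษาไทย", 1250.50) returns a dictionary holding two token lists and the amount written out in Thai words.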

Installation options

To install with specific extras (e.g., translate, wordnet, full):

pip install "pythainlp[extra1,extra2,...]"

Available extras include:

  • compact — install a stable and small subset of dependencies (recommended)
  • translate — machine translation support
  • wordnet — WordNet support
  • full — install all optional dependencies (may introduce conflicts)

The documentation website maintains the full list of extras. To see the specific libraries included in each extra, please inspect the [project.optional-dependencies] section of pyproject.toml.

Data directory

PyThaiNLP downloads data (see the data catalog db.json at pythainlp-corpus) to ~/pythainlp-data by default. Set the PYTHAINLP_DATA_DIR environment variable to override this location.
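
The lookup described above can be pictured with the following sketch. It mirrors the documented behavior and is not PyThaiNLP's actual implementation: the helper resolve_data_dir is hypothetical, and only the PYTHAINLP_DATA_DIR variable and the ~/pythainlp-data default come from the text above.

```python
import os


def resolve_data_dir(environ=None):
    """Return the override from PYTHAINLP_DATA_DIR, else the documented default."""
    environ = os.environ if environ is None else environ
    override = environ.get("PYTHAINLP_DATA_DIR")
    if override:
        return os.path.expanduser(override)
    # Documented default location: ~/pythainlp-data
    return os.path.join(os.path.expanduser("~"), "pythainlp-data")


# With no override, the default under the home directory is used.
print(resolve_data_dir({}))
# With an override, that path wins.
print(resolve_data_dir({"PYTHAINLP_DATA_DIR": "/tmp/pythainlp-data"}))
```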

When using PyThaiNLP in distributed computing environments (e.g., Apache Spark), set the PYTHAINLP_DATA_DIR environment variable inside the function that will be distributed to worker nodes. See details in the documentation.
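
A minimal sketch of that pattern, written without any Spark API: the environment variable is set inside the function that runs on each worker, before PyThaiNLP is imported or used. The function name tokenize_on_worker and the ./pythainlp-data path are assumptions for illustration. In a Spark job, a function shaped like this is what would be handed to a map-style operation.

```python
import os


def tokenize_on_worker(text, data_dir="./pythainlp-data"):
    """Tokenize one text on a worker node (illustrative helper)."""
    # Set the variable here rather than on the driver: the driver's
    # environment is not guaranteed to be shared with worker processes.
    os.environ["PYTHAINLP_DATA_DIR"] = data_dir

    # Import after the variable is set so the worker uses the intended location.
    from pythainlp.tokenize import word_tokenize

    return word_tokenize(text)
```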

Testing

We test core functionalities on all officially supported Python versions.

See tests/README.md for test matrix and other details.

Contribute to PyThaiNLP

Please fork and create a pull request. See CONTRIBUTING.md for guidelines and algorithm references.

Citations

If you use PyThaiNLP in your project or publication, please cite the library as follows:

Phatthiyaphaibun, Wannaphong, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, and Pattarawat Chormai. “Pythainlp: Thai Natural Language Processing in Python”. Zenodo, 2 June 2024. http://doi.org/10.5281/zenodo.3519354.

or by BibTeX entry:

@software{pythainlp,
    title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
    author = "Phatthiyaphaibun, Wannaphong  and
      Chaovavanich, Korakot  and
      Polpanumas, Charin  and
      Suriyawongkul, Arthit  and
      Lowphansirikul, Lalita  and
      Chormai, Pattarawat",
    doi = {10.5281/zenodo.3519354},
    license = {Apache-2.0},
    month = jun,
    url = {https://github.com/PyThaiNLP/pythainlp/},
    version = {v5.0.4},
    year = {2024},
}

Our NLP-OSS 2023 paper:

Wannaphong Phatthiyaphaibun, Korakot Chaovavanich, Charin Polpanumas, Arthit Suriyawongkul, Lalita Lowphansirikul, Pattarawat Chormai, Peerat Limkonchotiwat, Thanathip Suntorntip, and Can Udomcharoenchaikit. 2023. PyThaiNLP: Thai Natural Language Processing in Python. In Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023), pages 25–36, Singapore, Singapore. Empirical Methods in Natural Language Processing.

and its BibTeX entry:

@inproceedings{phatthiyaphaibun-etal-2023-pythainlp,
    title = "{P}y{T}hai{NLP}: {T}hai Natural Language Processing in {P}ython",
    author = "Phatthiyaphaibun, Wannaphong  and
      Chaovavanich, Korakot  and
      Polpanumas, Charin  and
      Suriyawongkul, Arthit  and
      Lowphansirikul, Lalita  and
      Chormai, Pattarawat  and
      Limkonchotiwat, Peerat  and
      Suntorntip, Thanathip  and
      Udomcharoenchaikit, Can",
    editor = "Tan, Liling  and
      Milajevs, Dmitrijs  and
      Chauhan, Geeticka  and
      Gwinnup, Jeremy  and
      Rippeth, Elijah",
    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
    month = dec,
    year = "2023",
    address = "Singapore, Singapore",
    publisher = "Empirical Methods in Natural Language Processing",
    url = "https://aclanthology.org/2023.nlposs-1.4",
    pages = "25--36",
    abstract = "We present PyThaiNLP, a free and open-source natural language processing (NLP) library for Thai language implemented in Python. It provides a wide range of software, models, and datasets for Thai language. We first provide a brief historical context of tools for Thai language prior to the development of PyThaiNLP. We then outline the functionalities it provided as well as datasets and pre-trained language models. We later summarize its development milestones and discuss our experience during its development. We conclude by demonstrating how industrial and research communities utilize PyThaiNLP in their work. The library is freely available at https://github.com/pythainlp/pythainlp.",
}

Sponsors

  • VISTEC-depa Thailand Artificial Intelligence Research Institute: Since 2019, the institute has supported our contributors Korakot Chaovavanich and Lalita Lowphansirikul.

  • MacStadium: Provides a free Mac Mini M1 for running our CI builds.

Made with ❤️ | PyThaiNLP Team 💻 | "We build Thai NLP" 🇹🇭

The only official repository is at https://github.com/PyThaiNLP/pythainlp, with one official mirror at https://gitlab.com/pythainlp/pythainlp.
Beware of malware if you use code from any source other than these two official locations on GitHub and GitLab.