
Considerations for language model inclusion in default package or download them later #298

@bact

This issue serves as a note on the sizes of the different language models PyThaiNLP currently uses.

  • Some are included in the package - they are immediately available after the package is installed.
  • Some are not included in the package - they are downloaded automatically on the first call, at runtime.

To include or not include: Pros and cons

  • Including language models in the standard package
    • Pros:
      • Less dependent on the network connection; more predictable behaviour
      • May be easier to manage and cache in a continuous integration/testing environment
    • Cons:
      • Larger package size
      • May waste users' disk space with files they never use
  • Downloading language models at the point of first use
    • Pros:
      • Smaller package size
      • Users only download what they actually use
    • Cons:
      • More dependent on the network connection; less predictable behaviour
      • Can slow down tests (multiple separate file downloads in sequence are slower than one big download)

Use pip to download language models

Optionally, we can also consider creating a new package, uploading the models to PyPI, and using pip to facilitate downloads.

Users can run something like pip install pythainlp-models-pos, pip install pythainlp-models[ner], or pip install pythainlp-models[all] during environment setup, and then never have to worry about models being downloaded at runtime.

This way, we can use PyPI as our data host and also benefit from any proxies and caches that CI platforms/ISPs may already have for PyPI. This can also be more secure than a self-managed system.

PyPI's standard package size limit is 60 MB, but a larger limit can be requested.
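As a rough sketch, the extras above could be declared in a pythainlp-models meta-package along these lines (the package and extras names here are hypothetical, not an actual PyThaiNLP layout):

```toml
# Hypothetical pyproject.toml fragment for a "pythainlp-models" meta-package.
# Each extra pulls in a separate model package hosted on PyPI.
[project]
name = "pythainlp-models"
version = "0.1.0"

[project.optional-dependencies]
pos = ["pythainlp-models-pos"]
ner = ["pythainlp-models-ner"]
all = ["pythainlp-models-pos", "pythainlp-models-ner"]
```

With this, pip install pythainlp-models[ner] would resolve and cache the NER model package through the normal PyPI machinery.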

Size and Hosting

| Model | Filename | Size | Included in package? | Hosting |
| --- | --- | --- | --- | --- |
| Language model (Thai Wikipedia) | thwiki_lm.pth | 1.0 GB | No | ? |
| Thai word vector | thai2vec.bin | 62.5 MB | No | ? |
| Thai Romanization | thai2rom-pytorch-attn-v0.1.tar | 12.2 MB | No | ? |
| Sentence segmentation (TED) | sentenceseg-ted.model | 5.2 MB | Yes | - |
| Thai Romanization v2 | thai2rom-v2.hdf5 | 5.1 MB | No | ? |
| Named-entity recognition | data.model | 1.8 MB | No | ? |
| Thai Wikipedia (for?) | thwiki_itos.pkl | 1.5 MB | No | ? |
| Thai Romanization | thai2rom-pytorch.tar | 276 KB | No | ? |

(clearly, we need some standard naming convention here as well)

Training data and training scripts

See #344

Model card

Related to this, in terms of model description, see #471

Model auto-download

See discussion about pythainlp.corpus.get_corpus_path() at #385
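For reference, the download-on-first-call pattern can be sketched roughly as below. The helper name, URL handling, and cache location are hypothetical; they only mirror the spirit of pythainlp.corpus.get_corpus_path(), not its actual implementation:

```python
# Sketch of a download-on-first-use helper (hypothetical, for illustration).
import os
import urllib.request

# Assumed default cache location; the real package may use a different one.
CACHE_DIR = os.path.join(os.path.expanduser("~"), ".pythainlp-cache")


def get_model_path(filename: str, url: str, cache_dir: str = CACHE_DIR) -> str:
    """Return the local path of a model file, downloading it on first use."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, filename)
    if not os.path.exists(local_path):
        # First call: fetch the model over the network, then cache it.
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

Subsequent calls return the cached file without touching the network, which is the behaviour the pros/cons above trade off against bundling the files in the package.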
