
Considerations for language model inclusion in default package or download them later #298

@bact

This issue serves as a note on the sizes of the different language models PyThaiNLP currently uses.

  • Some are included in the package - they are immediately available after the package is installed.
  • Some are not included in the package - they are downloaded automatically on the first call, at runtime.

To include or not include: Pros and cons

  • Including language models in the standard package
    • Pros:
      • Less dependent on the network connection; more predictable behaviour
      • May be easier to manage and cache in a continuous integration/testing environment
    • Cons:
      • Larger package size
      • May waste users' disk space with files they never use
  • Downloading language models at the point of first use
    • Pros:
      • Smaller package size
      • Users only download what they actually use
    • Cons:
      • More dependent on the network connection; less predictable behaviour
      • Can slow down tests (multiple separate file downloads in sequence are slower than one big download)

Use pip to download language models

Optionally, we can also consider creating a new package, uploading the models to PyPI, and using pip to facilitate downloads.

Users can run something like pip install pythainlp-models-pos, pip install pythainlp-models[ner], or pip install pythainlp-models[all] during environment setup, and then never have to worry about models being downloaded at runtime.

This way, we can use PyPI as our data host and also benefit from any proxies and caches that CI platforms/ISPs may already have for PyPI. This can also be more secure than a self-managed system.

PyPI's standard package size limit is 60 MB, but a larger limit can be requested.
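As a rough sketch, the extras above could be declared in a pythainlp-models meta-package along these lines (the package and extras names here are hypothetical, not an actual PyThaiNLP layout):

```toml
# Hypothetical pyproject.toml fragment for a "pythainlp-models" meta-package.
# Each extra pulls in a separate model package hosted on PyPI.
[project]
name = "pythainlp-models"
version = "0.1.0"

[project.optional-dependencies]
pos = ["pythainlp-models-pos"]
ner = ["pythainlp-models-ner"]
all = ["pythainlp-models-pos", "pythainlp-models-ner"]
```

With this, pip install pythainlp-models[ner] would resolve and cache the NER model package through the normal PyPI machinery.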

Size and Hosting

| Model | Filename | Size | Included in package? | Hosting |
| --- | --- | --- | --- | --- |
| Language model (Thai Wikipedia) | thwiki_lm.pth | 1.0 GB | No | ? |
| Thai word vector | thai2vec.bin | 62.5 MB | No | ? |
| Thai Romanization | thai2rom-pytorch-attn-v0.1.tar | 12.2 MB | No | ? |
| Sentence segmentation (TED) | sentenceseg-ted.model | 5.2 MB | Yes | - |
| Thai Romanization v2 | thai2rom-v2.hdf5 | 5.1 MB | No | ? |
| Named-entity recognition | data.model | 1.8 MB | No | ? |
| Thai Wikipedia (for?) | thwiki_itos.pkl | 1.5 MB | No | ? |
| Thai Romanization | thai2rom-pytorch.tar | 276 KB | No | ? |

(clearly, we need some standard naming convention here as well)

Training data and training scripts

See #344

Model card

Related to this, in terms of model description, see #471

Model auto-download

See discussion about pythainlp.corpus.get_corpus_path() at #385
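For reference, the download-on-first-call pattern can be sketched roughly as below. The helper name, URL handling, and cache location are hypothetical; they only mirror the spirit of pythainlp.corpus.get_corpus_path(), not its actual implementation:

```python
# Sketch of a download-on-first-use helper (hypothetical, for illustration).
import os
import urllib.request

# Assumed default cache location; the real package may use a different one.
CACHE_DIR = os.path.join(os.path.expanduser("~"), ".pythainlp-cache")


def get_model_path(filename: str, url: str, cache_dir: str = CACHE_DIR) -> str:
    """Return the local path of a model file, downloading it on first use."""
    os.makedirs(cache_dir, exist_ok=True)
    local_path = os.path.join(cache_dir, filename)
    if not os.path.exists(local_path):
        # First call: fetch the model over the network, then cache it.
        urllib.request.urlretrieve(url, local_path)
    return local_path
```

Subsequent calls return the cached file without touching the network, which is the behaviour the pros/cons above trade off against bundling the files in the package.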
