Description
This issue serves as a note on the sizes of the language models PyThaiNLP currently uses.
- Some are included in the package and are available immediately after installation.
- Some are not included in the package and are downloaded automatically on first call, at runtime.
To include or not include: Pros and cons
- Include language models in the standard package
  - Pros:
    - Less dependent on the network connection, so behaviour is more predictable
    - May be easier to manage and cache in continuous integration/testing environments
  - Cons:
    - Larger package size
    - May waste users' disk space with files they never use
- Download each language model at the point of its first use
  - Pros:
    - Smaller package size
    - Users only download what they actually use
  - Cons:
    - More dependent on the network connection, so behaviour is less predictable
    - Can slow down tests (multiple separate file downloads in sequence are slower than one big file download)
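The download-on-first-use option can be sketched roughly as follows. This is a minimal illustration, not PyThaiNLP's actual implementation: `ensure_model`, `fetch_over_http`, and the cache directory are hypothetical names chosen for this example, and the fetch function is injectable so the caching logic can be tested without network access.

```python
import urllib.request
from pathlib import Path
from typing import Callable

# Hypothetical cache location; not PyThaiNLP's actual data directory.
DEFAULT_CACHE_DIR = Path.home() / ".cache" / "example-models"


def fetch_over_http(url: str) -> bytes:
    """Download the whole file into memory (fine for a sketch, not for a 1 GB model)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def ensure_model(
    filename: str,
    url: str,
    cache_dir: Path = DEFAULT_CACHE_DIR,
    fetch: Callable[[str], bytes] = fetch_over_http,
) -> Path:
    """Return a local path to the model, downloading it only if it is not cached."""
    target = cache_dir / filename
    if target.exists():
        return target  # cache hit: no network access needed
    cache_dir.mkdir(parents=True, exist_ok=True)
    data = fetch(url)
    # Write to a temporary name first so an interrupted write
    # never leaves a truncated model at the final path.
    partial = target.with_name(target.name + ".part")
    partial.write_bytes(data)
    partial.replace(target)
    return target
```

Only the first call for a given file touches the network; later calls return the cached path. That first call is exactly where the unpredictability noted in the cons comes from.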
Use pip to download language models
Optionally, we can consider creating a separate package for the models, uploading it to PyPI, and using pip to handle the downloads.
Users could then run something like pip install pythainlp-models-pos
or pip install pythainlp-models[ner]
or pip install pythainlp-models[all]
during environment setup, and would never have to worry about models being downloaded at runtime.
This way, we can use PyPI as our data host and also benefit from any proxies and caches that CI platforms or ISPs may have for PyPI. It may also be more secure than a self-managed system.
The standard PyPI package size limit is 60 MB, but a higher limit can be requested.
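If models were shipped this way, the library would need to check at runtime whether the models package is present before falling back to a download. A minimal sketch, assuming a hypothetical importable module named `pythainlp_models` (no such package is confirmed to exist; the name only mirrors the pip commands above):

```python
import importlib.util


def models_package_installed(module_name: str = "pythainlp_models") -> bool:
    """Return True if the (hypothetical) pip-installed models package is importable."""
    return importlib.util.find_spec(module_name) is not None


def choose_model_source(module_name: str = "pythainlp_models") -> str:
    """Prefer models installed via pip; otherwise fall back to a runtime download."""
    if models_package_installed(module_name):
        return "installed-package"
    return "runtime-download"
```

`importlib.util.find_spec` looks up a top-level module without importing it, so the check stays cheap even if the models package is large.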
Size and Hosting
Model | Filename | Size | Included in package? | Hosting |
---|---|---|---|---|
Language model (Thai Wikipedia) | thwiki_lm.pth | 1.0 GB | No | ? |
Thai word vector | thai2vec.bin | 62.5 MB | No | ? |
Thai Romanization | thai2rom-pytorch-attn-v0.1.tar | 12.2 MB | No | ? |
Sentence segmentation (TED) | sentenceseg-ted.model | 5.2 MB | Yes | - |
Thai Romanization v2 | thai2rom-v2.hdf5 | 5.1 MB | No | ? |
Named-Entity Recognition | data.model | 1.8 MB | No | ? |
Thai Wikipedia (for?) | thwiki_itos.pkl | 1.5 MB | No | ? |
Thai Romanization | thai2rom-pytorch.tar | 276 KB | No | ? |
(Clearly, we need a standard naming convention here as well.)
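To see which models would not fit under the standard 60 MB PyPI limit mentioned earlier, the table can be checked with a few lines of Python. This is a rough sketch: it assumes each model would be uploaded as its own file, that 1 GB = 1024 MB, and that compression would not change the sizes much. None of these is established in this issue.

```python
# Sizes copied from the table above, converted to megabytes.
# Assumption: 1 GB = 1024 MB and 1 KB = 1/1024 MB.
MODEL_SIZES_MB = {
    "thwiki_lm.pth": 1.0 * 1024,
    "thai2vec.bin": 62.5,
    "thai2rom-pytorch-attn-v0.1.tar": 12.2,
    "sentenceseg-ted.model": 5.2,
    "thai2rom-v2.hdf5": 5.1,
    "data.model": 1.8,
    "thwiki_itos.pkl": 1.5,
    "thai2rom-pytorch.tar": 276 / 1024,
}

PYPI_LIMIT_MB = 60  # standard limit quoted in the pip section


def over_limit(sizes, limit_mb=PYPI_LIMIT_MB):
    """Return the filenames whose size exceeds the limit, sorted by name."""
    return sorted(name for name, size in sizes.items() if size > limit_mb)


oversized = over_limit(MODEL_SIZES_MB)
print(oversized)  # ['thai2vec.bin', 'thwiki_lm.pth']
```

Under these assumptions, only the Thai Wikipedia language model and the word vectors would need a raised limit or a different host.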
Training data and training scripts
See #344
Model card
Related to this, in terms of model description, see #471
Model auto-download
See the discussion about pythainlp.corpus.get_corpus_path() at #385