jslm

Adaptive Language Models in JavaScript

This directory contains a collection of simple adaptive language models that are cheap enough, in both memory and processing time, to train on the fly in a browser.

Language Models

Prediction by Partial Matching (PPM)

Prediction by Partial Matching (PPM) character language model.
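To illustrate the general idea behind PPM, here is a minimal sketch that blends character counts from contexts of increasing length. It is not the implementation in this directory: the class name `TinyPPM`, its methods, and the choice of method-C escape estimates without symbol exclusion are all assumptions made for the example.

```javascript
// Hypothetical sketch: interpolated PPM with method-C escapes and no
// symbol exclusion. Names and structure are illustrative only.
class TinyPPM {
  constructor(vocab, maxOrder) {
    this.vocab = vocab;        // array of distinct symbols
    this.maxOrder = maxOrder;  // longest context length to use
    this.tables = new Map();   // context string -> { total, counts: Map }
  }

  // Record `symbol` after every suffix of `history` up to maxOrder symbols.
  update(history, symbol) {
    const longest = Math.min(this.maxOrder, history.length);
    for (let k = 0; k <= longest; k++) {
      const ctx = history.slice(history.length - k);
      let table = this.tables.get(ctx);
      if (!table) {
        table = { total: 0, counts: new Map() };
        this.tables.set(ctx, table);
      }
      table.counts.set(symbol, (table.counts.get(symbol) || 0) + 1);
      table.total += 1;
    }
  }

  // Start from a uniform distribution and refine it with each longer
  // context that has been observed.
  prob(history, symbol) {
    let p = 1 / this.vocab.length;
    const longest = Math.min(this.maxOrder, history.length);
    for (let k = 0; k <= longest; k++) {
      const table = this.tables.get(history.slice(history.length - k));
      if (!table) break; // longer contexts cannot have been seen either
      const distinct = table.counts.size;
      const denom = table.total + distinct;
      const count = table.counts.get(symbol) || 0;
      // The escape mass (distinct / denom) is given to the shorter-context estimate.
      p = count / denom + (distinct / denom) * p;
    }
    return p;
  }
}

// Usage: train adaptively on a short string, one symbol at a time.
const ppm = new TinyPPM(["a", "b"], 2);
const text = "abababab";
for (let i = 0; i < text.length; i++) {
  ppm.update(text.slice(0, i), text[i]);
}
console.log(ppm.prob("a", "b"), ppm.prob("a", "a"));
```

Because each step is a convex combination of a count-based estimate and the lower-order estimate, the probabilities over the vocabulary sum to one for any history.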

Bibliography

  1. Cleary, John G. and Witten, Ian H. (1984): “Data Compression Using Adaptive Coding and Partial String Matching”, IEEE Transactions on Communications, vol. 32, no. 4, pp. 396–402.
  2. Moffat, Alistair (1990): “Implementing the PPM data compression scheme”, IEEE Transactions on Communications, vol. 38, no. 11, pp. 1917–1921.
  3. Kneser, Reinhard and Ney, Hermann (1995): “Improved backing-off for M-gram language modeling”, Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May, pp. 181–184. IEEE.
  4. Chen, Stanley F. and Goodman, Joshua (1999): “An empirical study of smoothing techniques for language modeling”, Computer Speech & Language, vol. 13, no. 4, pp. 359–394, Elsevier.
  5. Ward, David J. and Blackwell, Alan F. and MacKay, David J. C. (2000): “Dasher – A Data Entry Interface Using Continuous Gestures and Language Models”, UIST '00 Proceedings of the 13th annual ACM symposium on User interface software and technology, pp. 129–137, November, San Diego, USA.
  6. Drinic, Milenko and Kirovski, Darko and Potkonjak, Miodrag (2003): “PPM Model Cleaning”, Proc. of Data Compression Conference (DCC'2003), pp. 163–172, March, Snowbird, UT, USA. IEEE.
  7. Huang, Jin Hu and Powers, David (2004): “Adaptive Compression-based Approach for Chinese Pinyin Input”, Proceedings of the Third SIGHAN Workshop on Chinese Language Processing, pp. 24–27, Barcelona, Spain. ACL.
  8. Cowans, Phil (2005): “Language Modelling In Dasher – A Tutorial”, June, Inference Lab, Cambridge University (presentation).
  9. Steinruecken, Christian and Ghahramani, Zoubin and MacKay, David (2015): “Improving PPM with dynamic parameter updates”, Proc. of Data Compression Conference (DCC'2015), pp. 193–202, April, Snowbird, UT, USA. IEEE.
  10. Steinruecken, Christian (2015): “Lossless Data Compression”, PhD dissertation, University of Cambridge.

Histogram Language Model

Very simple context-less histogram character language model.
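As an illustration of what a context-less histogram model does, the sketch below keeps one count per symbol and predicts with add-one (Laplace) smoothing. The class name `HistogramSketch` and the smoothing choice are assumptions for the example; the model in this directory may use a different estimator.

```javascript
// Hypothetical sketch of a context-less histogram model with add-one
// smoothing. Not the implementation shipped in this directory.
class HistogramSketch {
  constructor(vocabSize) {
    this.vocabSize = vocabSize; // number of distinct symbols that can occur
    this.counts = new Map();    // symbol -> number of times observed
    this.total = 0;             // total number of observations
  }

  // Adapt the model to one newly observed symbol.
  update(symbol) {
    this.counts.set(symbol, (this.counts.get(symbol) || 0) + 1);
    this.total += 1;
  }

  // Smoothed relative frequency; unseen symbols keep non-zero probability.
  prob(symbol) {
    const count = this.counts.get(symbol) || 0;
    return (count + 1) / (this.total + this.vocabSize);
  }
}

// Usage: three possible symbols, of which "c" is never observed.
const hist = new HistogramSketch(3);
for (const ch of "aab") hist.update(ch);
console.log(hist.prob("a"), hist.prob("b"), hist.prob("c"));
```

The prediction ignores all preceding symbols, which is what makes the model context-less.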

Bibliography

  1. Steinruecken, Christian (2015): “Lossless Data Compression”, PhD dissertation, University of Cambridge.
  2. Pitman, Jim and Yor, Marc (1997): “The two-parameter Poisson–Dirichlet distribution derived from a stable subordinator”, The Annals of Probability, vol. 25, no. 2, pp. 855–900.
  3. Chen, Stanley F. and Goodman, Joshua (1999): “An empirical study of smoothing techniques for language modeling”, Computer Speech & Language, vol. 13, no. 4, pp. 359–394, Elsevier.

Pólya Tree (PT) Language Model

Context-less predictive distribution based on balanced binary search trees. A tentative implementation is available here.
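The following sketch shows one way such a tree-based predictive distribution can be computed. Symbols are assumed to be integers in `[0, numSymbols)`, the tree is formed by repeatedly halving that range, and each internal node keeps left and right counts with a symmetric pseudo-count `alpha`. The class name `PolyaTreeSketch`, the default `alpha`, and the range-halving layout are assumptions for the example and may differ from the tentative implementation.

```javascript
// Hypothetical sketch of a Polya-tree-style predictor over integer symbols.
class PolyaTreeSketch {
  constructor(numSymbols, alpha = 0.5) {
    this.numSymbols = numSymbols;
    this.alpha = alpha;        // pseudo-count for each branch of a node
    this.counts = new Map();   // "lo:hi" -> [leftCount, rightCount]
  }

  // List the (node, direction) decisions on the way from the root to `symbol`.
  path(symbol) {
    const steps = [];
    let lo = 0;
    let hi = this.numSymbols;
    while (hi - lo > 1) {
      const mid = (lo + hi) >> 1;
      const dir = symbol < mid ? 0 : 1; // 0 = left half, 1 = right half
      steps.push({ key: `${lo}:${hi}`, dir });
      if (dir === 0) hi = mid; else lo = mid;
    }
    return steps;
  }

  // Probability of a symbol is the product of branch probabilities on its path.
  prob(symbol) {
    let p = 1;
    for (const { key, dir } of this.path(symbol)) {
      const c = this.counts.get(key) || [0, 0];
      p *= (c[dir] + this.alpha) / (c[0] + c[1] + 2 * this.alpha);
    }
    return p;
  }

  // Increment the branch counts along the path to the observed symbol.
  update(symbol) {
    for (const { key, dir } of this.path(symbol)) {
      const c = this.counts.get(key) || [0, 0];
      c[dir] += 1;
      this.counts.set(key, c);
    }
  }
}

// Usage: five symbols, with symbol 3 observed three times.
const tree = new PolyaTreeSketch(5);
for (let i = 0; i < 3; i++) tree.update(3);
console.log(tree.prob(3), tree.prob(0));
```

Each update touches only the nodes on one root-to-leaf path, so the cost per symbol grows with the logarithm of the alphabet size rather than with the alphabet size itself.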

Bibliography

  1. Gleave, Adam and Steinruecken, Christian (2017): “Making compression algorithms for Unicode text”, arXiv preprint arXiv:1701.04047.
  2. Steinruecken, Christian (2015): “Lossless Data Compression”, PhD dissertation, University of Cambridge.
  3. Mauldin, R. Daniel and Sudderth, William D. and Williams, S. C. (1992): “Polya Trees and Random Distributions”, The Annals of Statistics, vol. 20, no. 3, pp. 1203–1221.
  4. Lavine, Michael (1992): “Some aspects of Polya tree distributions for statistical modelling”, The Annals of Statistics, vol. 20, no. 3, pp. 1222–1235.
  5. Neath, Andrew A. (2003): “Polya Tree Distributions for Statistical Modeling of Censored Data”, Journal of Applied Mathematics and Decision Sciences, vol. 7, no. 3, pp. 175–186.

Example

See example.js for a simple example of how to use the model API.

The example takes no command-line arguments. To run it with Node.js, invoke:

> node example.js

Test Utility

A simple test driver, language_model_driver.js, can be used under Node.js to check that the model behaves as expected. The driver takes three arguments: the maximum order of the language model, the training file, and the test file. Both files must be plain text. Currently only the PPM model is supported.

Example:

> node --max-old-space-size=4096 language_model_driver.js 7 training.txt test.txt
Initializing vocabulary from training.txt ...
Created vocabulary with 212 symbols.
Constructing 7-gram LM ...
Created trie with 21502513 nodes.
Running over test.txt ...
Results: numSymbols = 69302, ppl = 6.047012997396163, entropy = 2.5962226799087356 bits/char, OOVs = 0 (0%).
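The perplexity and entropy figures in the output above are consistent with the usual relation ppl = 2^entropy, where entropy is the average negative log2 probability per character. The sketch below shows that relation; the function names are hypothetical and the driver may compute its figures differently.

```javascript
// Average negative log2 probability, in bits per symbol, given the
// probabilities a model assigned to each symbol of a test sequence.
function crossEntropyBits(probs) {
  let bits = 0;
  for (const p of probs) bits -= Math.log2(p);
  return bits / probs.length;
}

// Perplexity corresponding to a per-symbol entropy expressed in bits.
function perplexityFromBits(entropyBits) {
  return Math.pow(2, entropyBits);
}

// Using the entropy reported in the sample run above.
const reportedEntropy = 2.5962226799087356;
console.log(perplexityFromBits(reportedEntropy)); // approximately 6.047
```

A model that assigned probability 1/4 to every test symbol would score 2 bits/char and a perplexity of 4 under these definitions.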