Language Modelling

Open In Colab

Tokenizer implemented from scratch using regexes

  • Apostrophes are treated as separate tokens
  • Currency expressions of the form Rs. and $ are handled
  • Standard email IDs, URLs, hashtags (#), and mentions (@) are also handled (a sketch of such a tokenizer follows this list)
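
The snippet below is a minimal sketch of such a regex-based tokenizer, assuming rules similar to the ones listed above; the exact patterns and the `tokenize` name are illustrative, not taken from the notebook.

```python
import re

# Illustrative token pattern covering the cases listed above;
# the actual regexes in the notebook may differ.
TOKEN_PATTERN = re.compile(r"""
    [\w.+-]+@[\w-]+\.[\w.]+        # standard email IDs
  | https?://\S+                   # URLs
  | \#\w+                          # hashtags
  | @\w+                           # mentions
  | Rs\.\s?\d+(?:[.,]\d+)*         # currency of the form Rs.
  | \$\d+(?:[.,]\d+)*              # currency of the form $
  | \w+                            # plain words and numbers
  | '                              # apostrophes as separate tokens
  | [^\w\s]                        # any other punctuation mark
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Mail me@example.com: it costs Rs. 500 or $5 #deal @user, isn't it?"))
# ['Mail', 'me@example.com', ':', 'it', 'costs', 'Rs. 500', 'or', '$5',
#  '#deal', '@user', ',', 'isn', "'", 't', 'it', '?']
```

The alternation order matters: more specific patterns such as email IDs and currency must come before the generic word pattern so they win the match.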

Implementation of the Language Modelling Algorithm

  • Kneser-Ney Smoothing
  • Interpolation
  • N-grams up to order 6 are considered
  • corpus_EN.txt contains sentences in standard English
  • corpus_TW.txt contains assorted tweets
  • The trained language model is stored in a file named "LM" (a bigram Kneser-Ney sketch follows this list)
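
As a rough illustration, here is a minimal bigram Kneser-Ney model with a fixed discount; the notebook generalizes this to orders up to 6 and combines it with interpolation. All names (`train_kneser_ney`, `prob`, the discount value) are illustrative assumptions, not the repository's API.

```python
from collections import Counter, defaultdict

def train_kneser_ney(sentences, discount=0.75):
    bigrams = Counter()
    unigrams = Counter()
    continuations = defaultdict(set)   # words that can follow each history
    histories = defaultdict(set)       # histories that can precede each word
    for sent in sentences:
        tokens = ['<s>'] + sent + ['</s>']
        for prev, cur in zip(tokens, tokens[1:]):
            bigrams[(prev, cur)] += 1
            unigrams[prev] += 1
            continuations[prev].add(cur)
            histories[cur].add(prev)

    total_bigram_types = len(bigrams)

    def prob(prev, cur):
        # Continuation probability: distinct histories that precede cur,
        # normalized by the number of distinct bigram types
        p_cont = len(histories[cur]) / total_bigram_types
        if unigrams[prev] == 0:
            return p_cont
        # Absolute discounting plus back-off mass lambda(prev)
        main = max(bigrams[(prev, cur)] - discount, 0) / unigrams[prev]
        lam = discount * len(continuations[prev]) / unigrams[prev]
        return main + lam * p_cont

    return prob

p = train_kneser_ney([['the', 'cat', 'sat'], ['the', 'dog', 'sat']])
print(p('the', 'cat'))  # 0.25
```

With this toy corpus, p('the', 'cat') evaluates to 0.25: half the mass comes from the discounted bigram count and half from the continuation distribution.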

Visualization of Word Frequency vs. Word Occurrence Rank

  • Resembles a Zipfian distribution for most analytic languages
  • A graph can be constructed for any selected corpus
  • In the present setting:
    1. The first graph considers the top-1000 ranked tokens
    2. The second graph considers the words ranked 10001 to 11000 in the corpus (a plotting sketch follows this list)
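
A minimal sketch of how such a frequency-vs-rank plot could be produced with matplotlib is shown below; `tokens` is assumed to be the tokenized corpus, and the `plot_zipf` name is illustrative.

```python
from collections import Counter
import matplotlib.pyplot as plt

def plot_zipf(tokens, start_rank, end_rank):
    # Sort word frequencies in decreasing order and slice the rank window
    freqs = sorted(Counter(tokens).values(), reverse=True)[start_rank - 1:end_rank]
    ranks = range(start_rank, start_rank + len(freqs))
    plt.loglog(ranks, freqs)   # on log-log axes, Zipfian data looks roughly linear
    plt.xlabel('Word occurrence rank')
    plt.ylabel('Word frequency')
    plt.show()

# tokens = tokenize(open('corpus_EN.txt').read())
# plot_zipf(tokens, 1, 1000)         # first graph: top-1000 ranked tokens
# plot_zipf(tokens, 10001, 11000)    # second graph: ranks 10001 to 11000
```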

Computation of Model Perplexity Scores

  • Provide a test_corpus to generate a perplexity score for each sentence
  • To compare language models, the average perplexity across all sentences in the test_corpus is used (a sketch follows this list)
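
The sketch below shows one way to compute these scores under a bigram model; `prob` is assumed to be a conditional probability function like the Kneser-Ney one above, and the function names are illustrative.

```python
import math

def sentence_perplexity(tokens, prob):
    # Perplexity = exp(-(1/n) * sum of log-probabilities over the sentence.
    # Assumes prob() never returns zero; smoothing guarantees this for
    # words in the known vocabulary.
    padded = ['<s>'] + tokens + ['</s>']
    log_prob = sum(math.log(prob(prev, cur))
                   for prev, cur in zip(padded, padded[1:]))
    return math.exp(-log_prob / (len(padded) - 1))

def average_perplexity(test_corpus, prob):
    # Mean sentence perplexity, used to compare language models
    scores = [sentence_perplexity(sent, prob) for sent in test_corpus]
    return sum(scores) / len(scores)
```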

Sentence Generation

  • The maximum order N of the N-gram models used for generation can be varied (see the sketch below)
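
For illustration, here is a minimal bigram sampler; the notebook's generator additionally lets the maximum N-gram order vary. `generate_sentence` and its parameters are assumed names, not the repository's API.

```python
import random

def generate_sentence(prob, vocab, max_len=20):
    # vocab: list of known words, including the end marker '</s>'
    sentence = []
    prev = '<s>'
    for _ in range(max_len):
        # Sample the next word in proportion to its conditional probability
        weights = [prob(prev, word) for word in vocab]
        cur = random.choices(vocab, weights=weights)[0]
        if cur == '</s>':
            break
        sentence.append(cur)
        prev = cur
    return ' '.join(sentence)
```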