- Apostrophes are treated as separate tokens
- Currency expressions of the form Rs. and $ are handled
- Standard email IDs, URLs, hashtags (#), and mentions (@) are also handled (see the tokenizer sketch below)
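A minimal tokenizer sketch along these lines, assuming plain regex matching; the exact patterns, names, and example are illustrative, not the project's code:

```python
import re

# Illustrative token pattern: specific token types first, generic ones last.
TOKEN_PATTERN = re.compile(r"""
    [\w.+-]+@[\w-]+\.[\w.]+          # email ids
  | https?://\S+                     # URLs
  | \#\w+                            # hashtags
  | @\w+                             # mentions
  | (?:Rs\.|\$)\s?\d+(?:[.,]\d+)*    # currency: Rs. / $ amounts
  | \w+                              # plain words
  | '                                # apostrophes as separate tokens
  | [^\w\s]                          # any remaining punctuation
""", re.VERBOSE)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("John's bill was Rs. 500, tweet @john #deal at http://x.co"))
# ['John', "'", 's', 'bill', 'was', 'Rs. 500', ',', 'tweet', '@john', '#deal', 'at', 'http://x.co']
```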
- Kneser-Ney Smoothing
- Interpolation
- N-grams up to order 6 are considered (a bigram Kneser-Ney sketch follows)
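The bigram case below sketches interpolated Kneser-Ney with absolute discounting to show the idea; the project interpolates orders up to 6, and the function names and discount value d=0.75 here are assumptions:

```python
from collections import Counter

def train_kn(tokens, d=0.75):
    bigrams = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])                 # times each word opens a bigram
    continuations = Counter(w2 for (w1, w2) in bigrams)   # distinct contexts each word completes
    followers = Counter(w1 for (w1, w2) in bigrams)       # distinct words following each context
    n_bigram_types = len(bigrams)

    def p_kn(w1, w2):
        p_cont = continuations[w2] / n_bigram_types       # Kneser-Ney continuation probability
        if context_counts[w1] == 0:
            return p_cont                                 # unseen context: back off fully
        discounted = max(bigrams[(w1, w2)] - d, 0) / context_counts[w1]
        lam = d * followers[w1] / context_counts[w1]      # probability mass freed by discounting
        return discounted + lam * p_cont                  # interpolate with the lower order

    return p_kn
```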
- corpus_EN.txt contains sentences in standard English
- corpus_TW.txt contains assorted tweets
- The language model is stored in a file named "LM" (a save/load sketch follows)
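A hypothetical save/load sketch, assuming the trained model is pickled; the project's actual serialization format for "LM" may differ:

```python
import pickle

def save_lm(model, path="LM"):
    # assumption: the model object is picklable (e.g., nested Counters)
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_lm(path="LM"):
    with open(path, "rb") as f:
        return pickle.load(f)
```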
- The token rank-frequency curve resembles a Zipfian distribution, as it does for most analytic languages
- A rank-frequency graph can be constructed for a selected corpus
- In the present setting:
  - The first graph considers the top-1000 ranked tokens
  - The second graph considers tokens ranked 10001 to 11000 in the corpus (see the plotting sketch below)
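A plotting sketch for those two rank windows, assuming matplotlib and using whitespace splitting in place of the real tokenizer; names and paths are illustrative:

```python
from collections import Counter
import matplotlib.pyplot as plt

def plot_rank_window(tokens, start_rank, end_rank):
    freqs = [count for _, count in Counter(tokens).most_common()]
    window = freqs[start_rank - 1:end_rank]               # frequencies in the rank window
    ranks = range(start_rank, start_rank + len(window))
    plt.loglog(ranks, window, marker=".")                 # Zipfian curves look linear on log-log axes
    plt.xlabel("rank")
    plt.ylabel("frequency")
    plt.show()

tokens = open("corpus_EN.txt", encoding="utf-8").read().split()
plot_rank_window(tokens, 1, 1000)        # first graph: top-1000 ranked tokens
plot_rank_window(tokens, 10001, 11000)   # second graph: ranks 10001 to 11000
```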
- Provide a test_corpus to generate a perplexity score for each sentence
- To compare language models, the average perplexity score across all sentences in the test_corpus is used
- The maximum order N of the N-gram models can be varied (a scoring sketch follows)
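A scoring sketch for the bigram case, reusing the hypothetical p_kn/train_kn from the Kneser-Ney sketch above; the probability floor and the test file name are assumptions:

```python
import math

def sentence_perplexity(prob, tokens):
    # PP(s) = exp(-(1/N) * sum_i log P(w_i | w_{i-1}))
    n = len(tokens) - 1
    log_p = sum(math.log(max(prob(tokens[i - 1], tokens[i]), 1e-12))  # floor avoids log(0)
                for i in range(1, len(tokens)))
    return math.exp(-log_p / n)

def average_perplexity(prob, sentences):
    # comparison metric: mean per-sentence perplexity over the test corpus
    scores = [sentence_perplexity(prob, s) for s in sentences if len(s) > 1]
    return sum(scores) / len(scores)

p = train_kn(open("corpus_EN.txt", encoding="utf-8").read().split())
test = [line.split() for line in open("test_corpus.txt", encoding="utf-8")]
print(average_perplexity(p, test))
```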