This project is an information retrieval system built on a dataset of 800,000 news files. With so many files, finding information is very tedious if one has to read every word of every file; an efficient information retrieval system solves this. The data was first pre-processed and cleaned using several techniques: stop-word removal, punctuation removal, lowercasing, and stemming.
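The cleaning steps can be sketched roughly as follows. This is only an illustration: the stop-word list is a tiny made-up sample, and naive_stem is a hypothetical crude suffix stripper standing in for a real stemmer (such as Porter's), since the stemmer actually used is not stated here.

```python
import string

# Tiny illustrative stop-word list (assumption; the project's real list is not given).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def naive_stem(word):
    """Crude suffix stripper, a stand-in for a real stemmer such as Porter's."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then stem each token."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [naive_stem(tok) for tok in text.split() if tok not in STOP_WORDS]
```

For example, preprocess("The markets are Rising, and traders reacted!") yields the tokens market, ris, trader, react; the odd stem "ris" shows why a proper stemmer is preferable in practice.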
A posting index was then created on this data. Each word is a key that maps to a list: the first element is the count of the word, and the second is a dictionary whose keys are the files the word occurs in and whose values are the word's positions in each file.
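A minimal sketch of such an index is shown below. It assumes the count is the total number of occurrences across all files (it could equally be a document frequency), and the function name build_posting_index and the input format are illustrative choices, not taken from the project.

```python
def build_posting_index(docs):
    """Build word -> [count, {file name: [positions]}].

    docs is assumed to be a dict of file name -> list of preprocessed tokens.
    """
    index = {}
    for name, tokens in docs.items():
        for pos, tok in enumerate(tokens):
            entry = index.setdefault(tok, [0, {}])
            entry[0] += 1  # total occurrences of the word (assumed meaning of "count")
            entry[1].setdefault(name, []).append(pos)  # position within this file
    return index
```

With docs = {"a.txt": ["market", "rise", "market"], "b.txt": ["market", "fall"]}, the entry for "market" is [3, {"a.txt": [0, 2], "b.txt": [0]}].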
Using this posting index, Boolean retrieval was performed on the data.
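Boolean retrieval over an index of this shape reduces to set operations on the files listed for each term. The helper names below are hypothetical, and the sketch assumes the word -> [count, {file: positions}] layout.

```python
def docs_for(index, term):
    """Return the set of files containing term (empty if the term is unknown)."""
    entry = index.get(term)
    return set(entry[1]) if entry else set()


def boolean_and(index, t1, t2):
    return docs_for(index, t1) & docs_for(index, t2)


def boolean_or(index, t1, t2):
    return docs_for(index, t1) | docs_for(index, t2)


def boolean_not(index, term, all_docs):
    """Files in all_docs that do not contain term."""
    return set(all_docs) - docs_for(index, term)
```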
Positional retrieval was performed.
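One way to use the stored positions is an exact phrase query: a file matches when the query terms occur at consecutive positions. The sketch below assumes the word -> [count, {file: positions}] index layout; phrase_query is an illustrative name.

```python
def phrase_query(index, terms):
    """Return files in which terms appear as a consecutive phrase."""
    if not terms or any(t not in index for t in terms):
        return set()
    # Only files containing every term can match.
    candidates = set(index[terms[0]][1])
    for t in terms[1:]:
        candidates &= set(index[t][1])
    hits = set()
    for doc in candidates:
        later = [set(index[t][1][doc]) for t in terms[1:]]
        for start in index[terms[0]][1][doc]:
            # The i-th following term must sit exactly i + 1 positions after start.
            if all(start + i + 1 in pos for i, pos in enumerate(later)):
                hits.add(doc)
                break
    return hits
```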
Wildcard query.
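How the wildcard matching was implemented is not described here (a permuterm or k-gram index is common at this scale). As a simple stand-in, the sketch below scans the vocabulary with Python's fnmatch patterns, where * matches any run of characters; wildcard_query is a hypothetical name.

```python
from fnmatch import fnmatchcase


def wildcard_query(index, pattern):
    """Return (matching terms, files containing any of them) for a pattern like 'mar*'."""
    terms = [t for t in index if fnmatchcase(t, pattern)]
    docs = set()
    for t in terms:
        docs |= set(index[t][1])  # union the files of every matching term
    return sorted(terms), docs
```

A linear scan of the vocabulary is fine for a sketch but would likely be too slow for a large vocabulary, which is where a dedicated wildcard index helps.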
Using the posting index created earlier, a bi-word index was built and used for bi-word query retrieval.
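A bi-word index can be derived from the positional information alone, roughly as sketched below. The sketch assumes the word -> [count, {file: positions}] layout and that positions refer to the cleaned token stream; build_biword_index is an illustrative name.

```python
def build_biword_index(index):
    """Map each adjacent word pair (w1, w2) to the set of files containing it."""
    # Rebuild position -> term for every file from the stored positions.
    sequences = {}
    for term, (_, postings) in index.items():
        for doc, positions in postings.items():
            for pos in positions:
                sequences.setdefault(doc, {})[pos] = term
    biwords = {}
    for doc, by_pos in sequences.items():
        for pos, term in by_pos.items():
            nxt = by_pos.get(pos + 1)  # term at the following position, if any
            if nxt is not None:
                biwords.setdefault((term, nxt), set()).add(doc)
    return biwords
```

A bi-word query such as "stock market" then becomes a single lookup of the key ("stock", "market").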
Retrieval using Similarity Index with Vector Space Model
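In the vector space model, the query and each file are represented as term-weight vectors and ranked by a similarity measure, most commonly cosine similarity. The sketch below assumes sparse vectors stored as dicts of term -> weight; the function names are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors (dicts of term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


def rank(query_vec, doc_vecs):
    """Return file names ordered from most to least similar to the query."""
    scores = {d: cosine_similarity(query_vec, v) for d, v in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```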
Likelihood Model using Bayes' theorem
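One common reading of this step is the query likelihood model: by Bayes' theorem P(d | q) is proportional to P(q | d) P(d), so with a uniform prior over files they can be ranked by P(q | d). The sketch below assumes a unigram model with add-one smoothing; both choices are assumptions, as the exact model used is not specified here.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc_tokens, vocab_size):
    """log P(query | doc) under a unigram model with add-one smoothing (assumed)."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in query)


def rank_by_likelihood(query, docs):
    """Rank files (name -> token list) by query likelihood, best first."""
    vocab = {t for toks in docs.values() for t in toks}
    scores = {
        name: query_log_likelihood(query, toks, len(vocab))
        for name, toks in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```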
Assigned tf-idf scores based on the input.
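A tf-idf weight multiplies how often a term occurs in a file by how rare the term is across files. The sketch below uses raw term frequency and idf = log(N / df); this is one of several common variants and is an assumption, since the exact weighting used is not given.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return name -> {term: tf * log(N / df)} for docs given as name -> token list."""
    n = len(docs)
    # Document frequency: number of files in which each term appears at least once.
    df = Counter(t for toks in docs.values() for t in set(toks))
    return {
        name: {t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
        for name, toks in docs.items()
    }
```

A term that appears in every file gets weight 0 under this variant, so it contributes nothing to the ranking.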
Obtained a precision of 0.9735667696532784 (about 97.36%).
Relevance Feedback and reranking of results.
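The feedback method used is not named here. A standard choice is the Rocchio update, sketched below as an assumption: the query vector is moved toward files marked relevant and away from those marked non-relevant, and the results are then reranked with the updated query. The parameter defaults are conventional textbook values, not values from this project.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated sparse query vector (dict of term -> weight)."""
    updated = {t: alpha * w for t, w in query.items()}
    for group, coeff in ((relevant, beta), (non_relevant, -gamma)):
        if not group:
            continue
        for vec in group:
            for t, w in vec.items():
                # Add the group's centroid, scaled by its coefficient.
                updated[t] = updated.get(t, 0.0) + coeff * w / len(group)
    # Negative weights are dropped, as is usual for Rocchio.
    return {t: w for t, w in updated.items() if w > 0}
```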