Information-Retrieval-er

This is a basic search engine prototype.

It is implemented in Python.

The documents to be searched are .docx files stored in the folder data.

Search output is currently limited to the names of the matching documents; their content is not shown.

The project gives a basic, high-level understanding of how a search engine works.

Description

Building index file
a. Fetch all files in the folder ./data/
b. For all files do

i. Fetch the tokens (all the individual words)  
ii. Remove stop words (based on a predefined stop-word list that is included with the project; filename: stopWords.txt)  
iii. Perform stemming on the tokens (using the stemming library)  
iv. Convert all tokens to lower case  
v. Map the tokens to the current document. If the token already has other documents associated with it, add the current document to the list.  
vi. Write tokens to file for each document (filename = document name)

c. Write all the indexes to file maps.txt
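
The indexing steps above can be sketched roughly as follows. This is a minimal illustration, not the code in IR.py: the names build_index, tokenize and toy_stem are hypothetical, documents are passed in as plain strings rather than read from .docx files, and toy_stem is a crude suffix stripper standing in for the stemming library. The sketch also lowercases tokens before removing stop words so that the stop-word match is case-insensitive, which differs from the order listed above.

```python
import re
from collections import defaultdict


def toy_stem(word):
    """Crude stand-in for a real stemmer (assumption, not the stemming library)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Split text into individual alphanumeric words."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_index(docs, stop_words, stem=toy_stem):
    """Map each stemmed token to the list of documents that contain it."""
    index = defaultdict(list)
    for name, text in docs.items():
        tokens = [t.lower() for t in tokenize(text)]
        tokens = [t for t in tokens if t not in stop_words]
        for term in (stem(t) for t in tokens):
            # Add the current document only once per token.
            if name not in index[term]:
                index[term].append(name)
    return dict(index)


# Hypothetical sample data, used only for illustration.
docs = {
    "a.docx": "Searching the engines",
    "b.docx": "The engine indexed documents",
}
index = build_index(docs, stop_words={"the"})
print(index["engine"])
```

A token that occurs in several documents ends up with all of them in its list, which is what the search step relies on later.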

Processing input query
a. Fetch all the words/tokens in the search query
b. Perform stemming on the tokens
c. Convert the tokens to lower case
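
A rough sketch of the query processing, under stated assumptions: process_query and toy_stem are hypothetical names, toy_stem is a simple stand-in for the stemming library, and the words "and" and "or" are assumed to be operators that are left unstemmed. Tokens are lowercased before stemming here so that operators are recognized regardless of case.

```python
OPERATORS = {"and", "or"}  # assumed operator keywords


def toy_stem(word):
    """Crude stand-in for a real stemmer (assumption, not the stemming library)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def process_query(query, stem=toy_stem):
    """Split a query into lowercase tokens, stemming everything except operators."""
    processed = []
    for token in query.split():
        token = token.lower()
        processed.append(token if token in OPERATORS else stem(token))
    return processed


print(process_query("Searching AND engines"))
```

Applying the same normalization to the query as to the documents is what lets a query word such as "Searching" match an indexed token such as "search".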

Perform the Search operation
a. For each token generated from the query, fetch the list of documents associated with it
b. If an ‘and’ operation is provided, intersect the previous and next document lists
c. If an ‘or’ operation is provided, take the union of the previous and next document lists
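
The search step might look something like the sketch below. It assumes the query has already been turned into a list of lowercase tokens and that the index maps each token to a list of document names. The function name search, the left-to-right evaluation order, and the choice to treat two adjacent terms with no operator between them as an "or" are assumptions made for this illustration.

```python
DEFAULT_OPERATOR = "or"  # assumed behaviour when no operator is given


def search(tokens, index):
    """Evaluate a flat and/or query left to right over an inverted index."""
    result = None
    pending = DEFAULT_OPERATOR
    for token in tokens:
        if token in ("and", "or"):
            pending = token
            continue
        postings = set(index.get(token, []))
        if result is None:
            result = postings           # first term starts the result set
        elif pending == "and":
            result = result & postings  # intersect previous and next lists
        else:
            result = result | postings  # union previous and next lists
        pending = DEFAULT_OPERATOR
    return sorted(result or [])


# Hypothetical index, used only for illustration.
sample_index = {
    "search": ["a.docx"],
    "engine": ["a.docx", "b.docx"],
    "index": ["b.docx"],
}
print(search(["search", "and", "engine"], sample_index))
print(search(["search", "or", "index"], sample_index))
```

Because evaluation is strictly left to right, there is no operator precedence and no grouping, which matches the assumption that parentheses are not handled.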

Assumptions
a. Parentheses are not supported in search queries

Minimum requirements to run
a. Python must be installed
b. The following packages must be available

   i. os (part of the Python standard library)  
   ii. sys (part of the Python standard library)  
   iii. stemming – pip install stemming  
   iv. docx – pip install python-docx (python-docx.readthedocs.io/)

Basic setup
a. Documents must be placed in the folder “data”
b. The stopWords.txt file must be placed in the folder from which the script is run
c. Output is written to the folder “output”
d. The output files written are:

   i. Tokens generated for all documents (6 files, 1 for each document)  
   ii. Index generated (maps.txt)  
   iii. Search result history: every search query and its output are saved (result.txt)

How to run:
a. python IR.py "<<search query goes here, in quotes>>"

This was done as part of a second-semester assignment at BITS Pilani for the M.Tech in Computer Science, for the subject Information Retrieval.
