This is a basic search engine prototype.
It is implemented in Python.
The data is stored as .docx files in the folder data.
The output is currently limited to the names of matching documents; their content is not returned.
The prototype gives a basic, high-level understanding of how a search engine works.
- Building index file
a. Fetch all files in the folder ./data/
b. For each file:
i. Fetch the tokens (all the individual words)
ii. Remove stop words, based on the predefined stop word list included in the zip (filename: stopWords.txt)
iii. Perform stemming on the tokens (using the stemming library)
iv. Convert all tokens to lower case
v. Map the tokens to the current document. If a token already has other documents associated with it, add the current document to its list.
vi. Write the tokens to a file for each document (filename = document name)
c. Write all the indexes to file maps.txt
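The index-building steps above can be sketched roughly as follows. This is a minimal illustration, not the code in IR.py: the function names, the `documents` dict of name to text, and the regular-expression tokenizer are all assumptions. Reading the .docx files and writing maps.txt are left out so the sketch stays self-contained. Stemming is passed in as a callable (identity by default); the Porter2 stemmer from the stemming package (`stemming.porter2.stem`) could presumably be supplied instead.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into individual word tokens (hypothetical tokenizer)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def build_index(documents, stop_words, stem=lambda token: token):
    """Map each processed token to the sorted list of documents containing it.

    documents:  dict of {document name: raw text} (assumed input shape)
    stop_words: set of lowercase words to drop
    stem:       callable applied to each token; identity by default
    """
    index = defaultdict(set)
    for name, text in documents.items():
        for token in tokenize(text):
            # Assumption: the stop word check ignores case.
            if token.lower() in stop_words:
                continue
            # Stem first, then lowercase, following the order listed above.
            term = stem(token).lower()
            # A set makes "add the current document to the list" idempotent.
            index[term].add(name)
    return {term: sorted(names) for term, names in index.items()}


docs = {
    "a.docx": "The cat sat on the mat",
    "b.docx": "A dog chased the cat",
}
index = build_index(docs, stop_words={"the", "a", "on"})
print(index["cat"])
```

With the two sample documents, the token "cat" maps to both document names, while stop words such as "the" never enter the index.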
- Processing input query
a. Fetch all the words/tokens in the search query
b. Perform stemming on the tokens
c. Convert the tokens to lower case
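A minimal sketch of the query-processing steps above, under a few assumptions: the query is split on whitespace, the operator words "and" and "or" are recognized regardless of case and are not stemmed, and the stemmer is passed in as a callable (identity by default). The function name `process_query` is hypothetical and is not taken from IR.py.

```python
def process_query(query, stem=lambda token: token):
    """Turn a raw query string into a list of normalized tokens."""
    operators = {"and", "or"}
    processed = []
    for word in query.split():
        lowered = word.lower()
        if lowered in operators:
            # Assumption: operators are kept as-is so stemming cannot alter them.
            processed.append(lowered)
        else:
            # Stem first, then lowercase, matching the order of the steps above.
            processed.append(stem(word).lower())
    return processed


tokens = process_query("Cat AND Dog")
print(tokens)
```

Applying the same stemming and lowercasing to the query as to the documents is what lets query terms match the keys stored in the index.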
- Perform the Search operation
a. For each token generated from the query, fetch the list of documents associated with it
b. If an ‘and’ operation is provided, intersect the document lists of the previous and next tokens
c. If an ‘or’ operation is provided, union the document lists of the previous and next tokens
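The search step can be illustrated with a short left-to-right evaluation over sets. This is a sketch under stated assumptions, not the logic of IR.py: the function name `search` and the shape of `index` (token to list of document names) are hypothetical, adjacent terms with no operator between them are treated as ‘or’, and unknown tokens simply match no documents.

```python
def search(query_tokens, index):
    """Evaluate a flat and/or query from left to right, without parentheses."""
    result = None
    operator = "or"  # assumption: default when no operator separates two terms
    for token in query_tokens:
        if token in ("and", "or"):
            operator = token
            continue
        docs = set(index.get(token, []))
        if result is None:
            result = docs              # first term seeds the result
        elif operator == "and":
            result = result & docs     # intersect previous and next lists
        else:
            result = result | docs     # union previous and next lists
        operator = "or"                # reset to the assumed default
    return sorted(result or [])


sample_index = {
    "cat": ["a.docx", "b.docx"],
    "dog": ["b.docx"],
    "mat": ["a.docx"],
}
print(search(["cat", "and", "dog"], sample_index))
print(search(["dog", "or", "mat"], sample_index))
```

Because evaluation is strictly left to right, "cat or dog and mat" is treated as "(cat or dog) and mat"; there is no operator precedence in this sketch.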
- Assumptions
a. Parentheses are not supported in search queries
- Minimum requirements to run
a. Python should be installed
b. The following packages must be available:
i. os (part of the Python standard library)
ii. sys (part of the Python standard library)
iii. stemming: pip install stemming
iv. docx: pip install python-docx (python-docx.readthedocs.io/)
- Basic setup
a. Documents must be placed in the folder “data”
b. The stopWords.txt file should be placed in the executing folder
c. Output will be written to the folder “output”
d. The output files written are:
i. Tokens generated for all documents (6 files, 1 for each document)
ii. Index generated (maps.txt)
iii. Search result history: every search query and its output are saved (result.txt)
- How to run:
a. python IR.py "<<search query goes here – in quotes>>"
This was done as part of a second-semester assignment for the Information Retrieval course of the M.Tech in Computer Science at BITS Pilani.