This project implements a search engine for the UCI Information Retrieval course. It consists of three main components:
- Indexer: Builds an inverted index from the provided HTML documents.
- Search Component: Allows users to perform searches using Boolean queries and ranks the results using tf-idf scoring.
- Search Report: Provides a report of search results for specific queries.
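The indexing and ranking steps described above can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the function names `tokenize`, `build_index`, and `search` are hypothetical, the tokenizer is a simple regex stand-in (no nltk stemming), Boolean queries are treated as an AND over all terms, and the tf-idf weighting shown, (1 + log10 tf) * log10(N / df), is one common variant that may differ from what indexer.py and search.py use.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Simplified tokenizer (assumption): lowercase alphanumeric runs, no stemming.
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    # docs maps doc_id -> text. Returns {term: {doc_id: term_frequency}}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, num_docs, query):
    # Boolean AND: keep only documents that contain every query term.
    terms = tokenize(query)
    if not terms or any(t not in index for t in terms):
        return []
    matches = set(index[terms[0]])
    for term in terms[1:]:
        matches &= set(index[term])
    # Rank the matches by summed tf-idf weight over the query terms.
    scores = {
        doc_id: sum(
            (1 + math.log10(index[t][doc_id])) * math.log10(num_docs / len(index[t]))
            for t in terms
        )
        for doc_id in matches
    }
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy corpus (made up for illustration).
docs = {
    "a": "machine learning for search engines",
    "b": "learning to rank search results with machine learning",
    "c": "cooking recipes",
}
index = build_index(docs)
results = search(index, len(docs), "machine learning")
print(results)
```

Here document "b" ranks above "a" because "learning" occurs twice in it, and "c" is excluded because it contains neither query term.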
Prerequisites:
- Python 3.x
- Required libraries: BeautifulSoup4, nltk
Project Structure:
- indexer.py: Contains the Indexer class responsible for building the inverted index.
- search.py: Contains the search functionality and helper functions.
- main.py: The main entry point to run the indexing and searching processes.
- data/: Directory to store the index files and other related data.
- DEV/: Directory containing the HTML documents, stored as JSON files.
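As a rough sketch of how the JSON files under DEV/ might be read before indexing: the example below assumes each file is a JSON object with "url" and "content" keys, which this README does not confirm. It also uses the standard library's `html.parser` as a dependency-free stand-in for BeautifulSoup4, and the names `TextExtractor`, `html_to_text`, and `load_documents` are hypothetical.

```python
import json
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    # Collects visible text nodes, skipping <script> and <style> contents.
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return " ".join(parser.parts)


def load_documents(root):
    # Yields (url, text) for every .json file found under root.
    # The "url" and "content" keys are assumed, not taken from the project.
    for path in sorted(Path(root).rglob("*.json")):
        with open(path, encoding="utf-8") as f:
            record = json.load(f)
        yield record["url"], html_to_text(record["content"])
```

Calling `load_documents("DEV")` would then produce the (url, text) pairs that an indexer could tokenize and add to the inverted index.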
Running the Code:
- Create a DEV directory in the current directory and place your JSON files in it.
- To build the index and perform searches, run main.py from the command line: python3 main.py
Authors:
- Qizhi Tian
- James Liu
- Sijie Guo