Sherlook Search Engine is a fast, efficient search engine designed to crawl, index, and rank web pages while providing smart query suggestions and a responsive web interface.
The project is divided into several modules, each providing a critical function:
- Web Crawler 🤖
- Indexer 📚
- Query Processor 🔍
- Phrase Searching 📝
- Boolean Operators Support 🔀
- Ranker 📊
- Web Interface 💻
**Web Crawler 🤖**

- Functionality: The crawler starts with a seed set of URLs, downloads HTML documents, and recursively extracts hyperlinks from them.
- Key Requirements:
  - Ensure each page is visited only once by normalizing URLs (see the sketch below).
  - Crawl only specific document types (HTML).
  - Maintain state so crawling can resume without revisiting pages.
  - Respect web administrators' exclusions (robots.txt).
  - Offer a multithreaded implementation with a customizable thread count.
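A minimal sketch of what the URL-normalization step can look like, using only the JDK. The class name and exact rules here are illustrative assumptions, not necessarily Sherlook's implementation:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch only; the project's actual normalizer may apply different rules.
public final class UrlNormalizer {
    public static String normalize(String url) throws URISyntaxException {
        URI uri = new URI(url.trim()).normalize(); // resolves "." and ".." path segments
        String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase();
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase();
        int port = uri.getPort();
        // Drop default ports so http://a.com:80/ and http://a.com/ collapse to one key.
        if ((scheme.equals("http") && port == 80) || (scheme.equals("https") && port == 443)) {
            port = -1;
        }
        String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
        // Strip a trailing slash (except for the root path); drop the fragment entirely.
        if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return new URI(scheme, null, host, port, path, uri.getQuery(), null).toString();
    }
}
```

Keying the visited set on the normalized form lets the crawler treat `HTTP://Example.com:80/a/` and `http://example.com/a` as the same page.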
**Indexer 📚**

- Functionality: Indexes downloaded HTML documents, mapping each word (weighted by whether it appears in the title, headers, or body) to the documents containing it; a simplified sketch follows this list.
- Key Requirements:
  - Persistence: the index is stored in the database.
  - Fast retrieval of documents matching specific query words.
  - Support for incremental updates with newly crawled content.
- Performance: processes approximately 6000 documents in under 2 minutes.
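A simplified, in-memory sketch of the word-to-document mapping described above. The record name and tag weights are assumptions for illustration; the real index is persisted in a database rather than held in a map:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory shape of an inverted index; not the project's actual schema.
public class InvertedIndex {
    // One posting per (word, document) pair.
    public record Posting(int docId, int termFrequency, double weight) {}

    private final Map<String, List<Posting>> index = new HashMap<>();

    // Hypothetical weights: title words count more than headers, headers more than body.
    private static double weightFor(String htmlTag) {
        return switch (htmlTag) {
            case "title" -> 3.0;
            case "h1", "h2", "h3" -> 2.0;
            default -> 1.0;
        };
    }

    public void add(String word, int docId, int tf, String htmlTag) {
        index.computeIfAbsent(word, w -> new ArrayList<>())
             .add(new Posting(docId, tf, tf * weightFor(htmlTag)));
    }

    public List<Posting> lookup(String word) {
        return index.getOrDefault(word, List.of());
    }
}
```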
**Query Processor 🔍**

- Functionality: Handles user search queries by preprocessing them and finding relevant documents via word stemming. For example, the query "travel" also matches variants like "traveler" and "traveling" (see the sketch below).
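For intuition, here is a toy suffix-stripping stemmer showing how "traveler" and "traveling" collapse to the same stem. A production engine would typically use a full Porter or Snowball stemmer; this naive version is only illustrative:

```java
// Toy stemmer for illustration only; not the project's actual stemming algorithm.
public final class ToyStemmer {
    public static String stem(String word) {
        String w = word.toLowerCase();
        // Strip a few common suffixes so "traveler"/"traveling" reduce toward "travel".
        for (String suffix : new String[] {"ing", "er", "ed", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 4) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("traveler"));  // travel
        System.out.println(stem("traveling")); // travel
    }
}
```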
**Phrase Searching 📝**

- Functionality: Supports quoted phrase searching to return only pages containing the exact word order. For instance, searching for "football player" returns only pages with that exact phrase (a position-based matching sketch follows the Boolean operators section).

**Boolean Operators Support 🔀**

- Supports the Boolean operators AND, OR, and NOT, with at most two operations per query, e.g., "Football player" OR "Tennis player".
**Ranker 📊**

- Functionality: Ranks search results based on relevance and page popularity.
- Relevance: calculated from factors such as TF-IDF and appearance in titles/headers.
- Popularity: measured with query-independent algorithms such as PageRank (see the sketch below).
- Performance:
  - First hit rendered in 20–50 ms.
  - Subsequent hits in under 5 ms.
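To make the popularity side concrete, here is a minimal power-iteration PageRank over an adjacency list. The damping factor 0.85 is the conventional default, and the dangling-node simplification is ours; the project's actual parameters are not specified here:

```java
import java.util.Arrays;

// Minimal power-iteration PageRank sketch; omits dangling-node handling for brevity.
public final class PageRank {
    public static double[] compute(int[][] outLinks, int iterations, double damping) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n); // teleportation term
            for (int page = 0; page < n; page++) {
                if (outLinks[page].length == 0) continue; // dangling node: rank leaks (simplification)
                double share = damping * rank[page] / outLinks[page].length;
                for (int target : outLinks[page]) {
                    next[target] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 -> 1, 1 -> 2, 2 -> 0: a cycle, so all pages converge to equal rank (~1/3 each).
        int[][] graph = {{1}, {2}, {0}};
        System.out.println(Arrays.toString(compute(graph, 50, 0.85)));
    }
}
```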
**Web Interface 💻**

- Functionality: Provides an interactive search interface that:
  - Displays results in a Google/Bing-style layout (title, URL, and a snippet with the query words bolded; a highlighting sketch follows this list).
  - Shows the query processing time.
  - Implements pagination (e.g., 200 results over 20 pages).
  - Offers interactive query suggestions based on popular completions.
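One plausible way to produce the bolded snippets, sketched server-side with plain regexes; the actual UI may highlight differently, e.g., on the client:

```java
import java.util.regex.Pattern;

// Sketch: wrap each query word in <b>...</b> for display in the result snippet.
public final class SnippetHighlighter {
    public static String boldQueryWords(String snippet, String[] queryWords) {
        String result = snippet;
        for (String word : queryWords) {
            // (?i) = case-insensitive; \b = word boundaries so "ball" doesn't match "football".
            result = result.replaceAll("(?i)\\b" + Pattern.quote(word) + "\\b", "<b>$0</b>");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boldQueryWords("A famous football player retired.",
                                          new String[] {"football", "player"}));
        // -> A famous <b>football</b> <b>player</b> retired.
    }
}
```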
- Ensure Maven is installed for compiling the backend.
- Compile and format the backend:

  ```bash
  mvn spotless:apply && mvn clean install -DskipTests
  ```

  The build produces a jar file at `target/sherlook-1.0-SNAPSHOT.jar`.
Execute the following commands in order:

1. Crawl websites:

   ```bash
   java -jar target/sherlook-1.0-SNAPSHOT.jar crawl
   ```

2. Index the crawled data:

   ```bash
   java -jar target/sherlook-1.0-SNAPSHOT.jar index
   ```

3. Run the PageRank algorithm:

   ```bash
   java -jar target/sherlook-1.0-SNAPSHOT.jar page-rank
   ```

4. Serve the engine:

   ```bash
   java -jar target/sherlook-1.0-SNAPSHOT.jar serve
   ```
To set up the frontend client:

1. Navigate to the client directory:

   ```bash
   cd client
   ```

2. Create the environment file:

   ```bash
   cp .env.example .env
   ```

3. Install dependencies:

   ```bash
   npm install
   ```

4. Start the development server:

   ```bash
   npm run dev
   ```

5. Open your browser and navigate to http://localhost:5173 (or the host printed in the terminal).
If you prefer to use Docker to manage services, follow these instructions:
```bash
# Build and start all services
docker-compose up --build

# Run in detached mode
docker-compose up -d

# Build and start only the client
docker-compose up --build client

# Build and start only the backend
docker-compose up --build app

# Run the crawler in a container
docker-compose run --rm app crawl

# Run the indexer in a container
docker-compose run --rm app index

# Calculate page ranks in a container
docker-compose run --rm app page-rank

# Run the backend server
docker-compose up app

# Run the frontend client
docker-compose up client
```
- Ensure that required configuration files (e.g., `application.properties`) are correctly set; a hypothetical example follows.
- Both the backend and the client must be running for a complete search experience.
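The exact configuration keys are project-specific. Purely as a hypothetical illustration, if the backend is Spring Boot (not confirmed here), `application.properties` might carry datasource settings along these lines; every key and value below is an assumption to adapt, not the project's confirmed configuration:

```properties
# Hypothetical example only -- consult the repository's own application.properties.
spring.datasource.url=jdbc:mysql://localhost:3306/sherlook
spring.datasource.username=sherlook
spring.datasource.password=changeme
server.port=8080
```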