Sherlook Search Engine is a fast, efficient search engine designed to crawl, index, and rank web pages while providing smart query suggestions and a responsive web interface.
The project is divided into several modules, each providing a critical function:
- Web Crawler 🤖
- Indexer 📚
- Query Processor 🔍
- Phrase Searching 📝
- Boolean Operators Support 🔀
- Ranker 📊
- Web Interface 💻
**Web Crawler 🤖**

- Functionality: The crawler starts with a seed set of URLs, downloads HTML documents, and recursively extracts hyperlinks from them.
- Key Requirements:
  - Ensure each page is visited only once by normalizing URLs (see the sketch below).
  - Crawl only specific document types (HTML).
  - Maintain state so crawling can resume without revisiting pages.
  - Respect web administrators' exclusions (robots.txt).
  - Offer a multithreaded implementation with a customizable thread count.
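A minimal sketch of what the URL-normalization step can look like, using only the JDK. The class name and exact rules here are illustrative assumptions, not necessarily Sherlook's implementation:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative sketch only; the project's actual normalizer may apply different rules.
public final class UrlNormalizer {
    public static String normalize(String url) throws URISyntaxException {
        URI uri = new URI(url.trim()).normalize(); // resolves "." and ".." path segments
        String scheme = uri.getScheme() == null ? "http" : uri.getScheme().toLowerCase();
        String host = uri.getHost() == null ? "" : uri.getHost().toLowerCase();
        int port = uri.getPort();
        // Drop default ports so http://a.com:80/ and http://a.com/ collapse to one key.
        if ((scheme.equals("http") && port == 80) || (scheme.equals("https") && port == 443)) {
            port = -1;
        }
        String path = (uri.getPath() == null || uri.getPath().isEmpty()) ? "/" : uri.getPath();
        // Strip a trailing slash (except for the root path); drop the fragment entirely.
        if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return new URI(scheme, null, host, port, path, uri.getQuery(), null).toString();
    }
}
```

Keying the visited set on the normalized form lets the crawler treat `HTTP://Example.com:80/a/` and `http://example.com/a` as the same page.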
**Indexer 📚**

- Functionality: Indexes downloaded HTML documents, mapping each word (weighted by whether it appears in the title, headers, or body) to the documents containing it; a simplified sketch follows this list.
- Key Requirements:
  - Persistence: the index is stored in the database.
  - Fast retrieval of documents matching specific query words.
  - Support for incremental updates with newly crawled content.
- Performance: processes approximately 6000 documents in under 2 minutes.
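A simplified, in-memory sketch of the word-to-document mapping described above. The record name and tag weights are assumptions for illustration; the real index is persisted in a database rather than held in a map:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory shape of an inverted index; not the project's actual schema.
public class InvertedIndex {
    // One posting per (word, document) pair.
    public record Posting(int docId, int termFrequency, double weight) {}

    private final Map<String, List<Posting>> index = new HashMap<>();

    // Hypothetical weights: title words count more than headers, headers more than body.
    private static double weightFor(String htmlTag) {
        return switch (htmlTag) {
            case "title" -> 3.0;
            case "h1", "h2", "h3" -> 2.0;
            default -> 1.0;
        };
    }

    public void add(String word, int docId, int tf, String htmlTag) {
        index.computeIfAbsent(word, w -> new ArrayList<>())
             .add(new Posting(docId, tf, tf * weightFor(htmlTag)));
    }

    public List<Posting> lookup(String word) {
        return index.getOrDefault(word, List.of());
    }
}
```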
**Query Processor 🔍**

- Functionality: Handles user search queries by preprocessing them and finding relevant documents via word stemming. For example, the query "travel" also matches variants like "traveler" and "traveling" (see the sketch below).
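For intuition, here is a toy suffix-stripping stemmer showing how "traveler" and "traveling" collapse to the same stem. A production engine would typically use a full Porter or Snowball stemmer; this naive version is only illustrative:

```java
// Toy stemmer for illustration only; not the project's actual stemming algorithm.
public final class ToyStemmer {
    public static String stem(String word) {
        String w = word.toLowerCase();
        // Strip a few common suffixes so "traveler"/"traveling" reduce toward "travel".
        for (String suffix : new String[] {"ing", "er", "ed", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 4) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(stem("traveler"));  // travel
        System.out.println(stem("traveling")); // travel
    }
}
```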
**Phrase Searching 📝**

- Functionality: Supports quoted phrase searching to return only pages containing the exact word order. For instance, searching for "football player" returns only pages with that exact phrase (a position-based matching sketch follows the Boolean operators section).

**Boolean Operators Support 🔀**

- Supports the Boolean operators AND, OR, and NOT, with at most two operations per query, e.g., "Football player" OR "Tennis player".
**Ranker 📊**

- Functionality: Ranks search results based on relevance and page popularity.
- Relevance: calculated from factors such as TF-IDF and appearance in titles/headers.
- Popularity: measured with query-independent algorithms such as PageRank (see the sketch below).
- Performance:
  - First hit rendered in 20–50 ms.
  - Subsequent hits in under 5 ms.
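To make the popularity side concrete, here is a minimal power-iteration PageRank over an adjacency list. The damping factor 0.85 is the conventional default, and the dangling-node simplification is ours; the project's actual parameters are not specified here:

```java
import java.util.Arrays;

// Minimal power-iteration PageRank sketch; omits dangling-node handling for brevity.
public final class PageRank {
    public static double[] compute(int[][] outLinks, int iterations, double damping) {
        int n = outLinks.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1.0 - damping) / n); // teleportation term
            for (int page = 0; page < n; page++) {
                if (outLinks[page].length == 0) continue; // dangling node: rank leaks (simplification)
                double share = damping * rank[page] / outLinks[page].length;
                for (int target : outLinks[page]) {
                    next[target] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 -> 1, 1 -> 2, 2 -> 0: a cycle, so all pages converge to equal rank (~1/3 each).
        int[][] graph = {{1}, {2}, {0}};
        System.out.println(Arrays.toString(compute(graph, 50, 0.85)));
    }
}
```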
**Web Interface 💻**

- Functionality: Provides an interactive search interface that:
  - Displays results in a Google/Bing-style layout (title, URL, and a snippet with the query words bolded; a highlighting sketch follows this list).
  - Shows the query processing time.
  - Implements pagination (e.g., 200 results over 20 pages).
  - Offers interactive query suggestions based on popular completions.
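One plausible way to produce the bolded snippets, sketched server-side with plain regexes; the actual UI may highlight differently, e.g., on the client:

```java
import java.util.regex.Pattern;

// Sketch: wrap each query word in <b>...</b> for display in the result snippet.
public final class SnippetHighlighter {
    public static String boldQueryWords(String snippet, String[] queryWords) {
        String result = snippet;
        for (String word : queryWords) {
            // (?i) = case-insensitive; \b = word boundaries so "ball" doesn't match "football".
            result = result.replaceAll("(?i)\\b" + Pattern.quote(word) + "\\b", "<b>$0</b>");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(boldQueryWords("A famous football player retired.",
                                          new String[] {"football", "player"}));
        // -> A famous <b>football</b> <b>player</b> retired.
    }
}
```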
- Ensure Maven is installed for compiling the backend.
- Compile and format the backend:

  ```bash
  mvn spotless:apply && mvn clean install -DskipTests
  ```

  The build produces a jar file at `target/sherlook-1.0-SNAPSHOT.jar`.
Execute the following commands in order:

1. Crawl websites:

   ```bash
   java -jar target/sherlook-1.0-SNAPSHOT.jar crawl
   ```

2. Index the crawled data:

   ```bash
   java -jar target/sherlook-1.0-SNAPSHOT.jar index
   ```

3. Run the PageRank algorithm:

   ```bash
   java -jar target/sherlook-1.0-SNAPSHOT.jar page-rank
   ```

4. Serve the engine:

   ```bash
   java -jar target/sherlook-1.0-SNAPSHOT.jar serve
   ```
To set up the frontend client:

1. Navigate to the client directory:

   ```bash
   cd client
   ```

2. Create the environment file:

   ```bash
   cp .env.example .env
   ```

3. Install dependencies:

   ```bash
   npm install
   ```

4. Start the development server:

   ```bash
   npm run dev
   ```

5. Open your browser and navigate to http://localhost:5173 (or the host printed in the terminal).
If you prefer to use Docker to manage services, follow these instructions:
```bash
# Build and start all services
docker-compose up --build

# Run in detached mode
docker-compose up -d

# Build and start only the client
docker-compose up --build client

# Build and start only the backend
docker-compose up --build app

# Run the crawler in a container
docker-compose run --rm app crawl

# Run the indexer in a container
docker-compose run --rm app index

# Calculate page ranks in a container
docker-compose run --rm app page-rank

# Run the backend server
docker-compose up app

# Run the frontend client
docker-compose up client
```
- Ensure that required configuration files (e.g., `application.properties`) are correctly set; a hypothetical example follows.
- Both the backend and the client must be running for a complete search experience.
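The exact configuration keys are project-specific. Purely as a hypothetical illustration, if the backend is Spring Boot (not confirmed here), `application.properties` might carry datasource settings along these lines; every key and value below is an assumption to adapt, not the project's confirmed configuration:

```properties
# Hypothetical example only -- consult the repository's own application.properties.
spring.datasource.url=jdbc:mysql://localhost:3306/sherlook
spring.datasource.username=sherlook
spring.datasource.password=changeme
server.port=8080
```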