Skip to content

adarsh-7-satyam/search-engine-java

Repository files navigation

🔍 Mini Search Engine in Java

A console-based Mini Search Engine built using Core Java, implementing key concepts from Data Structures, Algorithms, and Information Retrieval systems.

The project supports fast keyword-based search, partial-word queries, and relevance-based ranking using TF-IDF, similar to how real-world search engines work internally.


🚀 Features

  • ✅ Keyword-based document search
  • ✅ Supports partially typed words (prefix search)
  • ✅ Efficient relevance ranking using TF-IDF
  • ✅ Returns Top 5 most relevant documents
  • ✅ Highlights matched keywords in results
  • ✅ Optimized for performance (no combinatorial explosion)
  • ✅ Clean, modular, interview-ready code

🧠 How the Search Engine Works

  1. Documents are indexed using an Inverted Index
  2. User query is tokenized
  3. Partial words are expanded using a Trie
    • Example: ba → backend
  4. Expanded keywords are merged into a single query
  5. TF-IDF scores are computed for each document
  6. Documents are ranked and Top 5 results are returned

⚡ Time Complexity Optimization (Important)

❌ Naive Approach (Not Used)

  • Generate all combinations of prefix-expanded words
  • Leads to exponential time complexity
O(k^p × D × log D)

✅ Optimized Approach (Used in This Project)

  • No combinatorial query generation
  • Single-pass TF-IDF computation using all expanded keywords
  • Uses a Priority Queue to keep only Top 5 results

Final Complexity:

O(T × avgDocs)

Where:

  • T = number of expanded query terms
  • avgDocs = average documents per term

➡️ This makes the engine scalable and production-like.


🧱 Data Structures Used

Data Structure Purpose
HashMap Inverted Index, document storage
Trie Prefix-based word expansion
PriorityQueue Ranking top search results
Set Remove duplicate keywords
List Store documents and results

🧮 Algorithms Used

  • Inverted Indexing
  • TF-IDF (Term Frequency – Inverse Document Frequency)
  • Trie-based Prefix Search
  • Greedy Ranking using Priority Queue
  • String Tokenization & Normalization

🖥️ How to Run This Project Locally

📌 Prerequisites

  • Java JDK 8 or above
  • Git
  • Terminal / Command Prompt / WSL

🔽 Clone the Repository

git clone https://github.com/adarsh-7-satyam/search-engine-java
cd search-engine-java

▶️ Compile the Project

javac *.java
java Main

✍️ Sample Input

jav ba fr

✅ Sample Output

Expanded keywords used for search:

  • java
  • backend
  • framework

Top 5 Search Results:

  1. Spring FRAMEWORK simplifies Java BACKEND development
  2. Spring Boot is a Java FRAMEWORK for BACKEND systems
  3. Java BACKEND developers often use Spring FRAMEWORK ...

🎯 Learning Outcomes

  • Practical understanding of search engine internals

  • Efficient use of DSA in real-world problems

  • Experience with performance optimization

  • Clean separation of search logic and presentation

📌 Future Enhancements

  • Stop-word removal

  • Cosine similarity ranking

  • Spell correction

  • REST API using Spring Boot

  • Web-based UI

👨‍💻 Author

Adarsh Satyam B.Tech CSE, IIT Bhilai

About

Mini Search Engine using Trie, Inverted Index and TF-IDF

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages