Information Retrieval System

A comprehensive Information Retrieval system featuring web crawling, TF-IDF indexing, and a RESTful API with advanced query processing capabilities.

Overview

This project implements a complete IR pipeline:

  1. Web Crawler - Scrapy-based crawler with configurable depth/page limits, AutoThrottle, and URL deduplication
  2. TF-IDF Indexer - Inverted index with cosine similarity ranking
  3. Query Processor - Flask REST API with optional features:
    • Spelling correction (NLTK edit distance)
    • Query expansion (WordNet synonyms)
    • Semantic search (FAISS + Sentence Transformers)

Features

Required Features

  • ✅ Web crawling with seed URL, max pages, and max depth
  • ✅ TF-IDF weighted inverted index
  • ✅ Cosine similarity ranking
  • ✅ REST API with JSON responses
  • ✅ Batch CSV query processing

Optional Features

  • Spelling Correction - Edit distance-based correction using NLTK
  • Query Expansion - Synonym expansion using WordNet
  • Semantic Search - Dense vector search using FAISS and Sentence Transformers
  • AutoThrottle - Polite crawling with automatic rate limiting
  • URL Fragment Handling - Strips URL fragments (anchors) so the same page is not indexed more than once

Performance

Evaluated on a 200-document Wikipedia corpus:

  • Macro-averaged F1: 0.40
  • Precision: 0.50
  • Recall: 0.33
  • Query Latency: ~2ms (TF-IDF only), ~1.3s (all features)
  • Spelling Accuracy: 87.5% (7/8 test cases)
  • Corpus Size: 200 documents, 22,648 unique terms

Quick Start

# Clone and setup
git clone https://github.com/igoeldc/ir-engine.git
cd ir-engine
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Download NLTK data (for spelling correction)
python -c "import nltk; nltk.download('words')"

# Crawl Wikipedia (small demo)
cd crawler
scrapy crawl seed \
    -a start_url="https://en.wikipedia.org/wiki/Information_retrieval" \
    -a max_pages=50 \
    -a max_depth=2

# Build index
cd ..
python indexer/build_index.py

# Optional: Build semantic index (requires ~90MB model download)
python indexer/semantic_index.py

# Start server
python processor/app.py

Server runs on http://localhost:5001
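
Once the server is up, the /health endpoint (see API Reference) is a quick way to confirm that the index loaded:

curl http://localhost:5001/health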

Installation

Requirements

  • Python 3.12+
  • pip

Dependencies

pip install -r requirements.txt

Core packages:

  • scrapy (2.13+) - Web crawling framework
  • scikit-learn (1.6+) - TF-IDF vectorization and ML utilities
  • flask (3.1+) - REST API server
  • beautifulsoup4 (4.12+) - HTML parsing
  • nltk (3.9+) - Spelling correction

Optional (semantic search):

  • faiss-cpu (1.9+) - Fast vector similarity search
  • sentence-transformers (3.3+) - Document embeddings

All dependencies are open-source (BSD, MIT, Apache 2.0 licenses).

Usage

1. Crawl Documents

cd crawler
scrapy crawl seed \
    -a start_url="https://en.wikipedia.org/wiki/Information_retrieval" \
    -a max_pages=200 \
    -a max_depth=3 \
    --loglevel=INFO

Parameters (each is passed to the spider as an -a keyword argument; see the sketch after this list):

  • start_url - Seed URL to begin crawling
  • max_pages - Maximum number of pages to crawl
  • max_depth - Maximum link depth from seed
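
Scrapy hands each -a option to the spider's constructor as a string keyword argument. A minimal sketch of how such a spider can pick these up (the actual seed_spider.py may differ; pages_crawled is a hypothetical counter):

import scrapy

class SeedSpider(scrapy.Spider):
    name = "seed"

    def __init__(self, start_url=None, max_pages=50, max_depth=2, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]      # seed URL for the crawl
        self.max_pages = int(max_pages)    # -a values arrive as strings
        self.max_depth = int(max_depth)
        self.pages_crawled = 0             # used to enforce the max_pages limit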

Output:

  • data/raw_html/*.html - Downloaded HTML documents (UUID-named)
  • data/raw_html/metadata.json - Document metadata (URL, title, depth)

2. Build Index

# TF-IDF index (required)
python indexer/build_index.py

# Semantic index (optional, ~90MB model download)
python indexer/semantic_index.py

Output:

  • data/index/inverted_index.json - Term → [(doc_id, tfidf_score), ...] mapping (example entry below)
  • data/index/documents.json - Document metadata
  • data/index/semantic_metadata.json - FAISS index (if semantic search enabled)
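
For reference, an entry in inverted_index.json has roughly this shape (the doc IDs and scores below are illustrative placeholders, not real output):

{
  "retrieval": [
    ["DOC-ID-1", 0.42],
    ["DOC-ID-2", 0.17]
  ]
}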

3. Start Server

python processor/app.py

Server starts on http://localhost:5001

4. Search

Basic search:

curl "http://localhost:5001/search?q=information+retrieval&k=5"

With spelling correction:

curl "http://localhost:5001/search?q=informaton+retreival&correct=true"

With query expansion:

curl "http://localhost:5001/search?q=search+engine&expand=true"

Semantic search:

curl "http://localhost:5001/search?q=how+do+search+engines+work&semantic=true"

Batch CSV search:

curl -X POST "http://localhost:5001/search?k=3" \
     -H "Content-Type: text/csv" \
     --data-binary @queries.csv
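
The exact queries.csv layout is defined by processor/app.py; a plausible input pairs a query ID with the query text, assuming a two-column query_id,query layout (the column names here are an assumption, check the test suite for the exact header):

query_id,query
UUID-1,information retrieval
UUID-2,vector space model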

API Reference

Endpoints

Endpoint         Method  Description
/health          GET     System health check, returns status and index stats
/search          GET     Single query search
/search          POST    Batch CSV search (queries.csv format)
/documents       GET     List all indexed documents
/documents/<id>  GET     Get specific document details

Query Parameters

Parameter  Type    Default   Description
q          string  required  Query text
k          int     5         Number of results to return
correct    bool    false     Enable spelling correction
expand     bool    false     Enable query expansion with synonyms
semantic   bool    false     Use semantic search instead of TF-IDF
max_dist   int     2         Maximum edit distance for spelling correction
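
Parameters can be combined; for example, spelling correction with a stricter edit-distance threshold and three results:

curl "http://localhost:5001/search?q=informaton+retreival&correct=true&max_dist=1&k=3"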

Response Format

Single query response:

{
  "query": "information retrieval",
  "processed_query": "information retrieval",
  "search_mode": "tfidf",
  "num_results": 3,
  "results": [
    {
      "rank": 1,
      "doc_id": "5266FFE3-A84F-4A50-A832-CE210D9E43FD",
      "title": "Information retrieval - Wikipedia",
      "url": "https://en.wikipedia.org/wiki/Information_retrieval",
      "score": 0.4905
    }
  ]
}

Batch CSV response:

query_id,rank,document_id
UUID-1,1,DOC-ID-1
UUID-1,2,DOC-ID-2
UUID-1,3,DOC-ID-3

Project Structure

.
├── crawler/
│   └── crawler/
│       ├── settings.py              # Scrapy configuration
│       └── spiders/
│           └── seed_spider.py       # Main crawling spider
├── indexer/
│   ├── build_index.py               # TF-IDF index builder
│   └── semantic_index.py            # FAISS semantic index builder
├── processor/
│   └── app.py                       # Flask API server
├── tests/
│   ├── test_search.py               # pytest test suite
│   └── ground_truth.json            # Ground truth for evaluation
├── data/
│   ├── raw_html/                    # Crawled HTML documents
│   │   ├── *.html
│   │   └── metadata.json
│   └── index/                       # Built indices
│       ├── inverted_index.json
│       ├── documents.json
│       └── semantic_metadata.json
├── requirements.txt                 # Python dependencies
├── report.ipynb                     # Project report (Jupyter notebook)
└── README.md                        # This file

Implementation Details

Crawler

  • Framework: Scrapy with AutoThrottle for polite crawling
  • Features:
    • Respects robots.txt
    • URL fragment stripping to prevent duplicate indexing (see the sketch after this list)
    • Wikipedia namespace filtering (Category:, File:, etc.)
    • UUID-based document naming
  • Limitations: URL aliases (the same content served at different URLs) are treated as separate documents
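
The fragment stripping mentioned above amounts to normalizing each discovered URL before it is scheduled; a minimal sketch of the idea (not the exact spider code):

from urllib.parse import urldefrag

def normalize(url: str) -> str:
    # ".../Information_retrieval#History" and ".../Information_retrieval"
    # both map to the same fragment-free URL, so the page is fetched once.
    clean_url, _fragment = urldefrag(url)
    return clean_url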

Indexer

  • Algorithm: TF-IDF with IDF(t) = log((1 + N) / (1 + df(t))) + 1
  • Vectorizer: scikit-learn TfidfVectorizer (configuration sketch after this list)
  • Features:
    • English stop word removal
    • Min DF = 2 (or 1 for small corpora)
    • Max DF = 0.9
  • Storage: JSON format (human-readable, ~100MB for 200 docs)
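
A sketch of a TfidfVectorizer configured with the settings listed above; doc_texts stands in for the list of extracted document texts, and the exact arguments in build_index.py may differ:

from sklearn.feature_extraction.text import TfidfVectorizer

# With the default smooth_idf=True, scikit-learn computes
# IDF(t) = log((1 + N) / (1 + df(t))) + 1, matching the formula above.
vectorizer = TfidfVectorizer(
    stop_words="english",  # English stop word removal
    min_df=2,              # ignore terms appearing in fewer than 2 documents
    max_df=0.9,            # ignore terms appearing in more than 90% of documents
)
tfidf_matrix = vectorizer.fit_transform(doc_texts)  # rows: documents, columns: terms
terms = vectorizer.get_feature_names_out()          # vocabulary for the inverted index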

Query Processor

  • Ranking: Cosine similarity between query and document vectors
  • Spelling Correction: NLTK edit distance with configurable threshold (see the sketch after this list)
  • Query Expansion: WordNet synonym expansion
  • Semantic Search: Sentence-BERT embeddings + FAISS approximate nearest neighbor search
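
As an illustration of the spelling-correction step, a minimal edit-distance corrector over the NLTK words corpus (a sketch under those assumptions, not the exact code in processor/app.py):

from nltk.corpus import words
from nltk.metrics.distance import edit_distance

VOCAB = set(words.words())  # requires nltk.download('words')

def correct(term: str, max_dist: int = 2) -> str:
    """Return the closest dictionary word within max_dist edits, else the term unchanged."""
    if term in VOCAB:
        return term
    best, best_dist = term, max_dist + 1
    for candidate in VOCAB:
        # Length difference is a cheap lower bound on edit distance.
        if abs(len(candidate) - len(term)) > max_dist:
            continue
        d = edit_distance(term, candidate)
        if d < best_dist:
            best, best_dist = candidate, d
    return best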

Trade-offs and Design Decisions

Decision                  Trade-off                        Rationale
In-Memory Index           RAM usage vs speed               Acceptable for <1000 docs, enables ~2-3ms query time
JSON Index Format         File size vs readability         Human-readable for debugging, acceptable for project scale
Fragment Stripping        Lose section-specific indexing   Ensures diverse corpus, prevents duplicates
No Content Deduplication  URL aliases create duplicates    Computational cost too high for ~2-3% edge cases
Optional Semantic Search  Complexity and latency           Flexibility without forcing 50x latency increase
Edit Distance ≤ 2         Miss some misspellings           Balances correction recall vs over-correction

Known Limitations

  1. URL Aliases: Wikipedia serves identical content at different URLs (~2-3% of corpus)
  2. Semantic Search Latency: ~50-100ms per query vs ~2ms for TF-IDF
  3. Query Expansion Noise: WordNet synonyms may not fit IR domain (e.g., "engine" → "locomotive")
  4. Spelling Correction: Only corrects a misspelled term when a word in the NLTK words corpus lies within the edit-distance threshold
  5. Scale: In-memory index suitable for <1000 documents; larger corpora need database backing

Future Enhancements

  • Content-based deduplication using MinHash or SimHash
  • Filter query expansion synonyms against corpus vocabulary
  • Persistent database backend (Elasticsearch) for production scale
  • Learning-to-Rank (LTR) with feature extraction and model training
  • Rocchio algorithm for relevance feedback and query reformulation
  • Fine-tuned semantic search model for domain-specific queries
  • Query logging and analytics
  • Document snippets with term highlighting
  • BM25 scoring as alternative to TF-IDF

License

MIT License

Copyright (c) 2025 Ishaan Goel

Acknowledgments

  • Built for CS 429 (Information Retrieval) at Illinois Institute of Technology
  • Uses Wikipedia content under the Creative Commons Attribution-ShareAlike 4.0 license
  • Powered by open-source libraries: Scrapy, Scikit-Learn, Flask, NLTK, FAISS

Contact

Ishaan Goel (GitHub: @igoeldc)