Information Retrieval System

A comprehensive Information Retrieval system featuring web crawling, TF-IDF indexing, and a RESTful API with advanced query processing capabilities.

Overview

This project implements a complete IR pipeline:

  1. Web Crawler - Scrapy-based crawler with configurable depth/page limits, AutoThrottle, and URL deduplication
  2. TF-IDF Indexer - Inverted index with cosine similarity ranking
  3. Query Processor - Flask REST API with optional features:
    • Spelling correction (NLTK edit distance)
    • Query expansion (WordNet synonyms)
    • Semantic search (FAISS + Sentence Transformers)

Features

Required Features

  • ✅ Web crawling with seed URL, max pages, and max depth
  • ✅ TF-IDF weighted inverted index
  • ✅ Cosine similarity ranking
  • ✅ REST API with JSON responses
  • ✅ Batch CSV query processing

Optional Features

  • Spelling Correction - Edit distance-based correction using NLTK
  • Query Expansion - Synonym expansion using WordNet
  • Semantic Search - Dense vector search using FAISS and Sentence Transformers
  • AutoThrottle - Polite crawling with automatic rate limiting
  • URL Fragment Handling - Strips URL fragments (anchors) so the same page is not indexed more than once

Performance

Evaluated on a 200-document Wikipedia corpus:

  • Macro-averaged F1: 0.40
  • Precision: 0.50
  • Recall: 0.33
  • Query Latency: ~2ms (TF-IDF only), ~1.3s (all features)
  • Spelling Accuracy: 87.5% (7/8 test cases)
  • Corpus Size: 200 documents, 22,648 unique terms

Quick Start

# Clone and setup
git clone https://github.com/igoeldc/ir-engine.git
cd ir-engine
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Download NLTK data (for spelling correction)
python -c "import nltk; nltk.download('words')"

# Crawl Wikipedia (small demo)
cd crawler
scrapy crawl seed \
    -a start_url="https://en.wikipedia.org/wiki/Information_retrieval" \
    -a max_pages=50 \
    -a max_depth=2

# Build index
cd ..
python indexer/build_index.py

# Optional: Build semantic index (requires ~90MB model download)
python indexer/semantic_index.py

# Start server
python processor/app.py

Server runs on http://localhost:5001
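
Once the server is up, the /health endpoint (see API Reference) is a quick way to confirm that the index loaded:

curl http://localhost:5001/health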

Installation

Requirements

  • Python 3.12+
  • pip

Dependencies

pip install -r requirements.txt

Core packages:

  • scrapy (2.13+) - Web crawling framework
  • scikit-learn (1.6+) - TF-IDF vectorization and ML utilities
  • flask (3.1+) - REST API server
  • beautifulsoup4 (4.12+) - HTML parsing
  • nltk (3.9+) - Spelling correction

Optional (semantic search):

  • faiss-cpu (1.9+) - Fast vector similarity search
  • sentence-transformers (3.3+) - Document embeddings

All dependencies are open-source (BSD, MIT, Apache 2.0 licenses).

Usage

1. Crawl Documents

cd crawler
scrapy crawl seed \
    -a start_url="https://en.wikipedia.org/wiki/Information_retrieval" \
    -a max_pages=200 \
    -a max_depth=3 \
    --loglevel=INFO

Parameters (each is passed to the spider as an -a keyword argument; see the sketch after this list):

  • start_url - Seed URL to begin crawling
  • max_pages - Maximum number of pages to crawl
  • max_depth - Maximum link depth from seed
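
Scrapy hands each -a option to the spider's constructor as a string keyword argument. A minimal sketch of how such a spider can pick these up (the actual seed_spider.py may differ; pages_crawled is a hypothetical counter):

import scrapy

class SeedSpider(scrapy.Spider):
    name = "seed"

    def __init__(self, start_url=None, max_pages=50, max_depth=2, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]      # seed URL for the crawl
        self.max_pages = int(max_pages)    # -a values arrive as strings
        self.max_depth = int(max_depth)
        self.pages_crawled = 0             # used to enforce the max_pages limit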

Output:

  • data/raw_html/*.html - Downloaded HTML documents (UUID-named)
  • data/raw_html/metadata.json - Document metadata (URL, title, depth)

2. Build Index

# TF-IDF index (required)
python indexer/build_index.py

# Semantic index (optional, ~90MB model download)
python indexer/semantic_index.py

Output:

  • data/index/inverted_index.json - Term → [(doc_id, tfidf_score), ...] mapping (example entry below)
  • data/index/documents.json - Document metadata
  • data/index/semantic_metadata.json - FAISS index (if semantic search enabled)
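
For reference, an entry in inverted_index.json has roughly this shape (the doc IDs and scores below are illustrative placeholders, not real output):

{
  "retrieval": [
    ["DOC-ID-1", 0.42],
    ["DOC-ID-2", 0.17]
  ]
}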

3. Start Server

python processor/app.py

Server starts on http://localhost:5001

4. Search

Basic search:

curl "http://localhost:5001/search?q=information+retrieval&k=5"

With spelling correction:

curl "http://localhost:5001/search?q=informaton+retreival&correct=true"

With query expansion:

curl "http://localhost:5001/search?q=search+engine&expand=true"

Semantic search:

curl "http://localhost:5001/search?q=how+do+search+engines+work&semantic=true"

Batch CSV search:

curl -X POST "http://localhost:5001/search?k=3" \
     -H "Content-Type: text/csv" \
     --data-binary @queries.csv
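
The exact queries.csv layout is defined by processor/app.py; a plausible input pairs a query ID with the query text, assuming a two-column query_id,query layout (the column names here are an assumption, check the test suite for the exact header):

query_id,query
UUID-1,information retrieval
UUID-2,vector space model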

API Reference

Endpoints

Endpoint         Method  Description
/health          GET     System health check, returns status and index stats
/search          GET     Single query search
/search          POST    Batch CSV search (queries.csv format)
/documents       GET     List all indexed documents
/documents/<id>  GET     Get specific document details

Query Parameters

Parameter  Type    Default   Description
q          string  required  Query text
k          int     5         Number of results to return
correct    bool    false     Enable spelling correction
expand     bool    false     Enable query expansion with synonyms
semantic   bool    false     Use semantic search instead of TF-IDF
max_dist   int     2         Maximum edit distance for spelling correction
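
Parameters can be combined; for example, spelling correction with a stricter edit-distance threshold and three results:

curl "http://localhost:5001/search?q=informaton+retreival&correct=true&max_dist=1&k=3"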

Response Format

Single query response:

{
  "query": "information retrieval",
  "processed_query": "information retrieval",
  "search_mode": "tfidf",
  "num_results": 3,
  "results": [
    {
      "rank": 1,
      "doc_id": "5266FFE3-A84F-4A50-A832-CE210D9E43FD",
      "title": "Information retrieval - Wikipedia",
      "url": "https://en.wikipedia.org/wiki/Information_retrieval",
      "score": 0.4905
    }
  ]
}

Batch CSV response:

query_id,rank,document_id
UUID-1,1,DOC-ID-1
UUID-1,2,DOC-ID-2
UUID-1,3,DOC-ID-3

Project Structure

.
├── crawler/
│   └── crawler/
│       ├── settings.py              # Scrapy configuration
│       └── spiders/
│           └── seed_spider.py       # Main crawling spider
├── indexer/
│   ├── build_index.py               # TF-IDF index builder
│   └── semantic_index.py            # FAISS semantic index builder
├── processor/
│   └── app.py                       # Flask API server
├── tests/
│   ├── test_search.py               # pytest test suite
│   └── ground_truth.json            # Ground truth for evaluation
├── data/
│   ├── raw_html/                    # Crawled HTML documents
│   │   ├── *.html
│   │   └── metadata.json
│   └── index/                       # Built indices
│       ├── inverted_index.json
│       ├── documents.json
│       └── semantic_metadata.json
├── requirements.txt                 # Python dependencies
├── report.ipynb                     # Project report (Jupyter notebook)
└── README.md                        # This file

Implementation Details

Crawler

  • Framework: Scrapy with AutoThrottle for polite crawling
  • Features:
    • Respects robots.txt
    • URL fragment stripping to prevent duplicate indexing (see the sketch after this list)
    • Wikipedia namespace filtering (Category:, File:, etc.)
    • UUID-based document naming
  • Limitations: URL aliases (the same content served at different URLs) are treated as separate documents
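
The fragment stripping mentioned above amounts to normalizing each discovered URL before it is scheduled; a minimal sketch of the idea (not the exact spider code):

from urllib.parse import urldefrag

def normalize(url: str) -> str:
    # ".../Information_retrieval#History" and ".../Information_retrieval"
    # both map to the same fragment-free URL, so the page is fetched once.
    clean_url, _fragment = urldefrag(url)
    return clean_url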

Indexer

  • Algorithm: TF-IDF with IDF(t) = log((1 + N) / (1 + df(t))) + 1
  • Vectorizer: scikit-learn TfidfVectorizer (configuration sketch after this list)
  • Features:
    • English stop word removal
    • Min DF = 2 (or 1 for small corpora)
    • Max DF = 0.9
  • Storage: JSON format (human-readable, ~100MB for 200 docs)
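
A sketch of a TfidfVectorizer configured with the settings listed above; doc_texts stands in for the list of extracted document texts, and the exact arguments in build_index.py may differ:

from sklearn.feature_extraction.text import TfidfVectorizer

# With the default smooth_idf=True, scikit-learn computes
# IDF(t) = log((1 + N) / (1 + df(t))) + 1, matching the formula above.
vectorizer = TfidfVectorizer(
    stop_words="english",  # English stop word removal
    min_df=2,              # ignore terms appearing in fewer than 2 documents
    max_df=0.9,            # ignore terms appearing in more than 90% of documents
)
tfidf_matrix = vectorizer.fit_transform(doc_texts)  # rows: documents, columns: terms
terms = vectorizer.get_feature_names_out()          # vocabulary for the inverted index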

Query Processor

  • Ranking: Cosine similarity between query and document vectors
  • Spelling Correction: NLTK edit distance with configurable threshold (see the sketch after this list)
  • Query Expansion: WordNet synonym expansion
  • Semantic Search: Sentence-BERT embeddings + FAISS approximate nearest neighbor search
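
As an illustration of the spelling-correction step, a minimal edit-distance corrector over the NLTK words corpus (a sketch under those assumptions, not the exact code in processor/app.py):

from nltk.corpus import words
from nltk.metrics.distance import edit_distance

VOCAB = set(words.words())  # requires nltk.download('words')

def correct(term: str, max_dist: int = 2) -> str:
    """Return the closest dictionary word within max_dist edits, else the term unchanged."""
    if term in VOCAB:
        return term
    best, best_dist = term, max_dist + 1
    for candidate in VOCAB:
        # Length difference is a cheap lower bound on edit distance.
        if abs(len(candidate) - len(term)) > max_dist:
            continue
        d = edit_distance(term, candidate)
        if d < best_dist:
            best, best_dist = candidate, d
    return best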

Trade-offs and Design Decisions

Decision                  Trade-off                        Rationale
In-Memory Index           RAM usage vs speed               Acceptable for <1000 docs, enables ~2-3ms query time
JSON Index Format         File size vs readability         Human-readable for debugging, acceptable for project scale
Fragment Stripping        Lose section-specific indexing   Ensures diverse corpus, prevents duplicates
No Content Deduplication  URL aliases create duplicates    Computational cost too high for ~2-3% edge cases
Optional Semantic Search  Complexity and latency           Flexibility without forcing 50x latency increase
Edit Distance ≤ 2         Miss some misspellings           Balances correction recall vs over-correction

Known Limitations

  1. URL Aliases: Wikipedia serves identical content at different URLs (~2-3% of corpus)
  2. Semantic Search Latency: ~50-100ms per query vs ~2ms for TF-IDF
  3. Query Expansion Noise: WordNet synonyms may not fit IR domain (e.g., "engine" → "locomotive")
  4. Spelling Correction: Only corrects a misspelled term when a word in the NLTK words corpus lies within the edit-distance threshold
  5. Scale: In-memory index suitable for <1000 documents; larger corpora need database backing

Future Enhancements

  • Content-based deduplication using MinHash or SimHash
  • Filter query expansion synonyms against corpus vocabulary
  • Persistent database backend (Elasticsearch) for production scale
  • Learning-to-Rank (LTR) with feature extraction and model training
  • Rocchio algorithm for relevance feedback and query reformulation
  • Fine-tuned semantic search model for domain-specific queries
  • Query logging and analytics
  • Document snippets with term highlighting
  • BM25 scoring as alternative to TF-IDF

License

MIT License

Copyright (c) 2025 Ishaan Goel

Acknowledgments

  • Built for CS 429 (Information Retrieval) at Illinois Institute of Technology
  • Uses Wikipedia content under the Creative Commons Attribution-ShareAlike 4.0 license
  • Powered by open-source libraries: Scrapy, Scikit-Learn, Flask, NLTK, FAISS

Contact

Ishaan Goel (GitHub: @igoeldc)