A comprehensive Information Retrieval system featuring web crawling, TF-IDF indexing, and a RESTful API with advanced query processing capabilities.
This project implements a complete IR pipeline:
- Web Crawler - Scrapy-based crawler with configurable depth/page limits, AutoThrottle, and URL deduplication
- TF-IDF Indexer - Inverted index with cosine similarity ranking
- Query Processor - Flask REST API with optional features:
  - Spelling correction (NLTK edit distance)
  - Query expansion (WordNet synonyms)
  - Semantic search (FAISS + Sentence Transformers)
- ✅ Web crawling with seed URL, max pages, and max depth
- ✅ TF-IDF weighted inverted index
- ✅ Cosine similarity ranking
- ✅ REST API with JSON responses
- ✅ Batch CSV query processing
- ✅ Spelling Correction - Edit distance-based correction using NLTK
- ✅ Query Expansion - Synonym expansion using WordNet
- ✅ Semantic Search - Dense vector search using FAISS and Sentence Transformers
- ✅ AutoThrottle - Polite crawling with automatic rate limiting
- ✅ URL Fragment Handling - Prevents duplicate indexing via URL anchors
Evaluated on a 200-document Wikipedia corpus:
- Macro-averaged F1: 0.40
- Precision: 0.50
- Recall: 0.33
- Query Latency: ~2ms (TF-IDF only), ~1.3s (all features)
- Spelling Accuracy: 87.5% (7/8 test cases)
- Corpus Size: 200 documents, 22,648 unique terms
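For reference, a minimal sketch of how macro-averaged precision, recall, and F1 can be computed from per-query relevance judgments. The project's own evaluation (see report.ipynb and tests/ground_truth.json) may aggregate results differently; the input format below is assumed purely for illustration.

```python
# Sketch: macro-averaged IR metrics over queries (per-query P/R/F1, then averaged).
# The ground-truth format here is assumed for illustration only.
def macro_metrics(results_by_query, relevant_by_query):
    precisions, recalls, f1s = [], [], []
    for qid, retrieved in results_by_query.items():
        relevant = relevant_by_query[qid]
        hits = len(set(retrieved) & set(relevant))
        p = hits / len(retrieved) if retrieved else 0.0
        r = hits / len(relevant) if relevant else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p); recalls.append(r); f1s.append(f1)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Example: 1 of 2 retrieved docs is relevant, out of 3 relevant docs total
print(macro_metrics({"q1": ["d1", "d2"]}, {"q1": ["d1", "d3", "d4"]}))
# -> (0.5, 0.333..., 0.4)
```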
```bash
# Clone and setup
git clone https://github.com/igoeldc/ir-engine.git
cd ir-engine
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Download NLTK data (for spelling correction)
python -c "import nltk; nltk.download('words')"

# Crawl Wikipedia (small demo)
cd crawler
scrapy crawl seed \
    -a start_url="https://en.wikipedia.org/wiki/Information_retrieval" \
    -a max_pages=50 \
    -a max_depth=2

# Build index
cd ..
python indexer/build_index.py

# Optional: Build semantic index (requires ~90MB model download)
python indexer/semantic_index.py

# Start server
python processor/app.py
```

The server runs on http://localhost:5001.
- Python 3.12+
- pip
```bash
pip install -r requirements.txt
```

Core packages:

- `scrapy` (2.13+) - Web crawling framework
- `scikit-learn` (1.6+) - TF-IDF vectorization and ML utilities
- `flask` (3.1+) - REST API server
- `beautifulsoup4` (4.12+) - HTML parsing
- `nltk` (3.9+) - Spelling correction

Optional (semantic search):

- `faiss-cpu` (1.9+) - Fast vector similarity search
- `sentence-transformers` (3.3+) - Document embeddings
All dependencies are open-source (BSD, MIT, Apache 2.0 licenses).
```bash
cd crawler
scrapy crawl seed \
    -a start_url="https://en.wikipedia.org/wiki/Information_retrieval" \
    -a max_pages=200 \
    -a max_depth=3 \
    --loglevel=INFO
```

Parameters:

- `start_url` - Seed URL to begin crawling
- `max_pages` - Maximum number of pages to crawl
- `max_depth` - Maximum link depth from seed

Output:

- `data/raw_html/*.html` - Downloaded HTML documents (UUID-named)
- `data/raw_html/metadata.json` - Document metadata (URL, title, depth)
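For orientation, here is a minimal sketch of how a spider can accept these arguments and enforce the page and depth limits. The actual seed_spider.py may be structured differently; the class name and item fields below are illustrative.

```python
# Minimal sketch of a depth/page-limited Scrapy spider (illustrative only;
# the real seed_spider.py may be structured differently).
import scrapy
from urllib.parse import urldefrag


class SeedSpiderSketch(scrapy.Spider):
    name = "seed_sketch"  # hypothetical name, not the project's "seed" spider

    def __init__(self, start_url=None, max_pages=50, max_depth=2, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.max_pages = int(max_pages)
        self.max_depth = int(max_depth)
        self.pages_seen = 0

    def parse(self, response, depth=0):
        if self.pages_seen >= self.max_pages:
            return
        self.pages_seen += 1
        yield {"url": response.url, "depth": depth}  # save HTML/metadata here

        if depth < self.max_depth:
            for href in response.css("a::attr(href)").getall():
                url, _fragment = urldefrag(response.urljoin(href))  # strip #anchors
                yield response.follow(url, callback=self.parse,
                                      cb_kwargs={"depth": depth + 1})
```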
```bash
# TF-IDF index (required)
python indexer/build_index.py

# Semantic index (optional, ~90MB model download)
python indexer/semantic_index.py
```

Output:

- `data/index/inverted_index.json` - Term → [(doc_id, tfidf_score), ...] mapping
- `data/index/documents.json` - Document metadata
- `data/index/semantic_metadata.json` - FAISS index (if semantic search enabled)
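The inverted index can be consumed directly from the JSON file. A minimal sketch, assuming the term → [(doc_id, tfidf_score), ...] layout described above; the real app.py computes full cosine similarity rather than simply summing postings.

```python
# Sketch: load the inverted index and rank documents for a query by summing
# TF-IDF postings. Assumes the term -> [(doc_id, score), ...] layout described
# above; build_index.py / app.py may store extra fields.
import json
from collections import defaultdict

with open("data/index/inverted_index.json") as f:
    inverted_index = json.load(f)

def rank(query, k=5):
    scores = defaultdict(float)
    for term in query.lower().split():
        for doc_id, tfidf in inverted_index.get(term, []):
            scores[doc_id] += tfidf  # simplification: real ranking uses cosine similarity
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:k]

print(rank("information retrieval"))
```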
```bash
python processor/app.py
```

The server starts on http://localhost:5001.

Basic search:

```bash
curl "http://localhost:5001/search?q=information+retrieval&k=5"
```

With spelling correction:

```bash
curl "http://localhost:5001/search?q=informaton+retreival&correct=true"
```

With query expansion:

```bash
curl "http://localhost:5001/search?q=search+engine&expand=true"
```

Semantic search:

```bash
curl "http://localhost:5001/search?q=how+do+search+engines+work&semantic=true"
```

Batch CSV search:

```bash
curl -X POST "http://localhost:5001/search?k=3" \
    -H "Content-Type: text/csv" \
    --data-binary @queries.csv
```

| Endpoint | Method | Description |
|---|---|---|
| `/health` | GET | System health check, returns status and index stats |
| `/search` | GET | Single query search |
| `/search` | POST | Batch CSV search (queries.csv format) |
| `/documents` | GET | List all indexed documents |
| `/documents/<id>` | GET | Get specific document details |
| Parameter | Type | Default | Description |
|---|---|---|---|
| `q` | string | required | Query text |
| `k` | int | 5 | Number of results to return |
| `correct` | bool | false | Enable spelling correction |
| `expand` | bool | false | Enable query expansion with synonyms |
| `semantic` | bool | false | Use semantic search instead of TF-IDF |
| `max_dist` | int | 2 | Maximum edit distance for spelling correction |
Single query response:

```json
{
  "query": "information retrieval",
  "processed_query": "information retrieval",
  "search_mode": "tfidf",
  "num_results": 3,
  "results": [
    {
      "rank": 1,
      "doc_id": "5266FFE3-A84F-4A50-A832-CE210D9E43FD",
      "title": "Information retrieval - Wikipedia",
      "url": "https://en.wikipedia.org/wiki/Information_retrieval",
      "score": 0.4905
    }
  ]
}
```

Batch CSV response:

```csv
query_id,rank,document_id
UUID-1,1,DOC-ID-1
UUID-1,2,DOC-ID-2
UUID-1,3,DOC-ID-3
```
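The API can also be called programmatically. Below is a small sketch using the requests library (not listed in requirements.txt, so install it separately or use urllib); endpoint and parameter names follow the tables above.

```python
# Sketch: query the running API from Python using the documented endpoints.
# Assumes the server is up on localhost:5001.
import requests

BASE = "http://localhost:5001"

# Health check
print(requests.get(f"{BASE}/health").json())

# Single query with spelling correction enabled
resp = requests.get(f"{BASE}/search",
                    params={"q": "informaton retreival", "k": 5, "correct": "true"})
for hit in resp.json()["results"]:
    print(hit["rank"], hit["score"], hit["title"])

# Batch CSV search
with open("queries.csv", "rb") as f:
    csv_out = requests.post(f"{BASE}/search", params={"k": 3},
                            headers={"Content-Type": "text/csv"}, data=f)
print(csv_out.text)
```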
```
.
├── crawler/
│   └── crawler/
│       ├── settings.py          # Scrapy configuration
│       └── spiders/
│           └── seed_spider.py   # Main crawling spider
├── indexer/
│   ├── build_index.py           # TF-IDF index builder
│   └── semantic_index.py        # FAISS semantic index builder
├── processor/
│   └── app.py                   # Flask API server
├── tests/
│   ├── test_search.py           # pytest test suite
│   └── ground_truth.json        # Ground truth for evaluation
├── data/
│   ├── raw_html/                # Crawled HTML documents
│   │   ├── *.html
│   │   └── metadata.json
│   └── index/                   # Built indices
│       ├── inverted_index.json
│       ├── documents.json
│       └── semantic_metadata.json
├── requirements.txt             # Python dependencies
├── report.ipynb                 # Project report (Jupyter notebook)
└── README.md                    # This file
```
- Framework: Scrapy with AutoThrottle for polite crawling
- Features:
  - Respects `robots.txt`
  - URL fragment stripping to prevent duplicate indexing (see the sketch below)
  - Wikipedia namespace filtering (Category:, File:, etc.)
  - UUID-based document naming
- Limitations: URL aliases (same content at different URLs) treated as separate documents
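A hedged sketch of the URL handling described above; the exact rules in seed_spider.py may differ, and the namespace list here is illustrative.

```python
# Sketch of the URL normalization/filtering described above. The exact rules in
# seed_spider.py may differ; the namespace prefixes below are illustrative.
from urllib.parse import urldefrag, urlparse, unquote

SKIP_NAMESPACES = ("Category:", "File:", "Template:", "Help:", "Special:", "Wikipedia:")

def normalize(url: str) -> str | None:
    """Strip the #fragment and drop non-article Wikipedia namespace pages."""
    url, _fragment = urldefrag(url)          # same page, different anchor -> same URL
    title = unquote(urlparse(url).path.rsplit("/", 1)[-1])
    if title.startswith(SKIP_NAMESPACES):    # e.g. Category: or File: pages
        return None
    return url

print(normalize("https://en.wikipedia.org/wiki/Information_retrieval#History"))
# -> https://en.wikipedia.org/wiki/Information_retrieval
print(normalize("https://en.wikipedia.org/wiki/Category:Information_retrieval"))
# -> None
```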
- Algorithm: TF-IDF with IDF(t) = log((1 + N) / (1 + df(t))) + 1
- Vectorizer: Scikit-Learn TfidfVectorizer
- Features:
- English stop word removal
- Min DF = 2 (or 1 for small corpora)
- Max DF = 0.9
- Storage: JSON format (human-readable, ~100MB for 200 docs)
- Ranking: Cosine similarity between query and document vectors
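As a sketch of how these settings map onto scikit-learn; build_index.py may differ in tokenization and in how it serializes the index to JSON.

```python
# Sketch of the TF-IDF setup described above (smooth IDF, English stop words,
# min_df/max_df bounds) and cosine-similarity ranking of a query.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["information retrieval ranks documents", "web crawlers download pages"]

vectorizer = TfidfVectorizer(
    stop_words="english",
    min_df=1,          # the project uses 2, or 1 for small corpora like this one
    max_df=0.9,
    smooth_idf=True,   # idf(t) = log((1 + N) / (1 + df(t))) + 1, as stated above
)
doc_matrix = vectorizer.fit_transform(docs)

query_vec = vectorizer.transform(["information retrieval"])
scores = cosine_similarity(query_vec, doc_matrix).ravel()
print(sorted(zip(scores, docs), reverse=True))  # highest-scoring document first
```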
- Spelling Correction: NLTK edit distance with configurable threshold
- Query Expansion: WordNet synonym expansion
- Semantic Search: Sentence-BERT embeddings + FAISS approximate nearest neighbor search
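For illustration, here are hedged sketches of the spelling-correction and query-expansion steps; app.py may use different candidate vocabularies, thresholds, or tie-breaking.

```python
# Sketches of edit-distance spelling correction and WordNet expansion as described
# above. app.py may use different vocabularies, thresholds, or tie-breaking.
import nltk
from nltk.corpus import words, wordnet
from nltk.metrics.distance import edit_distance

nltk.download("words", quiet=True)
nltk.download("wordnet", quiet=True)
VOCAB = set(words.words())

def correct(term: str, max_dist: int = 2) -> str:
    """Return the closest in-vocabulary word within max_dist edits, else the term itself."""
    if term in VOCAB:
        return term
    candidates = [(edit_distance(term, w), w)
                  for w in VOCAB
                  if w and w[0] == term[0] and abs(len(w) - len(term)) <= max_dist]
    if not candidates:
        return term
    dist, best = min(candidates)
    return best if dist <= max_dist else term

def expand(term: str, max_synonyms: int = 3) -> list[str]:
    """Collect a few WordNet synonyms for the term."""
    synonyms = {lemma.name().replace("_", " ")
                for synset in wordnet.synsets(term)
                for lemma in synset.lemmas()}
    synonyms.discard(term)
    return sorted(synonyms)[:max_synonyms]

print(correct("retreival"))   # e.g. "retrieval" (within edit distance 2)
print(expand("engine"))       # may include off-domain synonyms such as "locomotive"
```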
| Decision | Trade-off | Rationale |
|---|---|---|
| In-Memory Index | RAM usage vs speed | Acceptable for <1000 docs, enables ~2-3ms query time |
| JSON Index Format | File size vs readability | Human-readable for debugging, acceptable for project scale |
| Fragment Stripping | Lose section-specific indexing | Ensures diverse corpus, prevents duplicates |
| No Content Deduplication | URL aliases create duplicates | Computational cost too high for ~2-3% edge cases |
| Optional Semantic Search | Complexity and latency | Flexibility without forcing 50x latency increase |
| Edit Distance ≤ 2 | Miss some misspellings | Balances correction recall vs over-correction |
- URL Aliases: Wikipedia serves identical content at different URLs (~2-3% of corpus)
- Semantic Search Latency: ~50-100ms per query vs ~2ms for TF-IDF
- Query Expansion Noise: WordNet synonyms may not fit IR domain (e.g., "engine" → "locomotive")
- Spelling Correction: Only corrects terms within edit distance that exist in NLTK corpus
- Scale: In-memory index suitable for <1000 documents; larger corpora need database backing
- Content-based deduplication using MinHash or SimHash
- Filter query expansion synonyms against corpus vocabulary
- Persistent database backend (Elasticsearch) for production scale
- Learning-to-Rank (LTR) with feature extraction and model training
- Rocchio algorithm for relevance feedback and query reformulation
- Fine-tuned semantic search model for domain-specific queries
- Query logging and analytics
- Document snippets with term highlighting
- BM25 scoring as an alternative to TF-IDF (see the sketch below)
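For reference (not implemented in this project), a sketch of one common Okapi BM25 formulation that could replace the TF-IDF scorer; `k1` and `b` use typical values.

```python
# Sketch of Okapi BM25 scoring (listed above as a future alternative to TF-IDF).
# Not implemented here; k1 and b are conventional defaults.
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_doc_len, k1=1.5, b=0.75):
    score = 0.0
    doc_len = len(doc_terms)
    for term in query_terms:
        tf = doc_terms.count(term)
        df = doc_freqs.get(term, 0)
        if tf == 0 or df == 0:
            continue
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return score

print(bm25_score(["retrieval"], ["information", "retrieval", "retrieval"],
                 {"retrieval": 10}, n_docs=200, avg_doc_len=500))
```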
MIT License
Copyright (c) 2025 Ishaan Goel
- Built for CS 429 (Information Retrieval) at Illinois Institute of Technology
- Uses Wikipedia content under Creative Commons Attribution-ShareAlike 3.0 license
- Powered by open-source libraries: Scrapy, Scikit-Learn, Flask, NLTK, FAISS
Ishaan Goel (GitHub: @igoeldc)