
Information Retrieval Engine

Master IAD S2 — Data Science Project · Team Rocket · Université Paris · 2025–2026

A full information retrieval system built on 216,000 Stack Exchange documents across five technical domains. The pipeline combines classical lexical retrieval (TF-IDF, BM25+) with dense semantic embeddings and a learned category classifier to maximise recall on an open Kaggle competition.

Results

Phase 1 — Retrieval Baseline

| Method | Recall@100 |
|---|---|
| TF-IDF | |
| BM25+ | |
| Embeddings (all-MiniLM-L12-v2) | |
| Hybrid BM25+ + Embeddings (RRF) | **best** |
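Recall@100 is the share of each query's ground-truth documents recovered in the top 100 results, averaged over queries. A minimal sketch (function names are illustrative, not necessarily the `metrics.py` API):

```python
def recall_at_k(retrieved, relevant, k=100):
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_recall_at_k(runs, ground_truth, k=100):
    """Average recall@k over all queries; runs and ground_truth map query id -> doc ids."""
    scores = [recall_at_k(runs[q], ground_truth[q], k) for q in ground_truth]
    return sum(scores) / len(scores)
```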

Submitted configuration: rrf_k=20, weight_embeddings=3.5, weight_bm25=0.5

Phase 2 — Classification + Reranking

| Score | Classifier | Reranking | RETRIEVAL_K | rrf_k |
|---|---|---|---|---|
| 0.450 | LogReg | soft_boost | 200 | 20 |
| 0.469 | SVC | hard_filter | 500 | 60 |
| 0.400 | SVC | hard_filter + cross-encoder | 1000 | 60 |
| 0.495 | SVC | hard_filter | 1000 | 30 |

Best Kaggle score: 0.49541 using TF-IDF + LinearSVC (macro-F1 = 0.9852 on validation) with hard-filter reranking and a pool of 1,000 candidates per query.


Report

The project report is included directly in this repository: Project_Report-Data_Science-Phase_2.pdf.


Project Structure

.
├── src/
│   ├── config.py                   # Runtime path resolution, Phase1Config, Phase2Config
│   ├── data/
│   │   ├── load.py                 # load_all(), load_json()
│   │   └── preprocess.py           # add_content_field(), clean_text()
│   ├── retrieval/
│   │   ├── tfidf.py                # TF-IDF retrieval
│   │   ├── bm25.py                 # BM25+ retrieval (tech-aware tokenizer)
│   │   ├── embeddings.py           # SentenceTransformers + SHA-256 cache
│   │   └── hybrid.py               # RRF fusion of BM25 and embeddings
│   ├── classification/
│   │   ├── interfaces.py           # ClassifierProtocol, CATEGORIES
│   │   ├── features.py             # TF-IDF / count / embedding feature builders
│   │   ├── model.py                # Unified Classifier wrapper (SVC, LogReg, MLP, NB)
│   │   └── rerank.py               # hard_filter, soft_boost, build_docs_index
│   ├── evaluation/
│   │   ├── metrics.py              # Precision@k, Recall@k, MRR@k, macro-F1, accuracy
│   │   └── evaluate.py             # evaluate_docs_vs_queries, evaluate_with_classification
│   └── kaggle/
│       ├── format.py               # make_submission(), save_submission() — Phase 1 & 2
│       ├── submit_prep.py          # Phase 1 CLI submission pipeline
│       └── submit_phase2.py        # Phase 2 submission pipeline
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_tfidf_retrieval.ipynb
│   ├── 03_bm25_retrieval.ipynb
│   ├── 04_embeddings_retrieval.ipynb
│   ├── 05_evaluation_suite.ipynb
│   ├── 06_phase1_submission.ipynb              # Standalone Phase 1 submission
│   ├── 07_phase2_classification_ablation.ipynb # 11-combination ablation study
│   ├── 08_phase2_pipeline_eval.ipynb           # End-to-end Phase 2 evaluation
│   └── 09_phase2_submission.ipynb              # Standalone Phase 2 submission (Kaggle-ready)
│
├── data/
│   ├── raw/        # docs.json, queries_train.json, queries_test.json, qgts_train.json
│   ├── processed/  # docs_with_content.json, queries_*_with_content.json
│   └── cache/      # Embedding arrays (.npy), keyed by SHA-256 content fingerprint
│
├── outputs/
│   ├── runs/           # Evaluation CSVs and Phase 2 run JSON logs
│   └── submissions/    # Kaggle submission CSVs
│
├── requirements.txt
├── Project_Report-Data_Science-Phase_2.pdf
└── Sujet_Project_2026.pdf

Dataset

  • 216,041 documents from Stack Exchange: id, title, text, tags, category
  • 5 categories: android, gaming, programmers, tex, unix
  • Train queries: ground truth available in qgts_train.json (~9.1 relevant docs/query on average)
  • Test queries: 141 queries without ground truth (Kaggle evaluation)
  • content field: concatenation of title + text + tags, built by add_content_field()
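The content field described above can be built roughly as follows (a sketch only; the actual `add_content_field()` in `src/data/preprocess.py` may handle fields differently):

```python
def add_content_field(doc):
    """Concatenate title, text, and tags into one searchable 'content' field.
    Illustrative sketch of the behavior described in the dataset section."""
    tags = doc.get("tags", [])
    if isinstance(tags, list):
        tags = " ".join(tags)
    doc["content"] = " ".join(
        part for part in (doc.get("title", ""), doc.get("text", ""), tags) if part
    )
    return doc
```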

Architecture

Phase 1 — Hybrid Retrieval

Query
  ├─► BM25+ retrieval ──────────────────────┐
  │   (tech-aware tokenizer: c++, c#, .net) │
  │                                         ├─► RRF fusion ─► Top-100 docs
  └─► Dense embedding retrieval ────────────┘
      (all-MiniLM-L12-v2, cached .npy)
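The `.npy` cache keyed by a SHA-256 content fingerprint can be sketched like this (paths and naming are illustrative, not necessarily what `embeddings.py` does):

```python
import hashlib
from pathlib import Path

import numpy as np

def cached_encode(texts, model, cache_dir="data/cache"):
    """Encode texts, reusing a .npy file keyed by a SHA-256 hash of the corpus."""
    fingerprint = hashlib.sha256("\n".join(texts).encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"emb_{fingerprint}.npy"
    if path.exists():
        return np.load(path)      # cache hit: skip re-encoding
    emb = model.encode(texts)     # e.g. a SentenceTransformer model
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, emb)
    return emb
```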

RRF score: score(d) = w_emb / (k + rank_emb(d)) + w_bm25 / (k + rank_bm25(d))
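In code, the weighted RRF fusion above amounts to the following (a sketch consistent with the formula; the actual signature in `hybrid.py` may differ):

```python
def rrf_fuse(rank_emb, rank_bm25, k=20, w_emb=3.5, w_bm25=0.5, top_n=100):
    """Weighted reciprocal-rank fusion of two ranked lists of doc ids.
    score(d) = w_emb / (k + rank_emb(d)) + w_bm25 / (k + rank_bm25(d))"""
    scores = {}
    for rank, doc in enumerate(rank_emb, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_emb / (k + rank)
    for rank, doc in enumerate(rank_bm25, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_bm25 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document missing from one list simply contributes nothing from that list, which is what makes RRF robust to disagreement between the two retrievers.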

Phase 2 — Classification + Reranking

Query
  ├─► TF-IDF features ─► LinearSVC ─► predicted_category
  │                       (macro-F1 = 0.9852)
  │
  ├─► Query expansion  (category-specific terms)
  ├─► PRF              (BM25 top-3 docs → extract tags → append)
  │
  ├─► Hybrid retrieval (RETRIEVAL_K = 1000 candidates)
  │
  └─► Hard-filter reranking
      (category-matching docs moved to front → Top-100)
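The hard-filter step can be sketched as a stable partition that moves category-matching candidates to the front (illustrative; the real `hard_filter` in `rerank.py` may differ):

```python
def hard_filter(candidates, doc_categories, predicted_category, top_k=100):
    """Move docs whose category matches the classifier's prediction to the front,
    preserving the retrieval order within each group, then truncate to top_k."""
    matching = [d for d in candidates if doc_categories.get(d) == predicted_category]
    others = [d for d in candidates if doc_categories.get(d) != predicted_category]
    return (matching + others)[:top_k]
```

With RETRIEVAL_K = 1000 candidates feeding this step, mis-categorized but relevant documents can still survive into the top 100, which is why a hard filter over a large pool outperformed filtering a small one.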

Classifier Ablation (notebook 07)

| Features | Classifier | Macro-F1 |
|---|---|---|
| TF-IDF | LinearSVC | 0.9852 |
| TF-IDF | LogisticRegression | 0.9794 |
| Count | LinearSVC | 0.9793 |
| Embeddings | MLP | 0.9695 |
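The winning cell of the ablation (TF-IDF features + LinearSVC) corresponds to a standard scikit-learn pipeline along these lines (hyperparameters and the toy data below are illustrative, not the repository's exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; the real classifier trains on the Stack Exchange corpus.
texts = ["grep pipe bash shell", "latex tikz figure", "android intent activity", "awk sed terminal"]
labels = ["unix", "tex", "android", "unix"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1), LinearSVC())
clf.fit(texts, labels)

print(f1_score(labels, clf.predict(texts), average="macro"))  # macro-F1 on the toy training set
```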

Setup

1. Clone

git clone https://github.com/thmsgo18/Information-Retrieval-Engine.git
cd Information-Retrieval-Engine

2. Virtual environment

python3 -m venv venv
source venv/bin/activate        # macOS / Linux
# venv\Scripts\activate         # Windows

3. Install dependencies

pip install -r requirements.txt

4. Download the competition data

Configure your Kaggle API token first (one-time):

# 1. Go to https://www.kaggle.com/settings → API → Create New Token
# 2. Move the downloaded kaggle.json to:
mkdir -p ~/.kaggle && mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

Then download the data:

cd data/raw
kaggle competitions download -c retrieval-engine-competition
unzip retrieval-engine-competition.zip && rm retrieval-engine-competition.zip
cd ../..

Expected files in data/raw/:

docs.json
queries_train.json
queries_test.json
qgts_train.json

Running the Pipelines

Before running any pipeline, move to the repository root and activate the virtual environment:

cd Information-Retrieval-Engine
source venv/bin/activate

Phase 1 — CLI

Run the Phase 1 submission pipeline directly from src/:

# Full run (production config)
python3 -m src.kaggle.submit_prep

# Fast local test (subset + smaller model)
python3 -m src.kaggle.submit_prep \
  --max-docs 20000 \
  --embedding-model-name all-MiniLM-L6-v2 \
  --embedding-precision int8

Output: outputs/submissions/submission.csv

Phase 1 — Notebook

Launch Jupyter from the project root:

jupyter lab

Then open notebooks/06_phase1_submission.ipynb and run all cells. This is the standalone notebook version of the final Phase 1 hybrid submission pipeline.

Expected output: outputs/submissions/submission.csv

Phase 2 — CLI

Run the Phase 2 end-to-end pipeline from src/:

# Default configuration
python3 -m src.kaggle.submit_phase2

# Example with explicit options
python3 -m src.kaggle.submit_phase2 \
  --classifier svc \
  --feature-method tfidf \
  --rerank hard_filter \
  --embedding-model all-MiniLM-L12-v2

Default outputs:

  • outputs/submissions/submission.csv
  • outputs/runs/phase2/run_YYYYMMDD_HHMMSS.json

Phase 2 — Notebook

Launch Jupyter from the project root:

jupyter lab

Then open notebooks/09_phase2_submission.ipynb and run all cells. It is fully self-contained and auto-detects the execution environment:

| Environment | Data input | Output |
|---|---|---|
| Local | data/raw/ | outputs/submissions/submission_phase2.csv |
| Kaggle | /kaggle/input/*/ | /kaggle/working/submission.csv |
| Colab | Files next to notebook | submission.csv |

Dependencies are auto-installed on Kaggle and Colab:

# Handled automatically in the first cell — no manual action required

Key configuration (cell 10):

CLASSIFIER_METHOD        = 'svc'
FEATURE_METHOD           = 'tfidf'
RERANK_STRATEGY          = 'hard_filter'
TOP_K                    = 100
EMBEDDING_MODEL_NAME     = 'all-MiniLM-L12-v2'
HYBRID_RRF_K             = 30
HYBRID_WEIGHT_EMBEDDINGS = 3.5
HYBRID_WEIGHT_BM25       = 0.5
RETRIEVAL_K              = TOP_K * 10   # 1000 candidates

Notebooks

| Notebook | Phase | Description |
|---|---|---|
| 01_data_exploration | 1 | Dataset statistics, EDA, content field construction |
| 02_tfidf_retrieval | 1 | TF-IDF vectorization, cosine retrieval, evaluation |
| 03_bm25_retrieval | 1 | BM25 / BM25+ lexical retrieval, comparison with TF-IDF |
| 04_embeddings_retrieval | 1 | SentenceTransformer embeddings, UMAP/t-SNE visualization, semantic retrieval |
| 05_evaluation_suite | 1 | Unified comparison across all methods (Precision@k, Recall@k, MRR@k) |
| 06_phase1_submission | 1 | Standalone hybrid retrieval pipeline — produces Kaggle CSV |
| 07_phase2_classification_ablation | 2 | 11-combination ablation (3 feature types × 4 classifiers) |
| 08_phase2_pipeline_eval | 2 | End-to-end Phase 2 pipeline evaluation on train queries |
| 09_phase2_submission | 2 | Standalone Phase 2 pipeline — Kaggle/Colab/Local compatible |

Recommended execution order: 01 → 02 → 03 → 04 → 05 → 06 for Phase 1, then 07 → 08 → 09 for Phase 2.


Technologies

| Library | Role |
|---|---|
| numpy / scipy | Numerical operations, sparse matrices, similarity computation |
| pandas | Result tables, evaluation summaries, submission formatting |
| scikit-learn | TF-IDF, cosine similarity, LinearSVC, LogisticRegression, MLP, train/val/test splits |
| rank-bm25 | BM25 / BM25+ lexical retrieval |
| sentence-transformers | Dense document and query embeddings |
| torch / transformers | Embedding model backend |
| umap-learn | 2D projection of embedding spaces (notebook 04) |
| tqdm | Progress bars throughout the pipelines |
| kaggle | Competition data download and submission CLI |

Python: 3.10+ · Key versions: numpy==2.3, scikit-learn==1.8, sentence-transformers==5.2, torch==2.10


Authors

| Name | GitHub |
|---|---|
| Thomas Gourmelen | @thmsgo18 |
| Maria Aydin | @Mmajora53 |
| Vincent Tan | @20centan |
| Clara Ait Mokhtar | @claraait123 |

Master IAD S2 — Université Paris — 2025/2026
