
Information Retrieval Engine

Master IAD S2 — Data Science Project · Team Rocket · Université Paris · 2025–2026

A full information retrieval system built on 216,000 Stack Exchange documents across five technical domains. The pipeline combines classical lexical retrieval (TF-IDF, BM25+) with dense semantic embeddings and a learned category classifier to maximise recall on an open Kaggle competition.

Results

Phase 1 — Retrieval Baseline

| Method | Recall@100 |
|---|---|
| TF-IDF | |
| BM25+ | |
| Embeddings (all-MiniLM-L12-v2) | |
| Hybrid BM25+ + Embeddings (RRF) | **best** |
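Recall@100 is the share of each query's ground-truth documents recovered in the top 100 results, averaged over queries. A minimal sketch (function names are illustrative, not necessarily the `metrics.py` API):

```python
def recall_at_k(retrieved, relevant, k=100):
    """Fraction of the relevant docs that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mean_recall_at_k(runs, ground_truth, k=100):
    """Average recall@k over all queries; runs and ground_truth map query id -> doc ids."""
    scores = [recall_at_k(runs[q], ground_truth[q], k) for q in ground_truth]
    return sum(scores) / len(scores)
```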

Submitted configuration: rrf_k=20, weight_embeddings=3.5, weight_bm25=0.5

Phase 2 — Classification + Reranking

| Score | Classifier | Reranking | RETRIEVAL_K | rrf_k |
|---|---|---|---|---|
| 0.450 | LogReg | soft_boost | 200 | 20 |
| 0.469 | SVC | hard_filter | 500 | 60 |
| 0.400 | SVC | hard_filter + cross-encoder | 1000 | 60 |
| 0.495 | SVC | hard_filter | 1000 | 30 |

Best Kaggle score: 0.49541 using TF-IDF + LinearSVC (macro-F1 = 0.9852 on validation) with hard-filter reranking and a pool of 1,000 candidates per query.


Report

The project report is included directly in this repository: Project_Report-Data_Science-Phase_2.pdf.


Project Structure

.
├── src/
│   ├── config.py                   # Runtime path resolution, Phase1Config, Phase2Config
│   ├── data/
│   │   ├── load.py                 # load_all(), load_json()
│   │   └── preprocess.py           # add_content_field(), clean_text()
│   ├── retrieval/
│   │   ├── tfidf.py                # TF-IDF retrieval
│   │   ├── bm25.py                 # BM25+ retrieval (tech-aware tokenizer)
│   │   ├── embeddings.py           # SentenceTransformers + SHA-256 cache
│   │   └── hybrid.py               # RRF fusion of BM25 and embeddings
│   ├── classification/
│   │   ├── interfaces.py           # ClassifierProtocol, CATEGORIES
│   │   ├── features.py             # TF-IDF / count / embedding feature builders
│   │   ├── model.py                # Unified Classifier wrapper (SVC, LogReg, MLP, NB)
│   │   └── rerank.py               # hard_filter, soft_boost, build_docs_index
│   ├── evaluation/
│   │   ├── metrics.py              # Precision@k, Recall@k, MRR@k, macro-F1, accuracy
│   │   └── evaluate.py             # evaluate_docs_vs_queries, evaluate_with_classification
│   └── kaggle/
│       ├── format.py               # make_submission(), save_submission() — Phase 1 & 2
│       ├── submit_prep.py          # Phase 1 CLI submission pipeline
│       └── submit_phase2.py        # Phase 2 submission pipeline
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_tfidf_retrieval.ipynb
│   ├── 03_bm25_retrieval.ipynb
│   ├── 04_embeddings_retrieval.ipynb
│   ├── 05_evaluation_suite.ipynb
│   ├── 06_phase1_submission.ipynb              # Standalone Phase 1 submission
│   ├── 07_phase2_classification_ablation.ipynb # 11-combination ablation study
│   ├── 08_phase2_pipeline_eval.ipynb           # End-to-end Phase 2 evaluation
│   └── 09_phase2_submission.ipynb              # Standalone Phase 2 submission (Kaggle-ready)
│
├── data/
│   ├── raw/        # docs.json, queries_train.json, queries_test.json, qgts_train.json
│   ├── processed/  # docs_with_content.json, queries_*_with_content.json
│   └── cache/      # Embedding arrays (.npy), keyed by SHA-256 content fingerprint
│
├── outputs/
│   ├── runs/           # Evaluation CSVs and Phase 2 run JSON logs
│   └── submissions/    # Kaggle submission CSVs
│
├── requirements.txt
├── Project_Report-Data_Science-Phase_2.pdf
└── Sujet_Project_2026.pdf

Dataset

  • 216,041 documents from Stack Exchange: id, title, text, tags, category
  • 5 categories: android, gaming, programmers, tex, unix
  • Train queries: ground truth available in qgts_train.json (~9.1 relevant docs/query on average)
  • Test queries: 141 queries without ground truth (Kaggle evaluation)
  • content field: concatenation of title + text + tags, built by add_content_field()
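The content field described above can be built roughly as follows (a sketch only; the actual `add_content_field()` in `src/data/preprocess.py` may handle fields differently):

```python
def add_content_field(doc):
    """Concatenate title, text, and tags into one searchable 'content' field.
    Illustrative sketch of the behavior described in the dataset section."""
    tags = doc.get("tags", [])
    if isinstance(tags, list):
        tags = " ".join(tags)
    doc["content"] = " ".join(
        part for part in (doc.get("title", ""), doc.get("text", ""), tags) if part
    )
    return doc
```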

Architecture

Phase 1 — Hybrid Retrieval

Query
  ├─► BM25+ retrieval ──────────────────────┐
  │   (tech-aware tokenizer: c++, c#, .net) │
  │                                         ├─► RRF fusion ─► Top-100 docs
  └─► Dense embedding retrieval ────────────┘
      (all-MiniLM-L12-v2, cached .npy)
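The `.npy` cache keyed by a SHA-256 content fingerprint can be sketched like this (paths and naming are illustrative, not necessarily what `embeddings.py` does):

```python
import hashlib
from pathlib import Path

import numpy as np

def cached_encode(texts, model, cache_dir="data/cache"):
    """Encode texts, reusing a .npy file keyed by a SHA-256 hash of the corpus."""
    fingerprint = hashlib.sha256("\n".join(texts).encode("utf-8")).hexdigest()
    path = Path(cache_dir) / f"emb_{fingerprint}.npy"
    if path.exists():
        return np.load(path)      # cache hit: skip re-encoding
    emb = model.encode(texts)     # e.g. a SentenceTransformer model
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, emb)
    return emb
```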

RRF score: score(d) = w_emb / (k + rank_emb(d)) + w_bm25 / (k + rank_bm25(d))
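In code, the weighted RRF fusion above amounts to the following (a sketch consistent with the formula; the actual signature in `hybrid.py` may differ):

```python
def rrf_fuse(rank_emb, rank_bm25, k=20, w_emb=3.5, w_bm25=0.5, top_n=100):
    """Weighted reciprocal-rank fusion of two ranked lists of doc ids.
    score(d) = w_emb / (k + rank_emb(d)) + w_bm25 / (k + rank_bm25(d))"""
    scores = {}
    for rank, doc in enumerate(rank_emb, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_emb / (k + rank)
    for rank, doc in enumerate(rank_bm25, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_bm25 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A document missing from one list simply contributes nothing from that list, which is what makes RRF robust to disagreement between the two retrievers.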

Phase 2 — Classification + Reranking

Query
  ├─► TF-IDF features ─► LinearSVC ─► predicted_category
  │                       (macro-F1 = 0.9852)
  │
  ├─► Query expansion  (category-specific terms)
  ├─► PRF              (BM25 top-3 docs → extract tags → append)
  │
  ├─► Hybrid retrieval (RETRIEVAL_K = 1000 candidates)
  │
  └─► Hard-filter reranking
      (category-matching docs moved to front → Top-100)
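The hard-filter step can be sketched as a stable partition that moves category-matching candidates to the front (illustrative; the real `hard_filter` in `rerank.py` may differ):

```python
def hard_filter(candidates, doc_categories, predicted_category, top_k=100):
    """Move docs whose category matches the classifier's prediction to the front,
    preserving the retrieval order within each group, then truncate to top_k."""
    matching = [d for d in candidates if doc_categories.get(d) == predicted_category]
    others = [d for d in candidates if doc_categories.get(d) != predicted_category]
    return (matching + others)[:top_k]
```

With RETRIEVAL_K = 1000 candidates feeding this step, mis-categorized but relevant documents can still survive into the top 100, which is why a hard filter over a large pool outperformed filtering a small one.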

Classifier Ablation (notebook 07)

| Features | Classifier | Macro-F1 |
|---|---|---|
| TF-IDF | LinearSVC | 0.9852 |
| TF-IDF | LogisticRegression | 0.9794 |
| Count | LinearSVC | 0.9793 |
| Embeddings | MLP | 0.9695 |
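The winning cell of the ablation (TF-IDF features + LinearSVC) corresponds to a standard scikit-learn pipeline along these lines (hyperparameters and the toy data below are illustrative, not the repository's exact configuration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data; the real classifier trains on the Stack Exchange corpus.
texts = ["grep pipe bash shell", "latex tikz figure", "android intent activity", "awk sed terminal"]
labels = ["unix", "tex", "android", "unix"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1), LinearSVC())
clf.fit(texts, labels)

print(f1_score(labels, clf.predict(texts), average="macro"))  # macro-F1 on the toy training set
```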

Setup

1. Clone

git clone https://github.com/thmsgo18/Information-Retrieval-Engine.git
cd Information-Retrieval-Engine

2. Virtual environment

python3 -m venv venv
source venv/bin/activate        # macOS / Linux
# venv\Scripts\activate         # Windows

3. Install dependencies

pip install -r requirements.txt

4. Download the competition data

Configure your Kaggle API token first (one-time):

# 1. Go to https://www.kaggle.com/settings → API → Create New Token
# 2. Move the downloaded kaggle.json to:
mkdir -p ~/.kaggle && mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

Then download the data:

cd data/raw
kaggle competitions download -c retrieval-engine-competition
unzip retrieval-engine-competition.zip && rm retrieval-engine-competition.zip
cd ../..

Expected files in data/raw/:

docs.json
queries_train.json
queries_test.json
qgts_train.json

Running the Pipelines

Before running any pipeline, move to the repository root and activate the virtual environment:

cd Information-Retrieval-Engine
source venv/bin/activate

Phase 1 — CLI

Run the Phase 1 submission pipeline directly from src/:

# Full run (production config)
python3 -m src.kaggle.submit_prep

# Fast local test (subset + smaller model)
python3 -m src.kaggle.submit_prep \
  --max-docs 20000 \
  --embedding-model-name all-MiniLM-L6-v2 \
  --embedding-precision int8

Output: outputs/submissions/submission.csv

Phase 1 — Notebook

Launch Jupyter from the project root:

jupyter lab

Then open notebooks/06_phase1_submission.ipynb and run all cells. This is the standalone notebook version of the final Phase 1 hybrid submission pipeline.

Expected output: outputs/submissions/submission.csv

Phase 2 — CLI

Run the Phase 2 end-to-end pipeline from src/:

# Default configuration
python3 -m src.kaggle.submit_phase2

# Example with explicit options
python3 -m src.kaggle.submit_phase2 \
  --classifier svc \
  --feature-method tfidf \
  --rerank hard_filter \
  --embedding-model all-MiniLM-L12-v2

Default outputs:

  • outputs/submissions/submission.csv
  • outputs/runs/phase2/run_YYYYMMDD_HHMMSS.json

Phase 2 — Notebook

Launch Jupyter from the project root:

jupyter lab

Then open notebooks/09_phase2_submission.ipynb and run all cells. It is fully self-contained and auto-detects the execution environment:

| Environment | Data input | Output |
|---|---|---|
| Local | data/raw/ | outputs/submissions/submission_phase2.csv |
| Kaggle | /kaggle/input/*/ | /kaggle/working/submission.csv |
| Colab | Files next to notebook | submission.csv |

Dependencies are auto-installed on Kaggle and Colab:

# Handled automatically in the first cell — no manual action required

Key configuration (cell 10):

CLASSIFIER_METHOD        = 'svc'
FEATURE_METHOD           = 'tfidf'
RERANK_STRATEGY          = 'hard_filter'
TOP_K                    = 100
EMBEDDING_MODEL_NAME     = 'all-MiniLM-L12-v2'
HYBRID_RRF_K             = 30
HYBRID_WEIGHT_EMBEDDINGS = 3.5
HYBRID_WEIGHT_BM25       = 0.5
RETRIEVAL_K              = TOP_K * 10   # 1000 candidates

Notebooks

| Notebook | Phase | Description |
|---|---|---|
| 01_data_exploration | 1 | Dataset statistics, EDA, content field construction |
| 02_tfidf_retrieval | 1 | TF-IDF vectorization, cosine retrieval, evaluation |
| 03_bm25_retrieval | 1 | BM25 / BM25+ lexical retrieval, comparison with TF-IDF |
| 04_embeddings_retrieval | 1 | SentenceTransformer embeddings, UMAP/t-SNE visualization, semantic retrieval |
| 05_evaluation_suite | 1 | Unified comparison across all methods (Precision@k, Recall@k, MRR@k) |
| 06_phase1_submission | 1 | Standalone hybrid retrieval pipeline — produces Kaggle CSV |
| 07_phase2_classification_ablation | 2 | 11-combination ablation (3 feature types × 4 classifiers) |
| 08_phase2_pipeline_eval | 2 | End-to-end Phase 2 pipeline evaluation on train queries |
| 09_phase2_submission | 2 | Standalone Phase 2 pipeline — Kaggle/Colab/Local compatible |

Recommended execution order: 01 → 02 → 03 → 04 → 05 → 06 for Phase 1, then 07 → 08 → 09 for Phase 2.


Technologies

| Library | Role |
|---|---|
| numpy / scipy | Numerical operations, sparse matrices, similarity computation |
| pandas | Result tables, evaluation summaries, submission formatting |
| scikit-learn | TF-IDF, cosine similarity, LinearSVC, LogisticRegression, MLP, train/val/test splits |
| rank-bm25 | BM25 / BM25+ lexical retrieval |
| sentence-transformers | Dense document and query embeddings |
| torch / transformers | Embedding model backend |
| umap-learn | 2D projection of embedding spaces (notebook 04) |
| tqdm | Progress bars throughout the pipelines |
| kaggle | Competition data download and submission CLI |

Python: 3.10+ · Key versions: numpy==2.3, scikit-learn==1.8, sentence-transformers==5.2, torch==2.10


Authors

| Name | GitHub |
|---|---|
| Thomas Gourmelen | @thmsgo18 |
| Maria Aydin | @Mmajora53 |
| Vincent Tan | @20centan |
| Clara Ait Mokhtar | @claraait123 |

Master IAD S2 — Université Paris — 2025/2026
