Master IAD S2 — Data Science Project Team Rocket · Université Paris · 2025–2026
A full information retrieval system built on 216,000 Stack Exchange documents across five technical domains. The pipeline combines classical lexical retrieval (TF-IDF, BM25+) with dense semantic embeddings and a learned category classifier to maximise recall on an open Kaggle competition.
- Results
- Report
- Project Structure
- Dataset
- Architecture
- Setup
- Running the Pipelines
- Notebooks
- Technologies
- Authors
| Method | Recall@100 |
|---|---|
| TF-IDF | — |
| BM25+ | — |
| Embeddings (all-MiniLM-L12-v2) | — |
| Hybrid BM25+ + Embeddings (RRF) | best ✅ |
Submitted configuration: `rrf_k=20`, `weight_embeddings=3.5`, `weight_bm25=0.5`
| Score | Classifier | Reranking | RETRIEVAL_K | rrf_k |
|---|---|---|---|---|
| 0.450 | LogReg | soft_boost | 200 | 20 |
| 0.469 | SVC | hard_filter | 500 | 60 |
| 0.400 | SVC | hard_filter + cross-encoder | 1000 | 60 |
| 0.495 | SVC | hard_filter | 1000 | 30 |
Best Kaggle score: 0.49541 using TF-IDF + LinearSVC (macro-F1 = 0.9852 on validation) with hard-filter reranking and a pool of 1,000 candidates per query.
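Recall@100 measures the fraction of a query's ground-truth relevant documents that appear among the top 100 retrieved. A minimal version in the spirit of the project's `src/evaluation/metrics.py` (function name and signature illustrative):

```python
def recall_at_k(retrieved, relevant, k=100):
    """Fraction of the relevant docs found in the top-k retrieved list."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

# 2 of the 3 relevant docs appear in the top-3 → 0.666...
print(recall_at_k(["d1", "d2", "d5"], ["d1", "d5", "d9"], k=3))
```

The Kaggle leaderboard score in the tables above is this metric averaged over the test queries.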
The project report is included directly in this repository: `Project_Report-Data_Science-Phase_2.pdf`.
```
├── src/
│   ├── config.py                 # Runtime path resolution, Phase1Config, Phase2Config
│   ├── data/
│   │   ├── load.py               # load_all(), load_json()
│   │   └── preprocess.py         # add_content_field(), clean_text()
│   ├── retrieval/
│   │   ├── tfidf.py              # TF-IDF retrieval
│   │   ├── bm25.py               # BM25+ retrieval (tech-aware tokenizer)
│   │   ├── embeddings.py         # SentenceTransformers + SHA-256 cache
│   │   └── hybrid.py             # RRF fusion of BM25 and embeddings
│   ├── classification/
│   │   ├── interfaces.py         # ClassifierProtocol, CATEGORIES
│   │   ├── features.py           # TF-IDF / count / embedding feature builders
│   │   ├── model.py              # Unified Classifier wrapper (SVC, LogReg, MLP, NB)
│   │   └── rerank.py             # hard_filter, soft_boost, build_docs_index
│   ├── evaluation/
│   │   ├── metrics.py            # Precision@k, Recall@k, MRR@k, macro-F1, accuracy
│   │   └── evaluate.py           # evaluate_docs_vs_queries, evaluate_with_classification
│   └── kaggle/
│       ├── format.py             # make_submission(), save_submission() — Phase 1 & 2
│       ├── submit_prep.py        # Phase 1 CLI submission pipeline
│       └── submit_phase2.py      # Phase 2 submission pipeline
│
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_tfidf_retrieval.ipynb
│   ├── 03_bm25_retrieval.ipynb
│   ├── 04_embeddings_retrieval.ipynb
│   ├── 05_evaluation_suite.ipynb
│   ├── 06_phase1_submission.ipynb               # Standalone Phase 1 submission
│   ├── 07_phase2_classification_ablation.ipynb  # 11-combination ablation study
│   ├── 08_phase2_pipeline_eval.ipynb            # End-to-end Phase 2 evaluation
│   └── 09_phase2_submission.ipynb               # Standalone Phase 2 submission (Kaggle-ready)
│
├── data/
│   ├── raw/          # docs.json, queries_train.json, queries_test.json, qgts_train.json
│   ├── processed/    # docs_with_content.json, queries_*_with_content.json
│   └── cache/        # Embedding arrays (.npy), keyed by SHA-256 content fingerprint
│
├── outputs/
│   ├── runs/         # Evaluation CSVs and Phase 2 run JSON logs
│   └── submissions/  # Kaggle submission CSVs
│
├── requirements.txt
├── Project_Report-Data_Science-Phase_2.pdf
└── Sujet_Project_2026.pdf
```
- 216,041 documents from Stack Exchange with fields `id`, `title`, `text`, `tags`, `category`
- 5 categories: `android`, `gaming`, `programmers`, `tex`, `unix`
- Train queries: ground truth available in `qgts_train.json` (~9.1 relevant docs/query on average)
- Test queries: 141 queries without ground truth (Kaggle evaluation)
- `content` field: concatenation of `title + text + tags`, built by `add_content_field()`
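The `content` field construction can be sketched as follows; the real `add_content_field()` lives in `src/data/preprocess.py` and may differ in detail (field handling, cleaning):

```python
def add_content_field(docs):
    """Concatenate title, text and tags into a single searchable `content`
    field for each document dict (sketch of the preprocessing step)."""
    for doc in docs:
        tags = " ".join(doc.get("tags", []))
        doc["content"] = f"{doc.get('title', '')} {doc.get('text', '')} {tags}".strip()
    return docs

docs = [{"id": 1, "title": "Segfault in C", "text": "Why?", "tags": ["c", "gdb"]}]
add_content_field(docs)
print(docs[0]["content"])  # → "Segfault in C Why? c gdb"
```

Indexing one combined field lets a single retriever score title, body, and tag matches together instead of maintaining per-field indexes.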
```
Query
 ├─► BM25+ retrieval ───────────────────────┐
 │   (tech-aware tokenizer: c++, c#, .net)  │
 │                                          ├─► RRF fusion ─► Top-100 docs
 └─► Dense embedding retrieval ─────────────┘
     (all-MiniLM-L12-v2, cached .npy)
```

RRF score: `score(d) = w_emb / (k + rank_emb(d)) + w_bm25 / (k + rank_bm25(d))`
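A minimal weighted Reciprocal Rank Fusion implementing that formula (the project's version is in `src/retrieval/hybrid.py`; function name here is illustrative):

```python
def rrf_fuse(emb_ranking, bm25_ranking, k=20, w_emb=3.5, w_bm25=0.5, top_n=100):
    """Fuse two ranked lists of doc ids with weighted Reciprocal Rank Fusion:
    score(d) = w_emb / (k + rank_emb(d)) + w_bm25 / (k + rank_bm25(d))."""
    scores = {}
    for rank, doc in enumerate(emb_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_emb / (k + rank)
    for rank, doc in enumerate(bm25_ranking, start=1):
        scores[doc] = scores.get(doc, 0.0) + w_bm25 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# With w_emb >> w_bm25, the embedding ranking dominates ties:
print(rrf_fuse(["a", "b"], ["b", "a"]))  # → ['a', 'b']
```

Fusing on ranks rather than raw scores sidesteps the incomparable score scales of BM25 and cosine similarity, which is why RRF needs no score normalization.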
```
Query
 ├─► TF-IDF features ─► LinearSVC ─► predicted_category
 │   (macro-F1 = 0.9852)
 │
 ├─► Query expansion (category-specific terms)
 ├─► PRF (BM25 top-3 docs → extract tags → append)
 │
 ├─► Hybrid retrieval (RETRIEVAL_K = 1000 candidates)
 │
 └─► Hard-filter reranking
     (category-matching docs moved to front → Top-100)
```
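The hard-filter strategy can be sketched as a stable partition of the candidate pool; the real `hard_filter` is in `src/classification/rerank.py` and this signature is an assumption:

```python
def hard_filter(candidates, doc_categories, predicted_category, top_k=100):
    """Move candidates whose category matches the query's predicted category
    to the front, preserving retrieval order within each group."""
    matching = [d for d in candidates if doc_categories.get(d) == predicted_category]
    others = [d for d in candidates if doc_categories.get(d) != predicted_category]
    return (matching + others)[:top_k]

ranked = ["d1", "d2", "d3"]
cats = {"d1": "unix", "d2": "tex", "d3": "unix"}
print(hard_filter(ranked, cats, "unix", top_k=3))  # → ['d1', 'd3', 'd2']
```

Because the classifier is near-perfect (macro-F1 = 0.9852), demoting off-category candidates rarely buries a relevant document, and the large `RETRIEVAL_K = 1000` pool ensures enough on-category candidates survive to fill the top 100.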
| Features | Classifier | Macro-F1 |
|---|---|---|
| TF-IDF | LinearSVC | 0.9852 ✅ |
| TF-IDF | LogisticRegression | 0.9794 |
| Count | LinearSVC | 0.9793 |
| Embeddings | MLP | 0.9695 |
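The winning feature/classifier combination is straightforward to reproduce with scikit-learn. A toy-scale sketch (the real training uses the 216k-document corpus and lives in `src/classification/`; the example data below is invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the real training data: document content + category label
texts = ["kernel panic on boot", "latex tables overflow",
         "bash script loop", "tikz node placement"]
labels = ["unix", "tex", "unix", "tex"]

# TF-IDF features into a linear SVM, as in the best ablation row
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["latex node placement"])[0])
```

A linear model over high-dimensional sparse TF-IDF features is a classic strong baseline for topic classification, which matches the ablation result that it beats the MLP-on-embeddings variant here.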
```bash
git clone https://github.com/thmsgo18/Information-Retrieval-Engine.git
cd Information-Retrieval-Engine
```

```bash
python3 -m venv venv
source venv/bin/activate    # macOS / Linux
# venv\Scripts\activate     # Windows
pip install -r requirements.txt
```

Configure your Kaggle API token first (one-time):

```bash
# 1. Go to https://www.kaggle.com/settings → API → Create New Token
# 2. Move the downloaded kaggle.json to:
mkdir -p ~/.kaggle && mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json
```

Then download the data:

```bash
cd data/raw
kaggle competitions download -c retrieval-engine-competition
unzip retrieval-engine-competition.zip && rm retrieval-engine-competition.zip
cd ../..
```

Expected files in `data/raw/`:

- `docs.json`
- `queries_train.json`
- `queries_test.json`
- `qgts_train.json`
Before running any pipeline, move to the repository root and activate the virtual environment:

```bash
cd Information-Retrieval-Engine
source venv/bin/activate
```

Run the Phase 1 submission pipeline directly from `src/`:

```bash
# Full run (production config)
python3 -m src.kaggle.submit_prep

# Fast local test (subset + smaller model)
python3 -m src.kaggle.submit_prep \
    --max-docs 20000 \
    --embedding-model-name all-MiniLM-L6-v2 \
    --embedding-precision int8
```

Output: `outputs/submissions/submission.csv`

Alternatively, launch Jupyter from the project root:

```bash
jupyter lab
```

Then open `notebooks/06_phase1_submission.ipynb` and run all cells. This is the standalone notebook version of the final Phase 1 hybrid submission pipeline.

Expected output: `outputs/submissions/submission.csv`
Run the Phase 2 end-to-end pipeline from `src/`:

```bash
# Default configuration
python3 -m src.kaggle.submit_phase2

# Example with explicit options
python3 -m src.kaggle.submit_phase2 \
    --classifier svc \
    --feature-method tfidf \
    --rerank hard_filter \
    --embedding-model all-MiniLM-L12-v2
```

Default outputs:

- `outputs/submissions/submission.csv`
- `outputs/runs/phase2/run_YYYYMMDD_HHMMSS.json`

Alternatively, launch Jupyter from the project root:

```bash
jupyter lab
```

Then open `notebooks/09_phase2_submission.ipynb` and run all cells. It is fully self-contained and auto-detects the execution environment:
| Environment | Data input | Output |
|---|---|---|
| Local | `data/raw/` | `outputs/submissions/submission_phase2.csv` |
| Kaggle | `/kaggle/input/*/` | `/kaggle/working/submission.csv` |
| Colab | Files next to notebook | `submission.csv` |
Dependencies are auto-installed on Kaggle and Colab:

```python
# Handled automatically in the first cell — no manual action required
```

Key configuration (cell 10):

```python
CLASSIFIER_METHOD = 'svc'
FEATURE_METHOD = 'tfidf'
RERANK_STRATEGY = 'hard_filter'
TOP_K = 100
EMBEDDING_MODEL_NAME = 'all-MiniLM-L12-v2'
HYBRID_RRF_K = 30
HYBRID_WEIGHT_EMBEDDINGS = 3.5
HYBRID_WEIGHT_BM25 = 0.5
RETRIEVAL_K = TOP_K * 10    # 1000 candidates
```

| Notebook | Phase | Description |
|---|---|---|
| `01_data_exploration` | 1 | Dataset statistics, EDA, content field construction |
| `02_tfidf_retrieval` | 1 | TF-IDF vectorization, cosine retrieval, evaluation |
| `03_bm25_retrieval` | 1 | BM25 / BM25+ lexical retrieval, comparison with TF-IDF |
| `04_embeddings_retrieval` | 1 | SentenceTransformer embeddings, UMAP/t-SNE visualization, semantic retrieval |
| `05_evaluation_suite` | 1 | Unified comparison across all methods (Precision@k, Recall@k, MRR@k) |
| `06_phase1_submission` | 1 | Standalone hybrid retrieval pipeline — produces Kaggle CSV |
| `07_phase2_classification_ablation` | 2 | 11-combination ablation (3 feature types × 4 classifiers) |
| `08_phase2_pipeline_eval` | 2 | End-to-end Phase 2 pipeline evaluation on train queries |
| `09_phase2_submission` | 2 | Standalone Phase 2 pipeline — Kaggle/Colab/Local compatible |
Recommended execution order:
01 → 02 → 03 → 04 → 05 → 06 for Phase 1, then 07 → 08 → 09 for Phase 2.
| Library | Role |
|---|---|
| `numpy` / `scipy` | Numerical operations, sparse matrices, similarity computation |
| `pandas` | Result tables, evaluation summaries, submission formatting |
| `scikit-learn` | TF-IDF, cosine similarity, LinearSVC, LogisticRegression, MLP, train/val/test splits |
| `rank-bm25` | BM25 / BM25+ lexical retrieval |
| `sentence-transformers` | Dense document and query embeddings |
| `torch` / `transformers` | Embedding model backend |
| `umap-learn` | 2D projection of embedding spaces (notebook 04) |
| `tqdm` | Progress bars throughout the pipelines |
| `kaggle` | Competition data download and submission CLI |
Python: 3.10+ · Key versions: `numpy==2.3`, `scikit-learn==1.8`, `sentence-transformers==5.2`, `torch==2.10`
| Name | GitHub |
|---|---|
| Thomas Gourmelen | @thmsgo18 |
| Maria Aydin | @Mmajora53 |
| Vincent Tan | @20centan |
| Clara Ait Mokhtar | @claraait123 |