nanoColBERT

This repo provides a simple implementation of the ColBERT-v1 model.

The official GitHub repo: Link (v1 branch)

ColBERT is a powerful late-interaction model that can perform both retrieval and reranking.
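To make "late interaction" concrete, here is a minimal sketch of MaxSim-style scoring: each query token embedding is matched to its most similar document token embedding, and the per-token maxima are summed. It is written in plain Python for readability and is not the code in this repo; the function names and the toy 2-d embeddings are made up for illustration, and dot-product similarity over already-normalized embeddings is an assumption.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score (hypothetical helper, not the repo's API).

    For every query token embedding, take the best similarity over all
    document token embeddings, then sum these maxima over query tokens.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy token embeddings, invented for this example.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_b = [[0.5, 0.5], [0.5, 0.5]]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because documents are encoded independently of the query, their token embeddings can be precomputed and indexed, which is what makes the same scoring usable for both retrieval and reranking.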

Get Started

conda create -n nanoColBERT python=3.8 && conda activate nanoColBERT
## install torch and faiss according to your CUDA version
pip install -r requirements.txt

Configure wandb and accelerate

wandb login
accelerate config

Once everything is set up, launch the whole process with the commands below.

(If the download link has expired, please refer to issues #5 and #2.)

bash scripts/download.sh
bash scripts/run_colbert.sh

This downloads the data, preprocesses it, trains the model, builds the index with faiss, runs retrieval, and calculates the scores.

Results

These are our reproduced results:

	MRR@10	Recall@50	Recall@200	Recall@1000
Reported	36.0	82.9	92.3	96.8
nanoColBERT	36.0	83.3	91.9	96.3
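For readers unfamiliar with the metrics in the table, the following is a rough sketch of how MRR@10 and Recall@k are commonly computed. It is not taken from `score.py`; the function names, the input layout (a dict of ranked passage ids per query and a dict of relevant passage ids per query), and the toy data are assumptions, and every query is assumed to have at least one relevant passage. The table reports these values multiplied by 100.

```python
def mrr_at_k(ranked, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage within the top k."""
    total = 0.0
    for qid, doc_ids in ranked.items():
        gold = relevant.get(qid, set())
        for rank, doc_id in enumerate(doc_ids[:k], start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked)


def recall_at_k(ranked, relevant, k):
    """Average fraction of relevant passages found within the top k."""
    total = 0.0
    for qid, doc_ids in ranked.items():
        gold = relevant.get(qid, set())
        if gold:
            total += len(gold & set(doc_ids[:k])) / len(gold)
    return total / len(ranked)


# Toy rankings and relevance judgments, invented for this example.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
relevant = {"q1": {"d1"}, "q2": {"d4", "d8"}}

mrr = mrr_at_k(ranked, relevant, k=10)
r2 = recall_at_k(ranked, relevant, k=2)
r3 = recall_at_k(ranked, relevant, k=3)
print(mrr, r2, r3)
```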

Please be aware that this repository serves solely as a conceptual guide and has not been heavily optimized for efficiency.

The table below lists the duration of each step:

Step	Duration	Remark
tsv2mmap	3h5min
train	8h54min	400k steps on 1*A100
doc2embedding	56min	8*A100
build_index	21min	30% training data with IVFPQ on 1*A100
retrieve	17min	6980 samples on 1*A100

Pretrained Ckpt

We also provide our trained model on Hugging Face, and you can load it with:

from model import ColBERT
from transformers import BertTokenizer

pretrained_model = "nanoColBERT/ColBERTv1"
model = ColBERT.from_pretrained(pretrained_model)
tokenizer = BertTokenizer.from_pretrained(pretrained_model)
