Awesome Learned Sparse Retrieval

An extensive and commented list of resources on Learned Sparse Retrieval (LSR). Most of the resources below refer to learned sparse representations for text retrieval.

LSR Models

  • From Neural Re-Ranking to Neural Ranking: Learning a Sparse Representation for Inverted Indexing
    Hamed Zamani, Mostafa Dehghani, W. Bruce Croft, Erik Learned-Miller, Jaap Kamps
    CIKM, 2018
    📄 paper
  • Expansion via Prediction of Importance with Contextualization
    Sean MacAvaney, Franco Maria Nardini, Raffaele Perego, Nicola Tonellotto, Nazli Goharian, Ophir Frieder
    SIGIR, 2020
    📄 paper
  • Context-Aware Term Weighting For First Stage Passage Retrieval
    Zhuyun Dai, Jamie Callan
    SIGIR, 2020
    📄 paper
  • A Few Brief Notes on DeepImpact, COIL, and a Conceptual Framework for Information Retrieval Techniques
    Jimmy Lin, Xueguang Ma
    CoRR, 2021
    📄 paper
  • Learning Passage Impacts for Inverted Indexes
    Antonio Mallia, Omar Khattab, Torsten Suel, Nicola Tonellotto
    SIGIR, 2021
    📄 paper
  • SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking
    Thibault Formal, Benjamin Piwowarski, Stephane Clinchant
    SIGIR, 2021
    📄 paper
  • SPLADE v2: Sparse Lexical and Expansion Model for Information Retrieval
    Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stephane Clinchant
    CoRR, 2021
    📄 paper
  • SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval
    Tiancheng Zhao, Xiaopeng Lu, Kyusong Lee
    NAACL, 2021
    📄 paper
  • TILDE: Term Independent Likelihood moDEl for Passage Re-ranking
    Shengyao Zhuang, Guido Zuccon
    SIGIR, 2021
    📄 paper
  • From Distillation to Hard Negative Sampling: Making Sparse Neural IR Models More Effective
    Thibault Formal, Carlos Lassance, Benjamin Piwowarski, Stephane Clinchant
    SIGIR, 2022
    📄 paper
  • Fast Passage Re-ranking with Contextualized Exact Term Matching and Efficient Passage Expansion
    Shengyao Zhuang, Guido Zuccon
    ReNeuIR at SIGIR, 2022
    📄 paper
  • An Efficiency Study for SPLADE Models
    Carlos Lassance, Stéphane Clinchant
    SIGIR, 2022
    📄 paper
  • Learning a Sparse Representation Model for Neural CLIR
    Suraj Nair, Eugene Yang, Dawn J Lawrie, James Mayfield, Douglas W. Oard
    DESIRES, 2022
    📄 paper
  • LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval
    Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang
    ICLR, 2023
    📄 paper
  • A Unified Framework for Learned Sparse Retrieval
    Thong Nguyen, Sean MacAvaney, Andrew Yates
    ECIR, 2023
    📄 paper
  • BLADE: Combining Vocabulary Pruning and Intermediate Pretraining for Scaleable Neural CLIR
    Suraj Nair, Eugene Yang, Dawn Lawrie, James Mayfield, Douglas W. Oard
    SIGIR, 2023
    📄 paper
  • Exploring the Representation Power of SPLADE Models
    Joel Mackenzie, Shengyao Zhuang, Guido Zuccon
    ICTIR, 2023
    📄 paper
  • Learning Sparse Lexical Representations Over Specified Vocabularies for Retrieval
    Jeffrey M Dudek, Weize Kong, Cheng Li, Mingyang Zhang, Michael Bendersky
    CIKM, 2023
    📄 paper
  • Improved Learned Sparse Retrieval with Corpus-Specific Vocabularies
    Puxuan Yu, Antonio Mallia, Matthias Petri
    ECIR, 2024
    📄 paper
  • Two-Step SPLADE: Simple, Efficient and Effective Approximation of SPLADE
    Carlos Lassance, Hervé Déjean, Stephane Clinchant, Nicola Tonellotto
    ECIR, 2024
    📄 paper
  • SPLATE: Sparse Late Interaction Retrieval
    Thibault Formal, Stephane Clinchant, Hervé Déjean, Carlos Lassance
    SIGIR, 2024
    📄 paper
  • Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control
    Thong Nguyen, Mariya Hendriksen, Andrew Yates, Maarten de Rijke
    ECIR, 2024
    📄 paper
  • DyVo: Dynamic Vocabularies for Learned Sparse Retrieval with Entities
    Thong Nguyen, Shubham Chatterjee, Sean MacAvaney, Iain Mackie, Jeff Dalton, Andrew Yates
    EMNLP, 2024
    📄 paper
  • SPLADE-v3: New baselines for SPLADE
    Carlos Lassance, Hervé Déjean, Thibault Formal, Stephane Clinchant
    CoRR, 2024
    📄 paper
  • DiSCo: LLM Knowledge Distillation for Efficient Sparse Retrieval in Conversational Search
    Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas
    SIGIR, 2025
    📄 paper
  • Towards Competitive Search Relevance For Inference-Free Learned Sparse Retrievers
    Zhichao Geng, Dongyu Ru, Yang Yang
    CoRR, 2024
    📄 paper
  • Mistral-SPLADE: LLMs for better Learned Sparse Retrieval
    Meet Doshi, Vishwajeet Kumar, Rudra Murthy, Vignesh P, Jaydeep Sen
    CoRR, 2024
    📄 paper
  • An Alternative to FLOPS Regularization to Effectively Productionize SPLADE-doc
    Aldo Porco, Dhruv Mehra, Igor Malioutov, Karthik Radhakrishnan, Moniba Keymanesh, Daniel Preotiuc-Pietro, Sean MacAvaney, Pengxiang Cheng
    SIGIR, 2025
    📄 paper
  • Effective Inference-Free Retrieval for Learned Sparse Representations
    Franco Maria Nardini, Thong Nguyen, Cosimo Rulli, Rossano Venturini, Andrew Yates
    SIGIR, 2025
    📄 paper
  • Exploring $\ell_0$ Sparsification for Inference-free Sparse Retrievers
    Xinjie Shen, Zhichao Geng, Yang Yang
    SIGIR, 2025
    📄 paper
  • Enhancing Lexicon-Based Text Embeddings with Large Language Models
    Yibin Lei, Tao Shen, Yu Cao, Andrew Yates
    ACL, 2025
    📄 paper
  • Leveraging decoder architectures for learned sparse retrieval
    Jingfen Qiao, Thong Nguyen, Evangelos Kanoulas, Andrew Yates
    International Workshop on Knowledge-Enhanced Information Retrieval, 2025
    📄 paper
  • Scaling sparse and dense retrieval in decoder-only LLMs
    Hansi Zeng, Julian Killingback, Hamed Zamani
    SIGIR, 2025
    📄 paper
  • CSPLADE: Learned Sparse Retrieval with Causal Language Models
    Zhichao Xu, Aosong Feng, Yijun Tian, Haibo Ding, Lin Lee Cheong
    CoRR, 2025
    📄 paper
  • On the Reproducibility of Learned Sparse Retrieval Adaptations for Long Documents
    Emmanouil Georgios Lionis, Jia-Huei Ju
    ECIR, 2025
    📄 paper
  • Learning Retrieval Models with Sparse Autoencoders
    Thibault Formal, Maxime Louis, Hervé Déjean, Stéphane Clinchant
    ICLR, 2026
    📄 paper
  • Milco: Learned Sparse Retrieval Across Languages via a Multilingual Connector
    Thong Nguyen, Yibin Lei, Jia-Huei Ju, Eugene Yang, Andrew Yates
    ICLR, 2026
    📄 paper
  • LACONIC: Dense-Level Effectiveness for Scalable Sparse Retrieval via a Two-Phase Training Curriculum
    Zhichao Xu, Shengyao Zhuang, Crystina Zhang, Xueguang Ma, Yijun Tian, Maitrey Mehta, Jimmy Lin, Vivek Srikumar
    SIGIR, 2026
    📄 paper
  • Sparton: Fast and Memory-Efficient Triton Kernel for Learned Sparse Retrieval
    Thong Nguyen, Cosimo Rulli, Franco Maria Nardini, Rossano Venturini, Andrew Yates
    SIGIR, 2026
    📄 paper
  • Self-Improving Sparse Retrieval Through Heuristic Representation Refinement and Representation-Focused Learning
    Xiaojing Li, Bin Wang, Xiaochun Yang, Meng Luo
    AAAI, 2026
    paper
  • From Tokens to Concepts: Leveraging SAE for SPLADE
    Yuxuan Zong, Mathias Vast, Basile Van Cooten, Laure Soulier, Benjamin Piwowarski
    SIGIR, 2026
    📄 paper
  • To Case or Not to Case: An Empirical Study in Learned Sparse Retrieval
    Emmanouil Georgios Lionis, Jia-Huei Ju, Angelos Nalmpantis, Casper Thuis, Sean MacAvaney, Andrew Yates
    ECIR, 2026
    paper

Indexing LSR

  • Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs
    Yury A. Malkov, Dmitry A. Yashunin
    IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020
    📄 paper
  • Efficiency Implications of Term Weighting for Passage Retrieval
    Joel Mackenzie, Zhuyun Dai, Luke Gallagher, Jamie Callan
    SIGIR, 2020
    📄 paper
  • Wacky Weights in Learned Sparse Representations and the Revenge of Score-at-a-Time Query Evaluation
    Joel Mackenzie, Andrew Trotman, Jimmy Lin
    CoRR, 2021
    📄 paper
  • Insights into the Efficiency of Open-Source Score-at-a-Time Search Engines: A Reproducibility Study
    Katelyn Harlan, Andrew Trotman, Veronica Liesaputra
    SIGIR, 2026
    📄 paper
  • Accelerating Learned Sparse Indexes Via Term Impact Decomposition
    Joel Mackenzie, Antonio Mallia, Alistair Moffat, Matthias Petri
    EMNLP, 2022
    📄 paper | 🛠️ code
  • An Efficiency Study for SPLADE Models
    Carlos Lassance, Stephane Clinchant
    SIGIR, 2022
    📄 paper
  • Faster Learned Sparse Retrieval with Guided Traversal
    Antonio Mallia, Joel Mackenzie, Torsten Suel, Nicola Tonellotto
    SIGIR, 2022
    📄 paper
  • IOQP: A simple Impact-Ordered Query Processor written in Rust
    Joel Mackenzie, Matthias Petri, Luke Gallagher
    DESIRES, 2022
    📄 paper | 🛠️ code
  • A Static Pruning Study on Sparse Neural Retrievers
    Carlos Lassance, Simon Lupart, Herve Dejean, Stephane Clinchant, Nicola Tonellotto
    SIGIR, 2023
    📄 paper
  • An Approximate Algorithm for Maximum Inner Product Search over Streaming Sparse Vectors
    Sebastian Bruch, Franco Maria Nardini, Amir Ingber, Edo Liberty
    ACM Transactions on Information Systems, 2024
    📄 paper
  • Efficient Document-at-a-time and Score-at-a-time Query Evaluation for Learned Sparse Representations
    Joel Mackenzie, Andrew Trotman, Jimmy Lin
    ACM Transactions on Information Systems, 2023
    📄 paper
  • Optimizing Guided Traversal for Fast Learned Sparse Retrieval
    Yifan Qiao, Yingrui Yang, Haixin Lin, Tao Yang
    WWW, 2023
    📄 paper | 🛠️ code
  • Representation Sparsification with Hybrid Thresholding for Fast SPLADE-based Document Retrieval
    Yifan Qiao, Yingrui Yang, Shanxiu He, Tao Yang
    SIGIR, 2023
    📄 paper | 🛠️ code
  • Results of the Big ANN: NeurIPS'23 competition
    Harsha Vardhan Simhadri, Martin Aumüller, Amir Ingber, Matthijs Douze, George Williams, Magdalen Dobson Manohar, Dmitry Baranchuk, Edo Liberty, Frank Liu, Ben Landrum, Mazin Karjikar, Laxman Dhulipala, Meng Chen, Yue Chen, Rui Ma, Kai Zhang, Yuzheng Cai, Jiayang Shi, Yizhuo Chen, Weiguo Zheng, Zihao Wan, Jie Yin, Ben Huang
    CoRR, 2023
    📄 paper | 🛠️ code
  • Bridging Dense and Sparse Maximum Inner Product Search
    Sebastian Bruch, Franco Maria Nardini, Amir Ingber, Edo Liberty
    ACM Transactions on Information Systems, 2024
    📄 paper
  • Efficient Inverted Indexes for Approximate Retrieval over Learned Sparse Representations
    Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
    SIGIR, 2024
    📄 paper | 🛠️ code
  • Faster Learned Sparse Retrieval with Block-Max Pruning
    Antonio Mallia, Torsten Suel, Nicola Tonellotto
    SIGIR, 2024
    📄 paper | 🛠️ code
  • Cluster-based Partial Dense Retrieval Fused with Sparse Text Retrieval
    Yingrui Yang, Parker Carlson, Shanxiu He, Yifan Qiao, Tao Yang
    SIGIR, 2024
    📄 paper
  • Pairing Clustered Inverted Indexes with κ-NN Graphs for Fast Approximate Retrieval over Learned Sparse Representations
    Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
    CIKM, 2024
    📄 paper | 🛠️ code
  • Threshold-driven Pruning with Segmented Maximum Term Weights for Approximate Cluster-based Sparse Retrieval
    Yifan Qiao, Parker Carlson, Shanxiu He, Yingrui Yang, Tao Yang
    EMNLP, 2024
    📄 paper
  • Foundations of Vector Retrieval
    Sebastian Bruch
    Springer, 2024
    📄 book
  • Dynamic Superblock Pruning for Fast Learned Sparse Retrieval
    Parker Carlson, Wentai Xie, Shanxiu He, Tao Yang
    SIGIR, 2025
    📄 paper | 🛠️ code
  • Efficient Sketching and Nearest Neighbor Search Algorithms for Sparse Vector Sets
    Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
    CoRR, 2025
    📄 paper | 🛠️ code
  • Investigating the Scalability of Approximate Sparse Retrieval Algorithms to Massive Datasets
    Sebastian Bruch, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini, Leonardo Venuta
    ECIR, 2025
    📄 paper | 🛠️ code
  • SINDI: an Efficient Index for Approximate Maximum Inner Product Search on Sparse Vectors
    Ruoxuan Li, Xiaoyao Zhong, Jiabao Jin, Peng Cheng, Wangze Ni, Lei Chen, Zhitao Shen, Wei Jia, Xiangyu Wang, Xuemin Lin, Heng Tao Shen, Jingkuan Song
    CoRR, 2025
    📄 paper | 🛠️ code
  • kANNolo: Sweet and Smooth Approximate k-Nearest Neighbors Search
    Leonardo Delfino, Domenico Erriquez, Silvio Martinico, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
    ECIR, 2025
    📄 paper | 🛠️ code
  • Breaking the Curse of Dimensionality: On the Stability of Modern Vector Retrieval
    Vihan Lakshman, Blaise Munyampirwa, Julian Shun, Benjamin Coleman
    CoRR, 2025
    📄 paper
  • Efficiency Optimizations for Superblock-based Sparse Retrieval
    Parker Carlson, Wentai Xie, Rohil Shah, Tao Yang
    CoRR, 2026
    📄 paper
  • Evaluating the Efficiency and Effectiveness of Learned Sparse Retrieval with the lsr_benchmark
    Maik Fröbe, Ferdinand Schlatt, Cosimo Rulli, Tim Hagen, Jan Heinrich Merker, Gijs Hendriksen, Carlos Lassance, Franco Maria Nardini, Rossano Venturini, Martin Potthast
    ECIR, 2026
    📄 paper | 🛠️ code
  • Forward Index Compression for Learned Sparse Retrieval
    Sebastian Bruch, Martino Fontana, Franco Maria Nardini, Cosimo Rulli, Rossano Venturini
    ECIR, 2026
    📄 paper | 🛠️ code
  • Fast, Compact, Immediate-Access Indexing for Learned Sparse Retrieval Systems
    Billy Rule, Joel Mackenzie
    ECIR, 2026
    📄 paper

Tutorials

Software Libraries

  • lsr-benchmark (Python)
    Framework for evaluating learned sparse retrieval, contrasting efficiency and effectiveness across diverse retrieval scenarios.
  • kANNolo (Python, Rust)
    Library for fast dense and sparse learned retrieval with graph-based indexes.
  • Seismic (Python, Rust)
    State-of-the-art library for fast learned sparse retrieval and indexing, focused on efficient inverted-index structures built on modern research.
  • Vectorium (Python, Rust)
    A library for storing and accessing datasets of dense and sparse vectors, with efficient support for brute-force search and the core operations required by vector indexing data structures.
  • BMP (Block Max Pruning) (Python, Rust)
    Extremely efficient learned sparse retrieval framework with an in-memory block-based inverted index.
  • Sentence Transformers (Python)
    Framework for generating sentence embeddings and sparse encoders for semantic and sparse retrieval in NLP applications.
  • Pyserini (Python)
    IR research toolkit built on Lucene that supports learned sparse retrieval models.
  • NMSLib (Python, C++)
    Non-Metric Space Library (NMSLIB): an efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.
  • OpenSearch (Python, Java)
    Open-source search and analytics engine that supports scalable sparse, dense, and hybrid neural retrieval via plugins and vector extensions.
  • Apache Lucene (Python, Java)
    High-performance Java search library providing inverted indexes and scoring infrastructure that underpins many learned sparse retrieval systems.
  • Qdrant (Python, Rust)
    Open-source vector database supporting dense, sparse, and hybrid retrieval, with native sparse vector indexing based on an inverted index for exact high-dimensional sparse search.
  • FlashRAG (Python)
    A Python toolkit for efficient RAG research.
  • PyTerrier (Python)
    A Python framework for performing information retrieval experiments (supporting PISA and BMP).
  • PISA (C++)
    A modular C++ inverted index and query processing framework supporting many compression codecs and dynamic pruning algorithms.

Datasets and Encodings

MS MARCO v1

  • Documents: 8,841,823
  • Queries [dev.small]: 6,980
  • Reference Metric: MRR@10
| Encoding | Link | Avg non-zero (docs) | Avg non-zero (queries) | MRR@10 |
|---|---|---|---|---|
| splade-cocondenser | link | 119 | 43 | 38.3 |
| efficient-splade | link | 181 | 6 | 38.8 |
| uniCOIL-T5 | link | 68 | 6 | 35.2 |
| splade-v3 | link | 168 | 24 | 40.3 |
| li-lsr-big | link | 387 | 6 | 38.8 |
| laconic-1B | link | 511 | 100 | 37.2 |
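
For readers unfamiliar with the columns used in this section, the following is a minimal sketch of how MRR@10 and the average number of non-zero entries could be computed. The function names (`mrr_at_k`, `avg_nonzero`), the dictionary-based vector layout, and the toy data are assumptions made for illustration; they are not taken from any library listed here. The figures in this section appear to be reported as percentages, that is, the metric multiplied by 100.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]],
             qrels: Dict[str, Set[str]],
             k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document in the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for qid, ranked_docs in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranked_docs[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit contributes
    return total / len(rankings)


def avg_nonzero(vectors: List[Dict[str, float]]) -> float:
    """Average number of non-zero entries per sparse vector (term -> weight)."""
    if not vectors:
        return 0.0
    counts = [sum(1 for w in vec.values() if w != 0.0) for vec in vectors]
    return sum(counts) / len(vectors)


# Toy data (hypothetical): q1 hits at rank 2, q2 at rank 1, q3 misses.
rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d4"], "q3": ["d9"]}
qrels = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d0"}}
print(mrr_at_k(rankings, qrels))                               # 0.5
print(avg_nonzero([{"a": 1.0, "b": 0.5}, {"c": 2.0}]))         # 1.5
```

Fewer non-zero entries generally mean shorter query processing, which is why the tables report sparsity next to effectiveness.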

MS MARCO v2

  • Documents: 138,363,364
  • Queries [dev1.small]: 3,903
  • Reference Metric: MRR@10
| Encoding | Link | Avg non-zero (docs) | Avg non-zero (queries) | MRR@10 |
|---|---|---|---|---|
| splade-cocondenser | link | 127 | 44 | 10.88 |

NQ

  • Documents: 2,680,893
  • Queries: 3,452
  • Reference Metric: nDCG@10
| Encoding | Link | Avg non-zero (docs) | Avg non-zero (queries) | nDCG@10 |
|---|---|---|---|---|
| splade-cocondenser | link | 153 | 51 | 53.9 |
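
As a rough illustration of the reference metric used for this collection, here is a small sketch of nDCG@10 for a single query. It assumes linear gains (the relevance grade itself) with a log2 rank discount; some evaluation tools use exponential gains instead, so absolute values can differ. The function names and toy judgments are hypothetical.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked_docs: List[str], grades: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query; `grades` maps doc id -> graded relevance."""
    gains = [grades.get(doc_id, 0.0) for doc_id in ranked_docs[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example (hypothetical judgments): relevant docs at ranks 1 and 3.
score = ndcg_at_k(["d1", "d2", "d3"], {"d1": 1.0, "d3": 1.0})
print(round(score, 4))
```

A collection-level score is the mean of the per-query values.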

LoTTE-pooled

  • Documents: 2,428,854
  • Queries [dev/search]: 2,931
  • Reference Metric: Success@5
| Encoding | Link | Avg non-zero (docs) | Avg non-zero (queries) | Success@5 |
|---|---|---|---|---|
| splade-cocondenser | N/A | N/A | N/A | 69.0 |
| li-lsr-big | link | 469 | 9 | 65.7 |
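
Success@5 is less commonly seen than MRR or nDCG, so a short sketch may help: it is the fraction of queries with at least one relevant document among the top 5 results. The function name and toy data below are assumptions for illustration only.

```python
from typing import Dict, List, Set


def success_at_k(rankings: Dict[str, List[str]],
                 qrels: Dict[str, Set[str]],
                 k: int = 5) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not rankings:
        return 0.0
    hits = 0
    for qid, ranked_docs in rankings.items():
        relevant = qrels.get(qid, set())
        if any(doc_id in relevant for doc_id in ranked_docs[:k]):
            hits += 1
    return hits / len(rankings)


# Toy data (hypothetical): q1 succeeds, q2 has its relevant doc at rank 6.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d4", "d5", "d6", "d7", "d8", "d9"],
}
qrels = {"q1": {"d2"}, "q2": {"d9"}}
print(success_at_k(rankings, qrels))  # 0.5
```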

Quora

  • Documents: 522,931
  • Queries [test]: 10,000
  • Reference Metric: nDCG@10
| Encoding | Link | Avg non-zero (docs) | Avg non-zero (queries) | nDCG@10 |
|---|---|---|---|---|
| splade-v3 | link | 40 | 36 | 81.4 |

Other Collections

List Maintainers (alphabetical order)

Franco Maria Nardini (ISTI-CNR, Pisa, Italy)
Cosimo Rulli (ISTI-CNR, Pisa, Italy)
Rossano Venturini (University of Pisa, Italy)

Other Contributors

Joel Mackenzie (The University of Queensland, Australia)
