Pyversity
Fast Diversification for Search & Retrieval

Pyversity is a fast, lightweight library for diversifying retrieval results. Retrieval systems often return highly similar items. Pyversity efficiently re-ranks these results to encourage diversity, surfacing items that remain relevant but less redundant.

It implements several popular diversification strategies, such as MMR, MSD, DPP, COVER, and SSD, with a clear, unified API. More information about each can be found in the supported strategies section. The only dependency is NumPy, making the package very lightweight.

Quickstart

Install pyversity with:

pip install pyversity

Diversify retrieval results:

import numpy as np
from pyversity import diversify, Strategy

# Define embeddings and scores (e.g. cosine similarities of a query result)
embeddings = np.random.randn(100, 256)
scores = np.random.rand(100)

# Diversify the result
diversified_result = diversify(
    embeddings=embeddings,
    scores=scores,
    k=10, # Number of items to select
    strategy=Strategy.MMR, # Diversification strategy to use
    diversity=0.5 # Diversity parameter (higher values prioritize diversity)
)

# Get the indices of the diversified result
diversified_indices = diversified_result.indices

The returned DiversificationResult exposes the diversified indices, the selection_scores assigned by the chosen strategy, and other useful info. The strategies are extremely fast and scalable: this example runs in milliseconds.
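
For example, the fields mentioned above can be read straight off the result:

# Inspect the result
print(diversified_result.indices)           # Indices of the selected items
print(diversified_result.selection_scores)  # Selection scores from the strategy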

The diversity parameter tunes the trade-off between relevance and diversity: 0.0 focuses purely on relevance (no diversification), while 1.0 maximizes diversity, potentially at the cost of relevance.
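
As a quick sanity check, you can sweep diversity and watch the selection drift away from the pure-relevance top-k. This is a sketch reusing the quickstart arrays above:

# Compare selections at different diversity levels against the pure-relevance top-10
top_by_relevance = set(np.argsort(scores)[::-1][:10])
for d in (0.0, 0.5, 1.0):
    result = diversify(embeddings=embeddings, scores=scores, k=10,
                       strategy=Strategy.MMR, diversity=d)
    overlap = len(top_by_relevance & set(result.indices))
    print(f"diversity={d}: overlap with pure relevance = {overlap}/10")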

Supported Strategies

The following table describes the supported strategies, how they work, their time complexity, and when to use them. The papers linked in the references section provide more in-depth information on the strengths/weaknesses of the supported strategies.

Strategy | What It Does | Time Complexity | When to Use
--- | --- | --- | ---
MMR (Maximal Marginal Relevance) | Keeps the most relevant items while down-weighting those too similar to what's already picked. | O(k · n · d) | Good default. Fast, simple, and works well when you just want to avoid near-duplicates.
MSD (Max Sum of Distances) | Prefers items that are both relevant and far from all previous selections. | O(k · n · d) | Use when you want a stronger spread, i.e. results that cover a wider range of topics or styles.
DPP (Determinantal Point Process) | Samples diverse yet relevant items using probabilistic "repulsion." | O(k · n · d + n · k²) | Ideal when you want to eliminate redundancy or ensure diversity is built into the selection itself.
COVER (Facility-Location) | Ensures selected items collectively represent the full dataset's structure. | O(k · n²) | Great for topic coverage or clustering scenarios, but slower for large n.
SSD (Sliding Spectrum Decomposition) | Sequence-aware diversification: rewards novelty relative to recently shown items. | O(k · n · d) | Great for content feeds and infinite scroll (social/news/product feeds consumed sequentially), and for conversational RAG to avoid showing similar chunks within the recent window.
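
Since all strategies share the same diversify entry point, trying them side by side is straightforward. A small sketch on synthetic data (SSD is omitted here because it is sequence-aware and typically takes recent_embeddings, as shown in the examples below):

import numpy as np
from pyversity import diversify, Strategy

embeddings = np.random.randn(1000, 128)
scores = np.random.rand(1000)

# Run each strategy on the same candidates and compare the selections
for strategy in (Strategy.MMR, Strategy.MSD, Strategy.DPP, Strategy.COVER):
    result = diversify(embeddings=embeddings, scores=scores, k=5,
                       strategy=strategy, diversity=0.5)
    print(strategy, result.indices)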

Motivation

Traditional retrieval systems rank results purely by relevance (how closely each item matches the query). While effective, this can lead to redundancy: top results often look nearly identical, which can create a poor user experience.

Diversification techniques like MMR, MSD, COVER, and DPP help balance relevance and variety. Each new item is chosen not only because it’s relevant, but also because it adds new information that wasn’t already covered by earlier results.

This improves exploration, user satisfaction, and coverage across many domains, for example:

  • E-commerce: Show different product styles, not multiple copies of the same product.
  • News search: Highlight articles from different outlets or viewpoints.
  • Academic retrieval: Surface papers from different subfields or methods.
  • RAG / LLM contexts: Avoid feeding the model near-duplicate passages.
  • Recommendation feeds: Keep content diverse and engaging over time.

Examples

The following examples illustrate how to apply different diversification strategies in various scenarios.

Product / Web Search — Simple diversification with MMR or DPP

MMR and DPP are great general-purpose diversification strategies. They are fast, easy to use, and work well in many scenarios. For example, in a product search setting where you want to show diverse items to a user, you can diversify the top results as follows:

from pyversity import diversify, Strategy

# Suppose you have:
# - item_embeddings: embeddings of the retrieved products
# - item_scores: relevance scores for these products

# Re-rank with MMR
result = diversify(
    embeddings=item_embeddings,
    scores=item_scores,
    k=10,
    strategy=Strategy.MMR,
)
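
Switching to DPP only changes the strategy argument:

# Re-rank with DPP instead
result = diversify(
    embeddings=item_embeddings,
    scores=item_scores,
    k=10,
    strategy=Strategy.DPP,
)
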
Literature Search — Represent the full topic space with COVER

COVER (Facility-Location) is well-suited for scenarios where you want to ensure that the selected items collectively represent the entire dataset’s structure. For instance, when searching for academic papers on a broad topic, you might want to cover various subfields and methodologies:

from pyversity import diversify, Strategy

# Suppose you have:
# - paper_embeddings: embeddings of the retrieved papers
# - paper_scores: relevance scores for these papers

# Re-rank with COVER
result = diversify(
    embeddings=paper_embeddings,
    scores=paper_scores,
    k=10,
    strategy=Strategy.COVER,
)
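
The returned indices point back into the candidate list, so with a parallel list of paper records (a hypothetical papers list here) the re-ranked papers are one comprehension away:

# papers is a hypothetical list aligned with paper_embeddings
diversified_papers = [papers[i] for i in result.indices]
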
Conversational RAG — Avoid redundant chunks with SSD

In retrieval-augmented generation (RAG) for conversational AI, it’s crucial to avoid feeding the model redundant or similar chunks of information within the recent conversation context. The SSD (Sliding Spectrum Decomposition) strategy is designed for sequence-aware diversification, making it ideal for this use case:

import numpy as np
from pyversity import diversify, Strategy

# Suppose you have:
# - chunk_embeddings: embeddings of the chunks retrieved this turn
# - chunk_scores: relevance scores for these chunks
# - recent_chunk_embeddings: embeddings of chunks shown in the last few turns (oldest→newest)

# Re-rank with SSD (sequence-aware)
result = diversify(
    embeddings=chunk_embeddings,
    scores=chunk_scores,
    k=10,
    strategy=Strategy.SSD,
    recent_embeddings=recent_chunk_embeddings,
)

# Maintain the rolling context window for recent chunks
recent_chunk_embeddings = np.vstack([recent_chunk_embeddings, chunk_embeddings[result.indices]])
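
In long conversations you will likely want to cap this window. A simple bound (the window size below is an assumption; tune it per application):

WINDOW_SIZE = 50  # assumed window size; tune for your application
recent_chunk_embeddings = recent_chunk_embeddings[-WINDOW_SIZE:]
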
Infinite Scroll / Recommendation Feed — Sequence-aware novelty with SSD

In content feeds or infinite scroll scenarios, users consume items sequentially. To keep the experience engaging, it’s important to introduce novelty relative to recently shown items. The SSD strategy is well-suited for this:

import numpy as np
from pyversity import diversify, Strategy

# Suppose you have:
# - feed_embeddings: embeddings of candidate items for the feed
# - feed_scores: relevance scores for these items
# - recent_feed_embeddings: embeddings of recently shown items in the feed (oldest→newest)

# Sequence-aware re-ranking with Sliding Spectrum Decomposition (SSD)
result = diversify(
    embeddings=feed_embeddings,
    scores=feed_scores,
    k=10,
    strategy=Strategy.SSD,
    recent_embeddings=recent_feed_embeddings,
)

# Maintain the rolling context window for recent items
recent_feed_embeddings = np.vstack([recent_feed_embeddings, feed_embeddings[result.indices]])

Single Long Document — Pick diverse sections with MSD

When summarizing or extracting information from a single long document, it’s beneficial to select sections that are both relevant and cover different parts of the document. The MSD strategy helps achieve this by preferring items that are far apart from each other:

from pyversity import diversify, Strategy

# Suppose you have:
# - doc_chunk_embeddings: embeddings of document chunks
# - doc_chunk_scores: relevance scores for these chunks

# Re-rank with MSD
result = diversify(
    embeddings=doc_chunk_embeddings,
    scores=doc_chunk_scores,
    k=10,
    strategy=Strategy.MSD,
)
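
Note that the selected indices need not follow the document's reading order; if you want to present the chosen chunks in their original order, sort them first:

# Present the selected chunks in their original document order
chunks_in_doc_order = sorted(result.indices)
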
Advanced Usage — Customizing Strategies

Some strategies support additional parameters. For example, the SSD (Sliding Spectrum Decomposition) strategy allows you to provide recent_embeddings to encourage novelty relative to recently shown items. In addition to calling the supported strategies via the diversify function, you can also call them directly. For example, using SSD directly:

from pyversity import ssd
import numpy as np

items_to_select = 10 # Number of items to select

new_embeddings = np.random.randn(100, 256) # Embeddings of candidate items
new_scores = np.random.rand(100) # Relevance scores of candidate items
recent_embeddings = np.random.randn(items_to_select, 256) # Embeddings of recently shown items

# Sequence-aware diversification with SSD
result = ssd(
    embeddings=new_embeddings,
    scores=new_scores,
    k=items_to_select, # Number of items to select
    diversity=0.5,  # Diversity parameter (higher values prioritize diversity)
    recent_embeddings=recent_embeddings,  # Embeddings of recently shown items
    # More SSD-specific parameters can be set as needed
)

# Update the rolling context window: append the newly selected items, keep only the most recent
recent_embeddings = np.vstack([recent_embeddings, new_embeddings[result.indices]])[-items_to_select:]
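
The other strategies can be called directly in the same way. Assuming they are exported analogously (the mmr import below is an assumption based on the ssd export above):

from pyversity import mmr  # assumed to be exported analogously to ssd

result = mmr(
    embeddings=new_embeddings,
    scores=new_scores,
    k=items_to_select,
    diversity=0.5,
)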

References

The implementations in this package are based on the following research papers:

  • MMR: Carbonell, J., & Goldstein, J. (1998). The use of MMR, diversity-based reranking for reordering documents and producing summaries. Link

  • MSD: Borodin, A., Lee, H. C., & Ye, Y. (2012). Max-sum diversification, monotone submodular functions and dynamic updates. Link

  • COVER: Puthiya Parambath, S. A., Usunier, N., & Grandvalet, Y. (2016). A coverage-based approach to recommendation diversity on similarity graph. Link

  • DPP: Kulesza, A., & Taskar, B. (2012). Determinantal Point Processes for Machine Learning. Link

  • DPP (efficient greedy implementation): Chen, L., Zhang, G., & Zhou, H. (2018). Fast greedy MAP inference for determinantal point process to improve recommendation diversity. Link

  • SSD: Huang, Y., Wang, W., Zhang, L., & Xu, R. (2021). Sliding Spectrum Decomposition for Diversified Recommendation. Link

Author

Thomas van Dongen

Citation

If you use Pyversity in your research, please cite the following:

@software{van_dongen_2025_pyversity,
  author       = {{van Dongen}, Thomas},
  title        = {Pyversity: Fast Diversification for Search & Retrieval},
  year         = {2025},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17628015},
  url          = {https://github.com/Pringled/pyversity},
  license      = {MIT}
}
