
Addition of High-Dimensional Wikipedia Embedding Datasets (1024/4096/8192D) for enhanced Realistic ANN Benchmarking #591


Conversation

KaranSinghDev

Hi,
While looking into running benchmarks for ANN algorithms, particularly how they perform on embeddings from modern large language models, I noticed that most of the standard datasets currently available in ann-benchmarks (SIFT, GIST, etc.) have relatively low dimensionality. Those are great, but they don't fully capture the challenges of working with the very high-dimensional vectors (thousands of dimensions) that today's transformer models commonly produce.

To help bridge this gap and provide more realistic test cases, I thought it would be useful to add support for some new, high-dimensional datasets based on Wikipedia embeddings.

This PR adds support for three new datasets:

  • wikipedia-1024-angular (1024 dimensions)
  • wikipedia-4096-angular (4096 dimensions)
  • wikipedia-8192-angular (8192 dimensions)

What I changed:

  1. ann_benchmarks/datasets.py: Added the entries for these new datasets, pointing to where their HDF5 files will be located (a rough sketch of what such an entry could look like follows after this list).
  2. README.md: Updated the Datasets table in the README to include these new Wikipedia datasets, mentioning their dimensions and adding placeholder links for download (I'll update these links once the datasets are finalized and hosted!).
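
For reference, here is a minimal sketch of what such an entry could look like, assuming the DATASETS registry and the write_output() helper already present in ann_benchmarks/datasets.py; the wikipedia() generator and the .npy file name are hypothetical placeholders:

    import numpy

    # Hypothetical generator for the Wikipedia embedding datasets (would live in
    # ann_benchmarks/datasets.py alongside the existing generators). It assumes
    # the raw embeddings were produced separately and saved as a .npy matrix;
    # write_output() then splits them into train/test sets and writes the HDF5 file.
    def wikipedia(out_fn, dim):
        embeddings = numpy.load("wikipedia-embeddings-%d.npy" % dim)  # assumed local file
        test_size = 10000
        train, test = embeddings[:-test_size], embeddings[-test_size:]
        write_output(train, test, out_fn, "angular")

    # Entries added to the existing DATASETS registry (dataset name -> callable
    # that writes the corresponding HDF5 file to the given path).
    DATASETS.update({
        "wikipedia-1024-angular": lambda out_fn: wikipedia(out_fn, 1024),
        "wikipedia-4096-angular": lambda out_fn: wikipedia(out_fn, 4096),
        "wikipedia-8192-angular": lambda out_fn: wikipedia(out_fn, 8192),
    })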

Potential Benefit

These datasets (each with ~1 billion vectors) should allow for more challenging and representative benchmarking of ANN algorithms, particularly testing how they scale and perform in the high-dimensional scenarios common with modern AI models.

Users should be able to use these like any other dataset once the data is available, e.g. python run.py --dataset wikipedia-4096-angular .... I made sure the changes are compatible with the existing setup.
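
Once downloaded, a quick way to sanity-check one of the HDF5 files, assuming it follows the usual ann-benchmarks layout (train/test arrays plus a distance attribute; the file name below is illustrative):

    import h5py

    # Inspect a downloaded dataset file; the layout assumed here mirrors what
    # ann-benchmarks typically writes: 'train', 'test', 'neighbors', 'distances'
    # datasets and a 'distance' attribute.
    with h5py.File("wikipedia-4096-angular.hdf5", "r") as f:
        print("train shape:", f["train"].shape)   # e.g. (N, 4096)
        print("test shape:", f["test"].shape)
        print("distance:", f.attrs["distance"])   # expected: 'angular'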

Hope this addition is useful for the project! Nice repo.

@KaranSinghDev changed the title from "Add High-Dimensional Wikipedia Embedding Datasets (1024/4096/8192D) for enhanced Realistic ANN Benchmarking" to "Addition of High-Dimensional Wikipedia Embedding Datasets (1024/4096/8192D) for enhanced Realistic ANN Benchmarking" on May 4, 2025
@erikbern
Owner

erikbern commented May 4, 2025

Hi – thanks for sharing these. In order for us to merge this, we would also want the code to create these datasets. That just makes things more reproducible, especially if people want to do more experiments, tune things, etc.

I agree that modern AI (in particular LLMs) uses much higher dimensionality than when I started building ANN-benchmarks. I think it's hard to push to billions of vectors, though, since then you're talking terabytes of data. That means you need disk-based approaches instead. Which is maybe fine, but the benchmarks would be very expensive to run. So with ann-benchmarks the way things are today, it's probably better to limit datasets to 1-10M vectors.
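
(For reference, a quick back-of-the-envelope on the terabyte figure, assuming one billion float32 vectors; the numbers are approximate:)

    # Rough storage estimate for one billion float32 vectors at the proposed
    # dimensionalities (raw vectors only, before any index overhead).
    for dim in (1024, 4096, 8192):
        total_bytes = 1_000_000_000 * dim * 4               # float32 = 4 bytes
        print("%dd: ~%.1f TB" % (dim, total_bytes / 1e12))  # ~4.1, ~16.4, ~32.8 TB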

@KaranSinghDev
Author

KaranSinghDev commented May 4, 2025

Hi @erikbern,

Thanks so much for the response and feedback! I really appreciate you taking the time to look this over, and I definitely agree with your points.

I also completely understand your point about scale. Pushing to billions of vectors definitely moves into the realm of disk-based ANN approaches, which is probably beyond the current scope of in-memory benchmarks. My initial thought was driven by the very large, high-dimensional outputs of LLMs, but I see how benchmarking them directly here presents significant challenges in terms of data size and the algorithm types involved.

My main motivation was really tackling the dimensionality aspect, as the current datasets don't seem to quite reflect the 1k-8k+ dimensions common with transformers.

With that in mind, maybe we could adapt the idea? Would it be helpful if I:

  1. Focused on providing the dataset generation code?
  2. Generated smaller versions of these high-dimensional Wikipedia datasets (say, in the 1M-20M vector range similar to what you suggested) that would fit the current framework?

If that sounds reasonable, I could potentially update this PR (or create a new one) to include:

  • The cleaned-up generation script (a rough sketch follows below).
  • The dataset entries in datasets.py (for dimensions 1024, 4096, 8192).
  • The updated README table entries.
  • Links to these new, smaller (1M-10M vector) HDF5 datasets once generated and hosted.
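
For concreteness, a rough sketch of what that generation script could look like, assuming the embeddings are precomputed and stored as a float32 .npy matrix; the file names, sizes, and the use of scikit-learn for brute-force ground truth are illustrative, and the exact angular-distance convention would need to match ann-benchmarks' own distance code:

    import numpy as np
    import h5py
    from sklearn.neighbors import NearestNeighbors

    # Sketch: build a smaller high-dimensional Wikipedia dataset in the
    # ann-benchmarks HDF5 format from a precomputed embedding matrix.
    def build_dataset(embeddings_npy, out_fn, n_train=1000000, n_test=10000, k=100):
        X = np.load(embeddings_npy).astype(np.float32)
        X = X[: n_train + n_test]
        train, test = X[:n_train], X[n_train:]

        # Brute-force ground truth; cosine distance is used here as a stand-in
        # for the angular metric and should be aligned with ann-benchmarks'
        # distance definitions before publishing the files.
        nn = NearestNeighbors(n_neighbors=k, metric="cosine", algorithm="brute")
        nn.fit(train)
        distances, neighbors = nn.kneighbors(test)

        with h5py.File(out_fn, "w") as f:
            f.attrs["distance"] = "angular"
            f.create_dataset("train", data=train)
            f.create_dataset("test", data=test)
            f.create_dataset("neighbors", data=neighbors)
            f.create_dataset("distances", data=distances)

    if __name__ == "__main__":
        build_dataset("wikipedia-embeddings-4096.npy", "wikipedia-4096-angular.hdf5")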

This way, the benchmark could gain support for these relevant high dimensions without running into the infrastructure challenges of the 1B vector scale right now.

If creating smaller (e.g., 1M-20M vectors) versions of these high-dimensional datasets (1k-8k dimensions) along with their generation scripts sounds like a potentially useful addition, I'd be happy to work on that for a revised PR.

Let me know what you think! Thank you again for the guidance.
