BigVectorBench advances vector database benchmarking by defining and evaluating the embedding performance of heterogeneous data and abstracting compound queries, which can be multimodal or single-modal with fine-grained restrictions, for real-world applications.

BigVectorBench


BigVectorBench is a benchmark suite designed to evaluate the performance of vector databases comprehensively. The project grew out of the realization that existing benchmarks fall short in assessing critical capabilities of vector databases, particularly handling heterogeneous data embeddings and executing compound queries. BigVectorBench fills this evaluation gap with a comprehensive framework for measuring the efficiency and capacity of vector databases in real-world scenarios.

Install

Clone the repository

git clone https://github.com/BenchCouncil/BigVectorBench.git

Set Up the Environment

Tested on: Ubuntu 20.04

  • Docker Engine 27.x

  • Python 3.10 and dependencies

    • Create a new conda environment and install dependencies from requirements.txt:
conda create -n bigvectorbench python=3.10
conda activate bigvectorbench
pip3 install -r requirements.txt

Build Docker Images for Databases

The command below builds the Docker images for all databases/algorithms used in BigVectorBench.

python install.py

Arguments:

  • --proc NUM: the number of processes used to build Docker images (default: 1, build serially)
  • --algorithm NAME: build only the Docker image for the named algorithm (default: None, build all images)
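
For example, to build all images with four processes in parallel:

python install.py --proc 4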

Issues:

  • If building a Docker image fails because of unreachable URLs, try modifying the corresponding Dockerfile to install the required packages from mirrors.

Run Benchmarks

python run.py

Arguments:

  • --dataset NAME: run the benchmark on the named dataset (default: glove-100-angular)
  • -k COUNT, --count COUNT: the number of near neighbours to search for (default: 10)
  • --runs COUNT: run each algorithm instance COUNT times and use only the best result (default: 5)
  • --algorithm NAME: run only the named algorithm (default: None, run all)
  • --cpuset-cpus CPUSET: the CPUs on which the container is allowed to run (e.g., 0-2 or 0,1); only effective in Docker mode (default: None, run on all CPUs)
  • --memory MEMORY: the memory limit for Docker containers; only effective in Docker mode (default: None, use all available memory)

Example:

python run.py --dataset app_reviews-384-euclidean-filter --count 100 --runs 3 --algorithm milvus-hnsw --cpuset-cpus 0-15 --memory 64g

Supported Databases and Algorithms

  • Milvus
    • milvus-flat
    • milvus-ivfflat
    • milvus-ivfsq8
    • milvus-ivfpq
    • milvus-hnsw
    • milvus-scann
    • milvus-gpu-bf
    • milvus-gpu-ivfflat
    • milvus-gpu-ivfpq
    • milvus-gpu-cagra
  • Weaviate
    • weaviate-flat
    • weaviate-hnsw
  • Qdrant
    • qdrant
  • Vearch
    • vearch-flat
    • vearch-biivf
    • vearch-ivfflat
    • vearch-ivfpq
    • vearch-hnsw
  • Redis
    • redis-flat
    • redis-hnsw
  • Elasticsearch
    • elasticsearch-hnsw

TODO

Use-cases for Compound Queries

We have summarized corresponding use-cases for filter queries, multi-modal queries, multi-vector queries, and big queries, including E-commerce recommendation, keyframe recommendation, cross-modal text-to-image search, and long text search, as shown in the figure below.

[Figure: use-cases for compound queries]

  1. E-commerce recommendation is a fundamental use-case of vector retrieval, where product reviews are embedded into vectors by a text embedding model, such as all-MiniLM-L12-v2, and combined with attribute filters such as product ratings. This is a key function in e-commerce applications like Amazon Books. Books can be recommended between users who have given similar reviews and ratings to a particular title. When generating personalized recommendations, the recommendation engine not only retrieves similar book reviews but also applies filters based on book ratings, significantly enhancing the relevance and efficacy of the recommendations.
  2. Keyframe recommendation is an important use-case of vector retrieval. Keyframes are essential to video-based applications such as Twitch: they are extracted from the original video to precisely summarize its content. Keyframe vectors derived from a user's viewing history enable these applications to recommend videos or movies containing keyframes similar to those previously viewed, thereby providing more relevant content and enhancing recommendation accuracy based on past interactions.
  3. Cross-modal text-to-image search is a pivotal use-case of vector retrieval, where both text and images are converted into vectors using multi-modal embedding models like ImageBind. This use-case is increasingly employed in popular social networking platforms such as TikTok and REDnote. Users can input a textual description, for instance one detailing a historical event or an artwork, and the application then retrieves and displays images that visually match the described content. This approach not only makes the information more accessible but also enhances its visual appeal, facilitating a more intuitive understanding of the search results.
  4. Long text search represents an emerging use-case of vector retrieval, in which single texts exceeding 1,024 tokens are embedded into vectors with more than 768 dimensions using models like Jina Embeddings v2 or OpenAI embeddings. Long text search is commonly used in literature retrieval applications, allowing researchers to input entire documents and quickly locate the top-k academic publications by performing approximate nearest neighbor searches on the high-dimensional vector representations stored in vector databases.

Datasets

D1: arXiv and PubMed

The dataset D1 consists of abstracts from scientific papers sourced from arXiv and PubMed. It is utilized for tasks related to the summarization of scientific literature, enabling researchers and students to grasp the content of academic papers quickly. Accessible at Hugging Face - arXiv Summarization and Hugging Face - Scientific Papers.

D2: ImageNet

Dataset D2 is a subset of ImageNet, containing thousands of annotated images. It is widely used for image recognition, classification, and training deep learning models. This dataset can be found at Hugging Face - Tiny ImageNet.

D3: SQuAD 2.0

Dataset D3, SQuAD 2.0, is a reading comprehension dataset with questions on Wikipedia articles. It includes answerable questions from SQuAD 1.1 and new, unanswerable questions that resemble answerable ones. Systems must identify answerable questions and acknowledge when no answer exists in the text. Accessible at Hugging Face - SQuAD v2.

D4: img-wikipedia

Dataset D4 is a simplified version of the Wikipedia image dataset, containing 489,116 images from Wikipedia articles along with their corresponding text descriptions. The dataset can be accessed at Hugging Face - img-wikipedia-simple.

D5: LibriSpeech ASR

Dataset D5 is the LibriSpeech Automatic Speech Recognition (ASR) dataset, derived from read audiobooks of the LibriVox project, consisting of approximately 1,000 hours of 16kHz English speech, carefully segmented and aligned for Automatic Speech Recognition (ASR) research and development. The dataset is available at Hugging Face - LibriSpeech ASR.

Workloads

The workloads are available at link. They are stored in the HDF5 format and divided into the following categories:
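
As a quick way to inspect a downloaded workload, the sketch below opens one of the HDF5 files with h5py. The key names ("train" and "test") follow the usual ann-benchmarks convention and are an assumption here; listing the keys first shows what the file actually contains.

import h5py

# Hypothetical file name; substitute the workload you downloaded.
with h5py.File("app_reviews-384-euclidean-filter.hdf5", "r") as f:
    print(list(f.keys()))      # show which datasets the file actually provides
    train = f["train"][:]      # assumed key: base vectors to be indexed
    test = f["test"][:]        # assumed key: query vectors
    print(train.shape, test.shape)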

Filter Queries

W1: ag_news

The Workload W1, designated as ag_news, is an extensive compilation of over 1 million news articles, aggregated from a diverse array of more than 2,000 news sources by ComeToMyHead, an academic news search engine in operation since July 2004. These articles are meticulously categorized into four primary themes: World, Sports, Business, and Science/Tech, rendering the dataset an invaluable asset for evaluating the performance of text classification algorithms. The dataset can be accessed at Hugging Face - ag_news.

Each news article is represented as a 384-dimensional vector produced by the all-MiniLM-L12-v2 embedding model. Each vector carries a classification label ID that is used for filtering during queries, with an average filtering ratio of approximately 25%. The Euclidean distance metric is used to measure the similarity between news articles.
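
A minimal sketch of this embedding step, using the sentence-transformers package; the exact preprocessing used to build the workload may differ.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L12-v2")
# Encode a news article into a 384-dimensional vector.
vector = model.encode("Stocks rallied after the central bank held interest rates steady.")
print(vector.shape)  # (384,)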

W2: cc_news

The original CC-News dataset includes 708,241 English news articles from 2017 to 2019, prepared with news-please and stored in the Common Crawl’s AWS S3 bucket. The raw dataset can be accessed at Hugging Face - cc_news.

In the CC-News dataset, each article is accompanied by a timestamp. Our temporal distribution analysis showed that the data was unevenly distributed before January 1, 2017, but exhibited uniform density from that date onward. Based on this, we selected 630,643 articles from this consistent period and randomly drew 10,000 as our test set. These articles were then transformed into 384-dimensional vectors using the all-MiniLM-L12-v2 embedding model, with each vector paired with its corresponding timestamp as a label.

In our approach to querying the CC-News dataset, we utilize the Euclidean distance metric to measure the similarity between articles. The filtering process is designed to retrieve articles that fall within a three-day window prior to the timestamp specified in the query. This temporal constraint aids in focusing the search on the most relevant and recent articles. Our analysis of the test set reveals that using the timestamp as a filtering criterion, 83% of the test queries result in a filtering ratio that exceeds 10%, 12% of these queries have a filtering ratio that is above 1% but does not exceed 10%, and the remaining queries have a filtering ratio of less than 1%.
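
The sketch below illustrates the three-day window and how a filtering ratio can be computed for a single query; it assumes timestamps are stored as Unix seconds, which is our assumption for illustration rather than part of the workload specification.

import numpy as np

SECONDS_PER_DAY = 86_400

def filtering_ratio(article_timestamps: np.ndarray, query_timestamp: int) -> float:
    """Fraction of articles falling in the three-day window before the query timestamp."""
    lower = query_timestamp - 3 * SECONDS_PER_DAY
    mask = (article_timestamps >= lower) & (article_timestamps <= query_timestamp)
    return float(mask.mean())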

W3: app_reviews

The raw dataset app_reviews, available at Hugging Face - app_reviews, offers a comprehensive view of user feedback and code metric evolution for 395 Android apps across 23 categories from the F-Droid repository, encompassing 600 versions and 287,936 user reviews.

We developed the app_reviews workload with a triplet of filtering criteria—timestamp, review length, and star rating—randomly selecting 10,000 entries for our test set. The review texts were transformed into 384-dimensional vectors using the all-MiniLM-L12-v2 model, with Euclidean distance as the metric for review similarity.

We filter reviews by timestamp to match the app’s latest version, considering only the past 30 days to align with the monthly update cycle. Review length is filtered to within ±30 of the query review’s length. Star ratings are filtered according to the mapping: 1 → {1, 2}, 2 → {1, 2}, 3 → {3, 4}, 4 → {3, 4, 5}, 5 → {4, 5}.
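
A minimal sketch of this triple filter as a plain-Python predicate; the field names, Unix-second timestamps, and the interpretation of the length filter as ±30 of the query review's length are our assumptions for illustration.

THIRTY_DAYS = 30 * 86_400
STAR_FILTER = {1: {1, 2}, 2: {1, 2}, 3: {3, 4}, 4: {3, 4, 5}, 5: {4, 5}}

def matches(review: dict, query: dict) -> bool:
    """True if a stored review satisfies the timestamp, length, and star-rating constraints."""
    return (
        query["timestamp"] - THIRTY_DAYS <= review["timestamp"] <= query["timestamp"]
        and abs(review["length"] - query["length"]) <= 30
        and review["stars"] in STAR_FILTER[query["stars"]]
    )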

In terms of filtering ratios, our analysis shows that 52% of the reviews have a filtering ratio greater than 10%, 33% have a filtering ratio between 1% and 10%, and the remaining reviews have a filtering ratio of less than 1%.

W4: amazon_books

This dataset is an updated version of the Amazon review dataset released in 2014. As in the previous version, this dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features). The dataset can be accessed at Amazon Reviews Dataset.

In the “Books” category, 16 million data records were processed. After filtering for valid labels and the application of sklearn’s train_test_split method, a training dataset comprising 15,928,208 entries and a test dataset consisting of 10,000 entries were extracted. Subsequently, book reviews were converted into 384-dimensional vectors through the utilization of the all-MiniLM-L12-v2 embedding model.

Each vector was then associated with its corresponding metadata as labels: timestamp, text length, overall rating, helpfulness votes, and Amazon Standard Identification Number (ASIN). Timestamps are in Unix time, and the past 7 days is used as the filtering window. Text length is filtered to within ±30 of the query review’s length. The overall rating mapping rule is: 1 → {1, 2}, 2 → {1, 2}, 3 → {3, 4}, 4 → {3, 4, 5}, 5 → {4, 5}. The ASIN is the book identifier; identifiers are partitioned into intervals such that 10% of all identifiers fall into the same category of books.

In terms of query filtering ratios within this dataset, 40% of the queries have a filtering ratio between 0.1% and 1%, while the remaining 60% of the queries have a filtering ratio between 0.01% and 0.1%.

Dataset | Data / Query Points | Labels | Embedding Model | Dimension | Distance | Download | Raw Data
ag_news-384-euclidean-filter | 120,000 / 7,600 | 1 | all-MiniLM-L12-v2 | 384 | Euclidean | link1, link2 | ag_news
cc_news-384-euclidean-filter | 620,643 / 10,000 | 1 | all-MiniLM-L12-v2 | 384 | Euclidean | link1, link2 | cc_news
app_reviews-384-euclidean-filter | 277,936 / 10,000 | 3 | all-MiniLM-L12-v2 | 384 | Euclidean | link1, link2 | app_reviews
amazon-384-euclidean-5filter | 15,928,208 / 10,000 | 5 | all-MiniLM-L12-v2 | 384 | Euclidean | link1, link2 | amazon

Multi-Modal Queries

W5: img-wikipedia

The raw dataset, available at Hugging Face - img-wikipedia-simple, contains 489,116 images from Wikipedia articles along with their corresponding text descriptions. From this dataset, we randomly selected 10,000 entries to serve as the test set, while the remaining 479,116 entries were designated as the training set. These entries were then embedded into 1,024-dimensional vectors using the ImageBind model.

W6: librispeech_asr

The LibriSpeech Automatic Speech Recognition (ASR) dataset, available at Hugging Face - librispeech_asr, is derived from read audiobooks of the LibriVox project, consisting of approximately 1,000 hours of 16kHz English speech, carefully segmented and aligned for Automatic Speech Recognition (ASR) research and development.

W7: gpt4vision

The raw dataset 220k-GPT4Vision-captions-from-LIVIS, available at Hugging Face - 220k-GPT4Vision-captions-from-LIVIS, consists of 217,868 captioned images sourced from the LVIS dataset. These captions were derived by summarizing the LVIS-Instruct4V dataset, with the instructions translated into captions using the Mistral-7B-OpenOrca model. The captions are designed to be precise and factual, reflecting the details of the images as discussed in a chatbot-user dialogue.

From this dataset, we randomly selected 10,000 entries to serve as the test set, while the remaining 207,868 entries were designated as the training set. These entries were then embedded into 1,024-dimensional vectors using the ImageBind model.

Dataset | Data / Query Points | Modal | Embedding Model | Dimension | Distance | Download | Raw Data
img-wikipedia-1024-euclidean-mm | 479,116 / 10,000 | image | ImageBind | 1024 | Euclidean | link1, link2 | img-wikipedia
librispeech_asr-1024-euclidean-mm | 104,014 / 2,620 | audio | ImageBind | 1024 | Euclidean | link1, link2 | librispeech_asr
gpt4vision-1024-euclidean-mm | 207,868 / 10,000 | image | ImageBind | 1024 | Euclidean | link1, link2 | gpt4vision

Multi-Vector Queries

W8: webvid

The WebVid dataset is a large-scale text-video dataset, meticulously curated to include 100,000 training samples and 10,000 testing samples. For each video in the dataset, we uniformly sampled four frames based on the video's length, ensuring that the frames are representative of the video's content across its duration. Each of these frames was then embedded using the CLIP-ViT-B-16 model to produce four 512-dimensional vectors. The original dataset is available at Hugging Face - webvid-10M.
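
A rough sketch of this frame-sampling and embedding step, assuming OpenCV for decoding and the sentence-transformers CLIP checkpoint clip-ViT-B-16; the actual pipeline used to build the workload may differ.

import cv2
from PIL import Image
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("clip-ViT-B-16")

def embed_video(path: str, num_frames: int = 4):
    """Uniformly sample num_frames frames from a video and embed each as a 512-dim CLIP vector."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(num_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // num_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return model.encode(frames)  # shape: (num_frames, 512)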

Dataset | Data / Query Points | Vectors | Embedding Model | Dimension | Distance | Download | Raw Data
webvid-4-512-euclidean | 1,000,000 / 10,000 | 4 | CLIP-ViT-B-16 | 512 | Euclidean | link1, link2 | webvid

Big Queries

W9: dbpedia-entities (1536 dimensions)

The dataset at Hugging Face - dbpedia-entities-openai3-text-embedding-3-large-1536-1M contains 1 million 1536-dimensional embeddings generated from the first million entries of the dbpedia-entity dataset using OpenAI’s text-embedding-3-large model. To facilitate testing, we randomly selected 10,000 of these embeddings to form the test set.

W10: dbpedia-entities (3072 dimensions)

The dataset at Hugging Face - dbpedia-entities-openai3-text-embedding-3-large-3072-1M contains 1 million 3072-dimensional embeddings generated from the first million entries of the dbpedia-entity dataset using OpenAI’s text-embedding-3-large model. To facilitate testing, we randomly selected 10,000 of these embeddings to form the test set.
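
A minimal sketch of producing a compatible query vector with the OpenAI Python client; the dimensions parameter of the text-embedding-3 models selects the 1,536-dimensional variant (omit it, or pass 3072, for the full-size embeddings used by W10). Requires an API key and incurs usage cost.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.embeddings.create(
    model="text-embedding-3-large",
    input="Berlin is the capital and largest city of Germany.",
    dimensions=1536,
)
vector = response.data[0].embedding
print(len(vector))  # 1536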

Dataset | Data / Query Points | Embedding Model | Dimension | Distance | Download | Raw Data
dbpedia-entities-openai3-text-embedding-3-large-1536-1000k-euclidean | 990,000 / 10,000 | OpenAI text-embedding-3-large | 1536 | Euclidean | link1, link2 | dbpedia-entities
dbpedia-entities-openai3-text-embedding-3-large-3072-1000k-euclidean | 990,000 / 10,000 | OpenAI text-embedding-3-large | 3072 | Euclidean | link1, link2 | dbpedia-entities

Contributing

For the development of BigVectorBench, we welcome contributions from the community. If you are interested in contributing to this project, please follow the Guidelines for Contributing.

Results (as of July 2024)

  • ag_news-384-euclidean vs ag_news-384-euclidean-filter
  • cc_news-384-euclidean vs cc_news-384-euclidean-filter
  • app_reviews-384-euclidean vs app_reviews-384-euclidean-filter
  • amazon-384-euclidean-5 vs amazon-384-euclidean-5filter
  • img-wikipedia-1024-euclidean-mm
  • librispeech_asr-1024-euclidean-mm
  • gpt4vision-1024-euclidean-mm
  • webvid-4-512-euclidean
  • dbpedia-entities-openai3-text-embedding-3-large-1536-1000k-euclidean vs dbpedia-entities-openai3-text-embedding-3-large-3072-1000k-euclidean

Acknowledgements

Citation
