Description
Project idea 2: DocArray wrap ANN libraries
Info | details |
---|---|
Skills needed | Python, ANN Search experience |
Project size | 175 hours |
Difficulty level | Medium |
Mentors | @Johannes Messner, @Sami Jaghouar, @Philip Vollet |
Project Description
- In DocArray, we have been concentrating on supporting production-ready vector databases for large-scale search. However, there are many ANN libraries without a scalability layer that can also be integrated into DocArray, making DocArray accessible to academic and production teams with small-to-medium amounts of data, without the need for external services.
- DocArray v2 will have a concept called Document Index. This is an abstraction that lets users store their Documents (on disk or in a database) and retrieve them using ANN search. There can be multiple Document Indexes backed by different backends (Elastic, Qdrant, Weaviate, ...), all following the same basic API.
- The idea behind this project is to take an ANN library and use it to implement a Document Index. There is already an implementation using HNSWLib that you can find here: feat: hnswlib document index docarray/docarray#1124. There is room to create similar backends using other libraries: Annoy, Faiss, ... The goal is to give users a choice.
- If there is interest, someone could also implement a backend using a vector database. We already have Qdrant, Weaviate, and Elastic covered, but Milvus, Redis, and some others could also be interesting. You can find a design doc for Document Index here.
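
To make the "same basic API, different backends" idea concrete, here is a minimal sketch in plain Python. It is not the DocArray v2 API: `Doc`, `DocumentIndexSketch`, `BruteForceIndex`, and the `index`/`find` method names are hypothetical and chosen only for illustration. The backend below does exact search with the standard library, and a real contribution would replace the body of `find` with calls into an ANN library such as Annoy or Faiss.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a DocArray Document: an id plus an embedding."""
    id: str
    embedding: Sequence[float]


class DocumentIndexSketch(ABC):
    """Hypothetical common interface that every backend would implement."""

    @abstractmethod
    def index(self, docs: Sequence[Doc]) -> None:
        """Store documents so they can be searched later."""

    @abstractmethod
    def find(self, query: Sequence[float], limit: int = 10) -> List[Tuple[Doc, float]]:
        """Return up to `limit` (document, distance) pairs, nearest first."""


class BruteForceIndex(DocumentIndexSketch):
    """Exact-search backend (assumption: a placeholder for an ANN library wrapper)."""

    def __init__(self) -> None:
        self._docs: Dict[str, Doc] = {}

    def index(self, docs: Sequence[Doc]) -> None:
        # Re-indexing a document with an existing id overwrites it.
        for doc in docs:
            self._docs[doc.id] = doc

    def find(self, query: Sequence[float], limit: int = 10) -> List[Tuple[Doc, float]]:
        # Compute the Euclidean distance to every stored document, then sort.
        scored = [(doc, math.dist(query, doc.embedding)) for doc in self._docs.values()]
        scored.sort(key=lambda pair: pair[1])
        return scored[:limit]


# Usage: callers depend only on the shared interface, not on the backend.
store: DocumentIndexSketch = BruteForceIndex()
store.index([Doc("a", [0.0, 0.0]), Doc("b", [1.0, 1.0]), Doc("c", [5.0, 5.0])])
hits = store.find([0.9, 1.1], limit=2)
print([doc.id for doc, _ in hits])  # ['b', 'a']
```

Because callers only see the abstract interface, swapping the exact search for an approximate one would not change user code. That is the property this project aims to preserve across backends.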
Expected outcomes
- We have a set of Document Index implementations in DocArray that support the most popular ANN libraries, such as Faiss, Annoy, and HNSWLib.