VectorDB hosted solution takes a lot of time to push vectors #51
Description
I tried to use vectordb's hosted (served) deployment from Jina AI, following the commands in the docs:
from docarray import DocList, BaseDoc
from docarray.typing import NdArray
from vectordb import HNSWVectorDB
import time
import glob


class LogoDoc(BaseDoc):
    embedding: NdArray[768]
    id: str


db = HNSWVectorDB[LogoDoc](
    workspace="hnsw_vectordb",
    space="ip",
    max_elements=2700000,
    ef_construction=256,
    M=16,
    num_threads=8,
)

if __name__ == "__main__":
    with db.serve() as service:
        service.block()
and then tried to push my vectors through the client interface.
I have a collection of 2.5M 768-dimensional vectors to store in the db, so I decided to make batched calls to the db.index method with 64k vectors per call. The code did not respond, so I reduced the batch size to 2; with that, indexing ran at about 5 s/it and the estimated total time was 27 hours. (I assume this happens because the tree construction runs during each index call.)
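For reference, the batching described above can be sketched roughly like this. It is a minimal sketch and not the exact script I ran: `batched` and `index_in_batches` are hypothetical helper names, and `index_fn` is a stand-in for whatever call sends one batch to the server, so no particular vectordb API is assumed here. Timing each call separately should make it easier to see whether the per-call latency grows as the index fills up.

```python
import time
from typing import Any, Callable, Iterator, List, Sequence


def batched(items: Sequence[Any], batch_size: int) -> Iterator[Sequence[Any]]:
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def index_in_batches(
    items: Sequence[Any],
    batch_size: int,
    index_fn: Callable[[Sequence[Any]], Any],
) -> List[float]:
    """Call `index_fn` once per batch and return the duration of each call in seconds.

    `index_fn` is a placeholder (assumption) for the real indexing call.
    """
    durations: List[float] = []
    for batch in batched(items, batch_size):
        started = time.perf_counter()
        index_fn(batch)
        durations.append(time.perf_counter() - started)
    return durations
```

With the real data, `items` would be the list of documents and `batch_size` would be 64000 or 2, as in the two runs above.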
It would be nice if we could speed up the process by letting the user push all the documents first and then build the tree with a separate, explicit API call:
db.push_documents([doc1, doc2, doc3, ...])
db.build_tree()
which could replace db.index(). During the build, CRUD operations could be blocked with an is_building_tree flag, raising an error such as TreeCurrentlyBuildingError() if any are attempted.
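To make the proposal a bit more concrete, here is a rough, library-independent sketch of the behaviour I have in mind. Nothing in it exists in vectordb today: `DeferredIndex` is a hypothetical wrapper, `build_fn` is a placeholder for the real bulk index construction, and `push_documents`, `build_tree`, `is_building_tree` and `TreeCurrentlyBuildingError` are just the names suggested above.

```python
import threading
from typing import Any, Callable, Iterable, List


class TreeCurrentlyBuildingError(RuntimeError):
    """Raised when a CRUD operation is attempted while the index is being built."""


class DeferredIndex:
    """Hypothetical wrapper: buffer documents first, build the index in one explicit step."""

    def __init__(self, build_fn: Callable[[List[Any]], None]) -> None:
        # build_fn is an assumption: it stands in for the real bulk construction.
        self._build_fn = build_fn
        self._pending: List[Any] = []
        self._lock = threading.Lock()
        self.is_building_tree = False

    def _check_not_building(self) -> None:
        if self.is_building_tree:
            raise TreeCurrentlyBuildingError("index build in progress, try again later")

    def push_documents(self, docs: Iterable[Any]) -> None:
        """Buffer documents without touching the index structure."""
        with self._lock:
            self._check_not_building()
            self._pending.extend(docs)

    def build_tree(self) -> None:
        """Build the index once from everything pushed so far."""
        with self._lock:
            self._check_not_building()
            self.is_building_tree = True
            docs, self._pending = self._pending, []
        try:
            # Runs outside the lock, so concurrent CRUD calls fail fast instead of waiting.
            self._build_fn(docs)
        finally:
            with self._lock:
                self.is_building_tree = False
```

Other CRUD methods (update, delete, search) would call the same `_check_not_building` guard. A real implementation would also have to decide what happens to the buffered documents if the build fails partway; this sketch simply drops them.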