[FEATURE] Add new api about vector deletion via generated date not insert date #28
Comments
Thanks for the request. The final solution has not been decided yet; please feel free to share any suggestions. |
Thanks for your solution. Looking forward to your reply, thanks! |
Theoretically, partitioning does not affect the recall rate, but it can affect search performance. For instance, assume we have 10000 vectors. If we put them into 1000 partitions, each partition contains 10 vectors (too few to build an index), so the search is a brute-force search. If we put them into one partition, we can build an index for that partition and get the best search performance. |
OK. If we put all vectors into 100 partitions by date, with one million vectors per partition, roughly how much would search performance decrease compared with a single partition? |
The performance is the same, since one million vectors will be split into small data files anyway (each file is about 1GB by default). |
Thanks! |
For Milvus 0.3.x: it is defined by For Milvus 0.4.x and 0.5.x: it is defined by create_table api. For python example: |
OK, got it. |
'nlist' is the number of clusters that the vectors in a file are split into when the index is built. Assume one file contains 10000 vectors and 'nlist' is set to 200. When the user performs 'build_index', the 10000 vectors are split into 200 clusters (not equally sized), and each cluster has an index. |
Is the recall rate sensitive to the 'nlist' parameter? |
There is another parameter, 'nprobe', related to 'nlist'. 'nprobe' is a search parameter that sets how many clusters are searched to find the topk results; 'nprobe' must always be less than or equal to 'nlist'. Both parameters affect search performance and recall rate. |
OK. If the parameters are set to 'nlist'=100, 'nprobe'=1 versus 'nlist'=100, 'nprobe'=10, how much will query efficiency differ? |
It is hard to say. Query performance is affected by many factors, including data swapping, index parameters, search parameters, hardware capability, and so on. |
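To make the relationship between 'nlist' and 'nprobe' concrete, here is a minimal pure-Python sketch of an inverted-file (IVF) style index. It is not Milvus code: `build_ivf`, `search`, and the toy data are hypothetical names chosen for illustration, and a few rounds of plain k-means stand in for whatever clustering the real index uses.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(vectors, centroids):
    """Put each vector index into the cell of its nearest centroid."""
    cells = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: sq_dist(v, centroids[c]))
        cells[nearest].append(idx)
    return cells


def build_ivf(vectors, nlist, iters=5, seed=0):
    """Split vectors into nlist clusters (cells may end up unequal in size)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    dim = len(vectors[0])
    for _ in range(iters):
        cells = assign(vectors, centroids)
        for c, members in enumerate(cells):
            if members:  # keep the old centroid if a cell is empty
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members)
                    for d in range(dim)
                ]
    return centroids, assign(vectors, centroids)


def search(query, vectors, centroids, cells, nprobe, topk):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    if not 1 <= nprobe <= len(centroids):
        raise ValueError("nprobe must be between 1 and nlist")
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in cells[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:topk]


rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(600)]
NLIST, TOPK = 20, 10
centroids, cells = build_ivf(vectors, NLIST)
query = [rng.random() for _ in range(8)]

# Brute-force ground truth to measure recall against.
exact = set(sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:TOPK])
recalls = {}
for nprobe in (1, 5, NLIST):
    found = set(search(query, vectors, centroids, cells, nprobe, TOPK))
    recalls[nprobe] = len(found & exact) / TOPK
    print(f"nprobe={nprobe:2d} recall={recalls[nprobe]:.2f}")
```

In this sketch, recall can only stay the same or rise as 'nprobe' grows, and with 'nprobe' equal to 'nlist' every cluster is scanned, so the result matches brute force. The cost is that more candidates are compared per query.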
If the above factors are the same, will the query time increase linearly as nprobe increases? |
Query time has several phases: collecting/preparing index files, loading data from disk to CPU, comparing against the index, finding topk in the nprobe clusters, reducing to the final result, serializing the result and sending it to the client, etc. |
Thanks for your detailed analysis. |
Will Milvus recall rate and performance differ significantly between skewed and evenly distributed data sets? |
I don't think it significantly affects recall rate or performance, but I would say that evenly distributed data sets are a better practice. |
OK. In our previous usage, on a data set containing ten million vectors, nlist was set to the default of 16384. With nprobe at 1 and the top 1000 requested, a cluster actually contained 50 vectors, of which 20 were recalled, a recall rate of 40%. When nprobe was increased to 100, the recall rate was 90%. In this case, should we increase nlist and decrease nprobe? |
The default value of 'index_file_size' is 1024MB. Assume the 10M vectors are 512-dimensional, so each file contains about 500000 vectors. With 'nlist' set to 16384, each cluster contains about 30 vectors. With 'nprobe' set to 1 and topk set to 1000, the single cluster might contain only 35 vectors, so only 35 valid items are returned to the client and the recall rate is very poor. |
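The arithmetic in the comment above can be checked with a short sketch. It assumes 4-byte float32 components and ignores any per-file metadata or index overhead, so the figures are estimates for a single file. `ivf_capacity` is a hypothetical helper, not part of Milvus.

```python
def ivf_capacity(index_file_size_mb, dim, nlist, nprobe, bytes_per_component=4):
    """Estimate per-file vector count, average cluster size, and probed candidates.

    Assumes raw float32 storage with no overhead (an illustrative simplification).
    """
    file_bytes = index_file_size_mb * 1024 * 1024
    vectors_per_file = file_bytes // (dim * bytes_per_component)
    avg_cluster_size = vectors_per_file / nlist
    # Upper-bound estimate of how many vectors a search can look at in one file.
    avg_candidates = avg_cluster_size * nprobe
    return vectors_per_file, avg_cluster_size, avg_candidates


TOPK = 1000
per_file, cluster_size, candidates_1 = ivf_capacity(1024, 512, 16384, nprobe=1)
_, _, candidates_100 = ivf_capacity(1024, 512, 16384, nprobe=100)

print(per_file)        # 524288 vectors per 1024MB file
print(cluster_size)    # 32.0 vectors per cluster on average
print(candidates_1)    # 32.0 candidates with nprobe=1, far below topk=1000
print(candidates_100)  # 3200.0 candidates with nprobe=100
```

Under these assumptions, a single probed cluster holds far fewer vectors than the requested topk, which matches the reasoning above. Either a smaller 'nlist' or a larger 'nprobe' raises the number of candidates per query.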
Thanks for your detailed analysis. Best wishes! |
#77 'Support Table partition' is already implemented in 0.6.0. Please wait for the 0.6.0 release. |
Is your feature request related to a problem? Please describe.
I wish I could use Milvus to delete vectors by their generated date. The generated date is the vector's actual production date, which is different from the vector's insert date.
For example: some vectors come from pictures generated between 2019-05-01 and 2019-05-25, but we import the vectors into the Milvus database on 2019-06-01. If we want to delete the vectors generated between 2019-05-01 and 2019-05-10, the existing APIs cannot support it.
Describe the solution you'd like
Add an API like "milvus.delete_vectors_by_range('test01', '2019-06-01', '2020-01-01')", but where the dates refer to the vector's actual production date, not its insert date.
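To illustrate the requested semantics, here is a rough sketch of an in-memory store that groups vectors by generated date and deletes a date range. `DatePartitionedStore` and its methods are hypothetical stand-ins, not the Milvus API, and the range is assumed to be inclusive on both ends.

```python
from collections import defaultdict
from datetime import date, timedelta


class DatePartitionedStore:
    """Hypothetical in-memory stand-in that groups vectors by generated date."""

    def __init__(self):
        # partition tag (ISO date string) -> {vector_id: vector}
        self._partitions = defaultdict(dict)

    def insert(self, vector_id, vector, generated_on):
        """Store a vector under its generated date, regardless of when it is inserted."""
        self._partitions[generated_on.isoformat()][vector_id] = vector

    def delete_by_generated_range(self, start, end):
        """Drop every partition with start <= generated date <= end; return the count removed."""
        removed = 0
        day = start
        while day <= end:
            removed += len(self._partitions.pop(day.isoformat(), {}))
            day += timedelta(days=1)
        return removed

    def count(self):
        return sum(len(p) for p in self._partitions.values())


store = DatePartitionedStore()
# One vector per day, generated 2019-05-01 .. 2019-05-25, all imported later.
for offset in range(25):
    generated = date(2019, 5, 1) + timedelta(days=offset)
    store.insert(offset, [float(offset)] * 4, generated)

removed = store.delete_by_generated_range(date(2019, 5, 1), date(2019, 5, 10))
print(removed)        # 10
print(store.count())  # 15
```

Keying each group by generated date means a range delete only needs to drop whole groups, which is the same idea as tagging partitions by date.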