[FEATURE] Add new api about vector deletion via generated date not insert date #28

njuslj · 2019-10-17T08:34:35Z

Is your feature request related to a problem? Please describe.
I would like to delete vectors in Milvus by their generated date. The generated date is the date a vector was actually produced, which differs from the date it was inserted.

For example: some vectors come from pictures generated between 2019-05-01 and 2019-05-25, but we imported the vectors into the Milvus database on 2019-06-01. If we want to delete the vectors generated between 2019-05-01 and 2019-05-10, the existing APIs cannot do it.

Describe the solution you'd like
Add an API like "milvus.delete_vectors_by_range('test01', '2019-06-01', '2020-01-01')", but where the dates refer to the vector's actual production date rather than its insert date.

yhmo · 2019-10-17T09:46:17Z

Thanks for the request.
So far Milvus doesn't let users specify data partition logic, but we are planning to support it.
A possible solution is:
Extend the insert API: insert(table_name, vector_list, vector_id, partition_hint). If the user provides a partition_hint, the vectors are stored in a partition folder.
Add a new API: delete_vectors_by_partition(table_name, partition_hint). The user calls this API to delete the vectors of a given partition.
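
To make the proposal concrete, here is a minimal in-memory sketch of how date-based partition hints could cover the original use case. It is not Milvus code: the `PartitionedTable` class and the `day_hint` helper are hypothetical, and only the method names `insert` and `delete_vectors_by_partition` follow the proposal above. It assumes one partition per generation day.

```python
from collections import defaultdict
from datetime import date


class PartitionedTable:
    """Toy in-memory stand-in (hypothetical) for a table grouped by partition hint."""

    def __init__(self):
        # partition_hint -> list of (vector_id, vector)
        self._partitions = defaultdict(list)

    def insert(self, vector_list, vector_ids, partition_hint):
        """Store the vectors under the given partition hint."""
        for vid, vec in zip(vector_ids, vector_list):
            self._partitions[partition_hint].append((vid, vec))

    def delete_vectors_by_partition(self, partition_hint):
        """Drop a whole partition and return how many vectors were removed."""
        return len(self._partitions.pop(partition_hint, []))

    def count(self):
        return sum(len(rows) for rows in self._partitions.values())


def day_hint(d):
    """Hypothetical helper: use the ISO generation date as the partition hint."""
    return d.isoformat()


table = PartitionedTable()
# Vectors generated on three different days, all imported later.
table.insert([[0.1, 0.2], [0.3, 0.4]], [1, 2], day_hint(date(2019, 5, 1)))
table.insert([[0.5, 0.6]], [3], day_hint(date(2019, 5, 9)))
table.insert([[0.7, 0.8]], [4], day_hint(date(2019, 5, 20)))

# Delete everything generated between 2019-05-01 and 2019-05-10 inclusive.
removed = 0
for day in range(1, 11):
    removed += table.delete_vectors_by_partition(day_hint(date(2019, 5, day)))

print(removed, table.count())
```

With daily hints, a date-range delete becomes a loop over the days in the range, at the cost of one partition per day.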

The final solution is not yet decided; please feel free to tell us if you have any suggestions.
Thanks!

njuslj · 2019-10-17T11:31:17Z

Thanks for your solution.
If users specify partition logic, such as partitioning by date, will the recall rate or search speed decrease? Is there an upper limit on the number of partitions?

Looking forward to your reply, thanks!

yhmo · 2019-10-17T11:58:27Z

Theoretically, partition logic won't affect the recall rate, but it could affect search performance. For instance, assume we have 10000 vectors. If we put them into 1000 partitions, each partition contains 10 vectors (too few to build an index), so the search becomes a brute-force search. If we put them into one partition, we can build an index for that partition and get the best search performance.
In my opinion Milvus shouldn't limit the number of partitions. Users have to take responsibility for choosing a reasonable partition number.

njuslj · 2019-10-17T12:10:33Z

OK. If we put all vectors into 100 partitions by date, with one million vectors per partition, roughly how much would search performance decrease compared with a single partition?

yhmo · 2019-10-17T12:17:18Z

The performance is the same, since the one million vectors are split into small data files anyway (each file is about 1 GB by default).
The number of partitions affects search performance only when a partition holds too few vectors.

njuslj · 2019-10-17T12:24:21Z

Thanks!
"one million vectors will be split into small data files": is the number of these small data files determined by the parameter "nlist"?

yhmo · 2019-10-17T12:30:18Z

For Milvus 0.3.x: it is defined by index_building_threshold in server_config.yaml.

For Milvus 0.4.x and 0.5.x: it is defined through the create_table API. For example, in Python:
create_table({'table_name': TABLE_NAME, 'dimension': TABLE_DIMENSION, 'index_file_size': 1024, 'metric_type':MetricType.L2})
The unit of 'index_file_size' is MB. The default value is 1024 MB.
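
As a rough illustration of how 'index_file_size' translates into a file count, the sketch below estimates vectors per file and files per data set. It assumes each vector is stored as raw 4-byte floats and ignores any per-file metadata or index overhead, so real numbers will differ somewhat; the helper names are made up for this example.

```python
def vectors_per_file(index_file_size_mb, dimension, bytes_per_value=4):
    """Approximate vectors that fit in one data file (assumes raw float32 storage)."""
    file_bytes = index_file_size_mb * 1024 * 1024
    return file_bytes // (dimension * bytes_per_value)


def file_count(total_vectors, index_file_size_mb, dimension):
    """Approximate number of data files needed, rounded up."""
    per_file = vectors_per_file(index_file_size_mb, dimension)
    return -(-total_vectors // per_file)  # ceiling division


per_file = vectors_per_file(1024, 512)
files = file_count(1_000_000, 1024, 512)
print(per_file, files)
```

Under these assumptions a 1024 MB file holds 524288 vectors of dimension 512, so one million such vectors occupy about two files.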

njuslj · 2019-10-17T12:40:02Z

OK, got it.
For Milvus 0.3.1, what is the effect of the parameter "nlist" in the config file "server_config"?

yhmo · 2019-10-18T01:36:43Z

'nlist' is the number of clusters the vectors within a file are split into when the index is built. Assume one file contains 10000 vectors and 'nlist' is set to 200. When the user performs 'build_index', the 10000 vectors are split into 200 clusters (not equally sized), and each cluster has an index.

njuslj · 2019-10-18T02:06:31Z

Is the recall rate sensitive to the 'nlist' parameter?

yhmo · 2019-10-18T02:36:45Z

There is another parameter, 'nprobe', related to 'nlist'. 'nprobe' is a search parameter that sets how many clusters are picked to find the topk result; 'nprobe' must always be less than or equal to 'nlist'. Both parameters affect search performance and recall rate.
Assume a file contains 10000 vectors.
If you set 'nlist'=1 and 'nprobe'=1, all vectors are in a single cluster and the search engine searches every vector in that cluster. The recall rate must be 100%, but search performance is poor since all 10000 vectors are compared.
If you set 'nlist'=100 and 'nprobe'=1, the 10000 vectors are split into 100 clusters. The search engine first finds the closest cluster, then finds the topk within that cluster. The recall rate may be less than 90%, but search performance is good.
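
The trade-off can be reproduced with a small pure-Python sketch of cluster-based search. It is not Milvus's implementation: centroids are simply sampled from the data instead of being trained, vectors are random, and every function name here is made up for illustration.

```python
import random


def sq_dist(a, b):
    """Squared L2 distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_clusters(vectors, nlist, rng):
    """Assign every vector to the nearest of nlist centroids sampled from the data."""
    centroids = rng.sample(vectors, nlist)
    clusters = [[] for _ in range(nlist)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: sq_dist(vec, centroids[c]))
        clusters[nearest].append(idx)
    return centroids, clusters


def search(query, vectors, centroids, clusters, nprobe, topk):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in clusters[c]]
    candidates.sort(key=lambda i: sq_dist(query, vectors[i]))
    return candidates[:topk]


rng = random.Random(0)
vectors = [[rng.random() for _ in range(8)] for _ in range(2000)]
query = [rng.random() for _ in range(8)]
nlist, topk = 50, 10

centroids, clusters = build_clusters(vectors, nlist, rng)
# Ground truth from a brute-force scan over all vectors.
exact = sorted(range(len(vectors)), key=lambda i: sq_dist(query, vectors[i]))[:topk]

recalls = {}
for nprobe in (1, 5, nlist):
    found = search(query, vectors, centroids, clusters, nprobe, topk)
    recalls[nprobe] = len(set(found) & set(exact)) / topk
    print(f"nprobe={nprobe:>2} recall={recalls[nprobe]:.2f}")
```

Recall can only stay the same or rise as 'nprobe' grows, because each larger setting scans a superset of the clusters scanned before. When 'nprobe' equals 'nlist' the scan covers every vector and matches the brute-force result.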

njuslj · 2019-10-18T02:48:56Z

OK. Between 'nlist'=100, 'nprobe'=1 and 'nlist'=100, 'nprobe'=10, how much will query efficiency differ?

yhmo · 2019-10-18T03:37:53Z

It is hard to say. Query performance is affected by many factors, including data swapping, index parameters, search parameters, hardware capability, and so on.

njuslj · 2019-10-18T03:46:23Z

If the above factors are the same, will the query time increase linearly as nprobe increases?

yhmo · 2019-10-18T06:31:07Z

A query has several phases: collecting and preparing index files, loading data from disk to the CPU, comparing against the index, finding the topk in the nprobe clusters, reducing to the final result, serializing the result and sending it to the client, etc.
The nprobe parameter affects only one of these phases. Although the time cost of that phase depends linearly on nprobe, the whole query time does not grow in proportion to nprobe.
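
A toy cost model shows the effect. The numbers below are invented for illustration, not measurements: it assumes a fixed 20 ms for all phases that do not depend on nprobe and 0.5 ms per probed cluster.

```python
def query_time_ms(nprobe, fixed_ms=20.0, per_probe_ms=0.5):
    """Hypothetical model: nprobe-independent phases plus a per-cluster scan cost."""
    return fixed_ms + per_probe_ms * nprobe


t1 = query_time_ms(1)
t10 = query_time_ms(10)
ratio = t10 / t1
print(t1, t10, round(ratio, 2))
```

In this model the cluster-scan phase costs ten times more at nprobe=10 than at nprobe=1, yet the total query time grows by only about 22%, because the fixed phases dominate.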

njuslj · 2019-10-18T07:19:50Z

Thanks for your detailed analysis.

njuslj · 2019-10-18T07:31:05Z

Will Milvus recall rates and performance differ significantly between skewed and evenly distributed data sets?

yhmo · 2019-10-18T08:10:47Z

I don't think it significantly affects recall rate or performance, but I tend to think evenly distributed data sets are the better practice.

njuslj · 2019-10-18T09:40:32Z

OK. In our previous usage, on a data set of ten million vectors with nlist left at the default of 16384: when nprobe is 1 and we request the top 1000, the cluster actually contains 50 vectors, of which 20 are recalled, a recall rate of 40%. When nprobe was increased to 100, the recall rate was 90%. In this case, should we increase nlist and decrease nprobe?

yhmo · 2019-10-18T11:40:44Z

'index_file_size' defaults to 1024 MB. Assuming the 10M vectors have 512 dimensions, each file contains about 500000 vectors. With 'nlist' set to 16384, each cluster contains about 30 vectors. With 'nprobe' set to 1 and topk set to 1000, the single cluster searched might contain only 35 vectors, so the result will look like this:
id = 12340 distance = 0.0
id = 34743 distance = 71.00025939941406
..... 35 valid items
id = 63112 distance = 92.93685913085938
id = 98257 distance = 93.01753997802734
id = -1 distance = 3.4028234663852886e+38
id = -1 distance = 3.4028234663852886e+38
......
id = -1 distance = 3.4028234663852886e+38
..... 965 invalid items

Only 35 valid items are returned to the client, so the recall rate is very poor.
To increase the recall rate, you need to increase 'nprobe'. The larger the 'nprobe', the higher the recall rate. If 'nprobe' equals 'nlist', the recall rate is 100%.
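
On the client side, the padded entries can be recognized by id = -1 (their distance is the largest float32 value). The sketch below assumes the results are available as a plain list of (id, distance) tuples, which is a simplification and not the SDK's actual result type. It also repeats the cluster-size arithmetic from above.

```python
FLOAT32_MAX = 3.4028234663852886e+38  # distance value used for padded entries


def valid_items(results):
    """Keep only real hits; padded entries carry id == -1."""
    return [(vid, dist) for vid, dist in results if vid != -1]


# A shortened version of the result listing above (assumed tuple layout).
results = [
    (12340, 0.0),
    (34743, 71.00025939941406),
    (63112, 92.93685913085938),
    (-1, FLOAT32_MAX),
    (-1, FLOAT32_MAX),
]
hits = valid_items(results)

# Expected average cluster size for about 500000 vectors per file and nlist=16384.
expected_cluster_size = 500_000 / 16_384

print(len(hits), round(expected_cluster_size, 1))
```

If the number of valid hits is far below the requested topk, that is a sign that 'nprobe' is too small relative to 'nlist' for the requested topk.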

njuslj · 2019-10-21T07:52:21Z

Thanks for your detailed analysis.
Looking forward to the new API for vector deletion by generated date.

Best wishes!

yhmo · 2019-11-08T03:33:14Z

#77 'Support Table partition' is already implemented in 0.6.0. Please wait for the 0.6.0 release.

njuslj changed the title ~~[FEATURE]~~ [FEATURE] Add new api about vector deletion via vector's real generated date not insert date Oct 17, 2019

njuslj changed the title ~~[FEATURE] Add new api about vector deletion via vector's real generated date not insert date~~ [FEATURE] Add new api about vector deletion via vector's generated date not insert date Oct 17, 2019

njuslj changed the title ~~[FEATURE] Add new api about vector deletion via vector's generated date not insert date~~ [FEATURE] Add new api about vector deletion via generated date not insert date Oct 17, 2019

yhmo mentioned this issue Oct 22, 2019

[FEATURE] Support table partition #77

Closed

yhmo added the kind/enhancement Issues or changes related to enhancement label Oct 24, 2019

yhmo closed this as completed Nov 8, 2019

lale314 mentioned this issue Jan 13, 2020

Server crashed at startup in CentOS 7.6 #981

Closed

xuqinkun mentioned this issue Jul 21, 2021

Run rootcoord error #6685

Closed

qi49125 mentioned this issue Feb 18, 2022

[Bug]: The cluster cannot release the loaded collection or query #15623

Closed

1 task

jaime0815 pushed a commit to jaime0815/milvus that referenced this issue Nov 18, 2022

Merge pull request milvus-io#28 from youny626/branch-0.5.0

e8f01f6

add examples

yah01 pushed a commit to yah01/milvus that referenced this issue Feb 13, 2023

Fix cmake build on Mac and add Mac Github Action runner (milvus-io#28)

2954cf9

NicoYuan1986 mentioned this issue Jul 28, 2023

[Bug]: [Nightly]DataNode crash reporting error syncTimestamp Failed : find no available rootcoord #25976

Closed

1 task

chasingegg pushed a commit to chasingegg/milvus that referenced this issue May 29, 2024

Merge pull request milvus-io#28 from chasingegg/cluster-major

81abbdf

Fix autoindex optimization
