
[Bug]: Milvus embedding search with filtering does not work in the first 5-10 minutes #37098

Open
kiranchitturi opened this issue Oct 24, 2024 · 22 comments
Labels: kind/bug (Issues or changes related to a bug), triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

@kiranchitturi

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.4
- Deployment mode(standalone or cluster): standalone and cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I am seeing weird behavior: for the first 5-10 minutes after the collection is created, embedding search with scalar filtering returns irrelevant results (results that do not match the filtering criteria).

This resolves after some time (5-10 mins). What causes this behavior, and how can it be remediated?

I have seen this issue in both standalone and cluster deployments.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@kiranchitturi kiranchitturi added kind/bug (Issues or changes related to a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Oct 24, 2024
@yanliang567
Contributor

@kiranchitturi quick questions:

  1. were the irrelevant results inserted just a few seconds before the search?
  2. do you have duplicated pk values in Milvus?
  3. could you please retry on Milvus 2.4.13-hotfix? If it reproduces, please attach the Milvus logs.

/assign @kiranchitturi
/unassign

@yanliang567 yanliang567 added triage/needs-information (Indicates an issue needs more information in order to work on it.) and removed needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Oct 24, 2024
@yanliang567
Contributor

In case you don't know how to collect the Milvus logs:
Please refer to this doc to export the full Milvus logs for investigation.
For Milvus installed with docker-compose, you can use `docker-compose logs > milvus.log` to export the logs.

@kiranchitturi
Author

Thanks for the quick reply @yanliang567!

> are the irrelevant results just inserted a few seconds ago?

The whole dataset was inserted just a few minutes ago. So this happens right after the collection is created and the data is inserted, and for the first 5-10 mins the filtered results are irrelevant.

> do you have duplicated pk values in milvus?

I don't think so, but I will check on this.
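For example, a quick client-side sanity check on the batch before inserting (just a sketch; `data` here stands for whatever list of row dicts we insert):

```python
# count duplicated primary keys in the batch before inserting
ids = [row["id"] for row in data]
duplicates = len(ids) - len(set(ids))
print(f"{duplicates} duplicated pk values in the batch")
```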

> could you please retry on milvus 2.4.13-hotfix? if it reproduced, please help to attach the milvus logs.

I will try to reproduce it on 2.4.4 independently, without our use case, first, and then try 2.4.13-hotfix.

What's surprising to me is that the queries work fine after 10 mins, with no issues.

@xiaofan-luan
Collaborator

I don't think that really makes sense.

  1. could you share how you do insert and search, if possible? What distance do you get in the first couple of minutes, and what distance do you get later?
  2. if you are still doing a PoC, would you mind upgrading to 2.4.13 and checking?

@kiranchitturi
Author

@xiaofan-luan I am able to reproduce it consistently now with the sample code below (dummy data) and version 2.4.4. Unfortunately, we are stuck on 2.4.4 for now until we can upgrade our clusters to the latest bug-fix version.

  1. Client import and init

```python
from pymilvus import (
    utility,
    FieldSchema, CollectionSchema, DataType, MilvusClient,
    Collection, AnnSearchRequest, RRFRanker, connections,
)
from pymilvus.client.constants import ConsistencyLevel
import random
import json

# 1. Set up a Milvus client
client = MilvusClient(
    uri="http://localhost:19530"
)
```
  2. Create data

```python
data = []
colors = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"]
for i in range(70000):
    current_color = random.choice(colors)
    doc = {
        "id": str(i),
        "embeddings": [ random.uniform(-1, 1) for _ in range(768) ],
        "title": current_color,
        "doc_type": "type_1"
    }
    if i > 69500:
        doc["doc_type"] = "type_2"
    doc["uri"] = "uri" + str(int(i/1000))
    data.append(doc)
```
  3. Insert data

```python
# create_collection is a user-defined helper (see the sketch after the output below)
create_collection(client, "quick_test_v0", 768)
for i in range(0, len(data), 1000):
    print(f"indexing docs from {i} to {i+1000}")
    client.insert(collection_name="quick_test_v0", data=data[i:i+10000])
client.create_alias("quick_test_v0", "quick_test")
print(res)
```
  4. Query

```python
query_results = client.search(
    collection_name="quick_test",
    data=[data[-1]["embeddings"]],
    limit=5,
    filter=f"doc_type == 'type_2'",
    group_by_field="uri",
    output_fields=["id", "title", "doc_type", "uri"],
    search_params={"metric_type": "COSINE", "params": {}}
)[0]
print(query_results)
```

returns

```python
[{'id': '69103', 'distance': 0.9999996423721313, 'entity': {'id': '69103', 'title': 'pink', 'doc_type': 'type_1', 'uri': 'uri69'}}]
```

The search returned a type_1 document even though I explicitly asked for a type_2 document.
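For reference, `create_collection` in step 3 is a small helper of mine; a minimal sketch of what it does (the field lengths and the AUTOINDEX/COSINE index choice here are assumptions, not my exact code):

```python
def create_collection(client, name, dim):
    # schema reconstructed from the fields used in the data above
    schema = MilvusClient.create_schema(auto_id=False)
    schema.add_field(field_name="id", datatype=DataType.VARCHAR, is_primary=True, max_length=64)
    schema.add_field(field_name="embeddings", datatype=DataType.FLOAT_VECTOR, dim=dim)
    schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=256)
    schema.add_field(field_name="doc_type", datatype=DataType.VARCHAR, max_length=64)
    schema.add_field(field_name="uri", datatype=DataType.VARCHAR, max_length=256)
    index_params = client.prepare_index_params()
    index_params.add_index(field_name="embeddings", index_type="AUTOINDEX", metric_type="COSINE")
    # passing index_params also builds the index and loads the collection
    client.create_collection(name, schema=schema, index_params=index_params)
```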

@kiranchitturi
Author

The returned document above shouldn't have that score either. If I filter on that exact document, the relevance is very low:

```python
query_results = client.search(
    collection_name="quick_test",
    data=[data[-1]["embeddings"]],
    limit=5,
    filter=f"id == '69103'",
    group_by_field="uri",
    output_fields=["id", "title", "doc_type", "uri"],
    search_params={"metric_type": "COSINE", "params": {}}
)[0]
print(query_results)
```

```python
[{'id': '69103', 'distance': -0.009151088073849678, 'entity': {'id': '69103', 'title': 'pink', 'doc_type': 'type_1', 'uri': 'uri69'}}]
```

@kiranchitturi
Author

kiranchitturi commented Oct 25, 2024

Interestingly, if I update the filter to `filter=f"id == '69999'"`, it returns the wrong document:

```python
[{'id': '69103', 'distance': 0.9999998807907104, 'entity': {'doc_type': 'type_1', 'uri': 'uri69', 'id': '69103', 'title': 'pink'}}]
```

The score is what I would expect from 69999, but nothing else matches. Somewhere the association mapping is failing.

The number of type_2 docs is less than 500, and all searches are returning only type_1 documents.

@xiaofan-luan
Collaborator

@yanliang567 could you try to reproduce this issue on a later Milvus version?

@yanliang567
Contributor

@kiranchitturi quick question:
in the insert-data part, it seems that you are inserting data with duplicated ids. This is not recommended, as Milvus does not de-duplicate on pk for now, and it makes search results unstable.

```python
client.insert(collection_name="quick_test_v0", data=data[i:i+10000])  # i+10000 causes insertion with dup ids
```
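A corrected loop would step and slice by the same batch size, e.g.:

```python
batch = 1000
for i in range(0, len(data), batch):
    # slice matches the step, so every row is inserted exactly once
    client.insert(collection_name="quick_test_v0", data=data[i:i+batch])
```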

@yanliang567
Contributor

It proves to be a pymilvus client issue. I can reproduce it with pymilvus client scripts, while I cannot reproduce it with pymilvus ORM scripts.
@XuanYang-cn could you please help take a look: does the pymilvus client always use consistency level Strong?

/assign @XuanYang-cn
/unassign

@yanliang567 yanliang567 added triage/accepted (Indicates an issue or PR is ready to be actively worked on.) and removed triage/needs-information (Indicates an issue needs more information in order to work on it.) labels Oct 26, 2024
@yanliang567 yanliang567 added this to the 2.4.14 milestone Oct 26, 2024
@kiranchitturi
Author

@yanliang567 good catch on the duplication; there was a typo there. Were you able to reproduce it with the latest version after fixing the duplicates issue?

Do you think it happens due to the consistency level?

@yanliang567
Contributor

I think it is due to the consistency level. In case it blocks you on the PoC, you can use the pymilvus ORM instead for now.
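If you want to rule consistency in or out without switching clients, one thing to try (assuming your pymilvus version forwards this kwarg; a sketch to verify, not a confirmed fix) is to request Strong consistency explicitly on the search:

```python
# ask the server to see all data written before this search was issued
query_results = client.search(
    collection_name="quick_test",
    data=[data[-1]["embeddings"]],
    limit=5,
    filter="doc_type == 'type_2'",
    output_fields=["id", "title", "doc_type", "uri"],
    search_params={"metric_type": "COSINE", "params": {}},
    consistency_level="Strong",
)[0]
```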

@kiranchitturi
Author

> I think it is due to the consistency level. In case it blocks you on the PoC, you can use the pymilvus ORM instead for now.

I will try that. It's interesting to figure out whether it's a client issue or a server issue. Do you think it's an issue with gRPC calls vs REST API calls?

@xiaofan-luan
Collaborator

> I think it is due to the consistency level. In case it blocks you on the PoC, you can use the pymilvus ORM instead for now.

How is this related to consistency? The filter is doc_type == 'type_2', but they get doc_type == 'type_1'.

@xiaofan-luan
Collaborator

```python
create_collection(client, "quick_test_v0", 768)
for i in range(0, len(data), 1000):
    print(f"indexing docs from {i} to {i+1000}")
    client.insert(collection_name="quick_test_v0", data=data[i:i+10000])
client.create_alias("quick_test_v0", "quick_test")
print(res)
```

@kiranchitturi

I think there is definitely a bug here: you iterate with a step of 1000, but each time you insert 10000 rows.

@xiaofan-luan
Collaborator

I think this is related to how you generate your data; it's better to check what's actually in your database.

I actually haven't heard feedback about any strange bug like this, so my suggestion is to carefully check your scripts.

Also, we don't recommend using random embeddings for recall tests (`"embeddings": [ random.uniform(-1, 1) for _ in range(768) ]`) because they don't really match real-world use cases.
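For example, a plain scalar query (no vector search involved) shows what is actually stored for a given pk; a sketch using the collection alias from the repro script:

```python
# fetch the stored entity by primary key to verify its scalar fields
rows = client.query(
    collection_name="quick_test",
    filter="id == '69103'",
    output_fields=["id", "title", "doc_type", "uri"],
)
print(rows)
```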

@kiranchitturi
Author

> I think this is related to how you generate your data, it's better to check what's actually in your database.

Yes, that's what I was referring to earlier: were you able to replicate it after fixing the duplication bug? I have not seen the bug since fixing the duplication, but let me try it again.

Is it better to load the collection after bulk-indexing the data, or right after creating the collection?

@kiranchitturi
Author

> also we don't recommend to use random embeddings for recall test "embeddings": [ random.uniform(-1, 1) for _ in range(768) ] because it doesn't really match the real world use case

I used random embeddings only to replicate the bug, because I can't share my real scripts.

@yanliang567
Contributor

yanliang567 commented Oct 28, 2024

> Is it better to load the collection after bulk indexing the data or creating the collection?

You have to load the collection after creating the index, or Milvus returns errors.
Also, please upgrade Milvus and pymilvus to the latest release and retry. The filtering works well on Milvus 2.4.13-hotfix with pymilvus 2.4.8.
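So with the MilvusClient API the order is roughly: create the collection, insert data, create the index, then load (a sketch; the AUTOINDEX/COSINE parameters are just an example, not a prescribed config):

```python
# build the index on the vector field first...
index_params = client.prepare_index_params()
index_params.add_index(field_name="embeddings", index_type="AUTOINDEX", metric_type="COSINE")
client.create_index(collection_name="quick_test_v0", index_params=index_params)
# ...then load the collection so it becomes searchable
client.load_collection(collection_name="quick_test_v0")
```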

@xiaofan-luan
Collaborator

> I have used this only for replicating the bug bcoz I can't share my real scripts

I suspect there are some other bugs in your code, so please check it carefully.
Right now, from all the information I have, this is not a Milvus bug; it is highly likely there is a bug in how the data is written into the system.

@XuanYang-cn
Contributor

Seems it is not a pymilvus issue.
/unassign

@yanliang567
Contributor

@kiranchitturi any chance you have tried the latest pymilvus and Milvus?

@yanliang567 yanliang567 modified the milestones: 2.4.14, 2.4.16 Nov 14, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.16, 2.4.17, 2.4.18 Nov 21, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.18, 2.4.19, 2.4.20 Dec 24, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.20, 2.4.21 Jan 6, 2025