
[Bug]: Milvus embedding search with filtering does not work in the first 5-10 minutes #37098

Open
kiranchitturi opened this issue Oct 24, 2024 · 22 comments
Labels: kind/bug (Issues or changes related to a bug), triage/accepted (Indicates an issue or PR is ready to be actively worked on.)

@kiranchitturi

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.4
- Deployment mode(standalone or cluster): standalone and cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I am seeing weird behavior: for the first 5-10 minutes after the collection is created, embedding search with scalar filtering returns irrelevant results (results that do not match the filtering criteria).

This resolves after some time (5-10 mins). What causes this behavior, and how can it be remediated?

I have seen this issue in both standalone and cluster deployments.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@kiranchitturi kiranchitturi added kind/bug (Issues or changes related to a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Oct 24, 2024
@yanliang567
Contributor

@kiranchitturi quick questions:

  1. were the irrelevant results inserted just a few seconds before the search?
  2. do you have duplicated pk values in Milvus?
  3. could you please retry on Milvus 2.4.13-hotfix? If it reproduces, please attach the Milvus logs.

/assign @kiranchitturi
/unassign

@yanliang567 yanliang567 added triage/needs-information (Indicates an issue needs more information in order to work on it.) and removed needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one.) labels Oct 24, 2024
@yanliang567
Contributor

In case you don't know how to collect the Milvus logs:
Please refer to this doc to export the full Milvus logs for investigation.
For Milvus installed with docker-compose, you can use `docker-compose logs > milvus.log` to export the logs.

@kiranchitturi
Author

Thanks for the quick reply @yanliang567!

> are the irrelevant results just inserted a few seconds ago?

The whole dataset was inserted just a few minutes ago. So this happens right after the collection is created and the data is inserted, and for the first 5-10 mins the filtered results are irrelevant.

> do you have duplicated pk values in milvus?

I don't think so, but I will check on this.
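For example, a quick client-side sanity check on the batch before inserting (just a sketch; `data` here stands for whatever list of row dicts we insert):

```python
# count duplicated primary keys in the batch before inserting
ids = [row["id"] for row in data]
duplicates = len(ids) - len(set(ids))
print(f"{duplicates} duplicated pk values in the batch")
```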

> could you please retry on milvus 2.4.13-hotfix? if it reproduced, please help to attach the milvus logs.

I will try to reproduce it on 2.4.4 independently, without our use case, first, and then try 2.4.13-hotfix.

What's surprising to me is that the queries work fine after 10 mins, with no issues.

@xiaofan-luan
Collaborator

I don't think that really makes sense.

  1. could you share how you do insert and search, if possible? What distance do you get in the first couple of minutes, and what distance do you get later?
  2. if you are still doing a PoC, would you mind upgrading to 2.4.13 and checking?

@kiranchitturi
Author

@xiaofan-luan I am able to reproduce it consistently now with the sample code below (dummy data) and version 2.4.4. Unfortunately, we are stuck on 2.4.4 for now until we can upgrade our clusters to the latest bug-fix version.

  1. Client import and init

```python
from pymilvus import (
    utility,
    FieldSchema, CollectionSchema, DataType, MilvusClient,
    Collection, AnnSearchRequest, RRFRanker, connections,
)
from pymilvus.client.constants import ConsistencyLevel
import random
import json

# 1. Set up a Milvus client
client = MilvusClient(
    uri="http://localhost:19530"
)
```
  2. Create data

```python
data = []
colors = ["green", "blue", "yellow", "red", "black", "white", "purple", "pink", "orange", "brown", "grey"]
for i in range(70000):
    current_color = random.choice(colors)
    doc = {
        "id": str(i),
        "embeddings": [ random.uniform(-1, 1) for _ in range(768) ],
        "title": current_color,
        "doc_type": "type_1"
    }
    if i > 69500:
        doc["doc_type"] = "type_2"
    doc["uri"] = "uri" + str(int(i/1000))
    data.append(doc)
```
  3. Insert data

```python
# create_collection is a user-defined helper (see the sketch after the output below)
create_collection(client, "quick_test_v0", 768)
for i in range(0, len(data), 1000):
    print(f"indexing docs from {i} to {i+1000}")
    client.insert(collection_name="quick_test_v0", data=data[i:i+10000])
client.create_alias("quick_test_v0", "quick_test")
print(res)
```
  4. Query

```python
query_results = client.search(
    collection_name="quick_test",
    data=[data[-1]["embeddings"]],
    limit=5,
    filter=f"doc_type == 'type_2'",
    group_by_field="uri",
    output_fields=["id", "title", "doc_type", "uri"],
    search_params={"metric_type": "COSINE", "params": {}}
)[0]
print(query_results)
```

returns

```python
[{'id': '69103', 'distance': 0.9999996423721313, 'entity': {'id': '69103', 'title': 'pink', 'doc_type': 'type_1', 'uri': 'uri69'}}]
```

The search returned a type_1 document even though I explicitly asked for a type_2 document.
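For reference, `create_collection` in step 3 is a small helper of mine; a minimal sketch of what it does (the field lengths and the AUTOINDEX/COSINE index choice here are assumptions, not my exact code):

```python
def create_collection(client, name, dim):
    # schema reconstructed from the fields used in the data above
    schema = MilvusClient.create_schema(auto_id=False)
    schema.add_field(field_name="id", datatype=DataType.VARCHAR, is_primary=True, max_length=64)
    schema.add_field(field_name="embeddings", datatype=DataType.FLOAT_VECTOR, dim=dim)
    schema.add_field(field_name="title", datatype=DataType.VARCHAR, max_length=256)
    schema.add_field(field_name="doc_type", datatype=DataType.VARCHAR, max_length=64)
    schema.add_field(field_name="uri", datatype=DataType.VARCHAR, max_length=256)
    index_params = client.prepare_index_params()
    index_params.add_index(field_name="embeddings", index_type="AUTOINDEX", metric_type="COSINE")
    # passing index_params also builds the index and loads the collection
    client.create_collection(name, schema=schema, index_params=index_params)
```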

@kiranchitturi
Author

The returned document above shouldn't have that score either. If I filter on that exact document, the relevance is very low:

```python
query_results = client.search(
    collection_name="quick_test",
    data=[data[-1]["embeddings"]],
    limit=5,
    filter=f"id == '69103'",
    group_by_field="uri",
    output_fields=["id", "title", "doc_type", "uri"],
    search_params={"metric_type": "COSINE", "params": {}}
)[0]
print(query_results)
```

```python
[{'id': '69103', 'distance': -0.009151088073849678, 'entity': {'id': '69103', 'title': 'pink', 'doc_type': 'type_1', 'uri': 'uri69'}}]
```

@kiranchitturi
Author

kiranchitturi commented Oct 25, 2024

Interestingly, if I update the filter to `filter=f"id == '69999'"`, it returns the wrong document:

```python
[{'id': '69103', 'distance': 0.9999998807907104, 'entity': {'doc_type': 'type_1', 'uri': 'uri69', 'id': '69103', 'title': 'pink'}}]
```

The score is what I would expect from 69999, but nothing else matches. Somewhere the association mapping is failing.

The number of type_2 docs is less than 500, and all searches are returning only type_1 documents.

@xiaofan-luan
Collaborator

@yanliang567 could you try to reproduce this issue on a later Milvus version?

@yanliang567
Contributor

@kiranchitturi quick question:
in the insert-data part, it seems that you are inserting data with duplicated ids. This is not recommended, as Milvus does not de-duplicate on pk for now, and it makes search results unstable.

```python
client.insert(collection_name="quick_test_v0", data=data[i:i+10000])  # i+10000 causes insertion with dup ids
```
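A corrected loop would step and slice by the same batch size, e.g.:

```python
batch = 1000
for i in range(0, len(data), batch):
    # slice matches the step, so every row is inserted exactly once
    client.insert(collection_name="quick_test_v0", data=data[i:i+batch])
```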

@yanliang567
Contributor

It proves to be a pymilvus client issue. I can reproduce it with pymilvus client scripts, while I cannot reproduce it with pymilvus ORM scripts.
@XuanYang-cn could you please help take a look: does the pymilvus client always use consistency level Strong?

/assign @XuanYang-cn
/unassign

@yanliang567 yanliang567 added triage/accepted (Indicates an issue or PR is ready to be actively worked on.) and removed triage/needs-information (Indicates an issue needs more information in order to work on it.) labels Oct 26, 2024
@yanliang567 yanliang567 added this to the 2.4.14 milestone Oct 26, 2024
@kiranchitturi
Author

@yanliang567 good catch on the duplication; there was a typo there. Were you able to reproduce it with the latest version after fixing the duplicates issue?

Do you think it happens due to the consistency level?

@yanliang567
Contributor

I think it is due to the consistency level. In case it blocks you on the PoC, you can use the pymilvus ORM instead for now.
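If you want to rule consistency in or out without switching clients, one thing to try (assuming your pymilvus version forwards this kwarg; a sketch to verify, not a confirmed fix) is to request Strong consistency explicitly on the search:

```python
# ask the server to see all data written before this search was issued
query_results = client.search(
    collection_name="quick_test",
    data=[data[-1]["embeddings"]],
    limit=5,
    filter="doc_type == 'type_2'",
    output_fields=["id", "title", "doc_type", "uri"],
    search_params={"metric_type": "COSINE", "params": {}},
    consistency_level="Strong",
)[0]
```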

@kiranchitturi
Author

> I think it is due to the consistency level. In case it blocks you on the PoC, you can use the pymilvus ORM instead for now.

I will try that. It's interesting to figure out whether it's a client issue or a server issue. Do you think it's an issue with gRPC calls vs REST API calls?

@xiaofan-luan
Collaborator

> I think it is due to the consistency level. In case it blocks you on the PoC, you can use the pymilvus ORM instead for now.

How is this related to consistency? The filter is doc_type == 'type_2', but they get doc_type == 'type_1'.

@xiaofan-luan
Collaborator

```python
create_collection(client, "quick_test_v0", 768)
for i in range(0, len(data), 1000):
    print(f"indexing docs from {i} to {i+1000}")
    client.insert(collection_name="quick_test_v0", data=data[i:i+10000])
client.create_alias("quick_test_v0", "quick_test")
print(res)
```

@kiranchitturi

I think there is definitely a bug here: you iterate with a step of 1000, but each time you insert 10000 rows.

@xiaofan-luan
Collaborator

I think this is related to how you generate your data; it's better to check what's actually in your database.

I actually haven't heard feedback about any strange bug like this, so my suggestion is to carefully check your scripts.

Also, we don't recommend using random embeddings for recall tests (`"embeddings": [ random.uniform(-1, 1) for _ in range(768) ]`) because they don't really match real-world use cases.
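For example, a plain scalar query (no vector search involved) shows what is actually stored for a given pk; a sketch using the collection alias from the repro script:

```python
# fetch the stored entity by primary key to verify its scalar fields
rows = client.query(
    collection_name="quick_test",
    filter="id == '69103'",
    output_fields=["id", "title", "doc_type", "uri"],
)
print(rows)
```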

@kiranchitturi
Author

> I think this is related to how you generate your data, it's better to check what's actually in your database.

Yes, that's what I was referring to earlier: were you able to replicate it after fixing the duplication bug? I have not seen the bug since fixing the duplication, but let me try it again.

Is it better to load the collection after bulk-indexing the data, or right after creating the collection?

@kiranchitturi
Author

> also we don't recommend to use random embeddings for recall test "embeddings": [ random.uniform(-1, 1) for _ in range(768) ] because it doesn't really match the real world use case

I used random embeddings only to replicate the bug, because I can't share my real scripts.

@yanliang567
Contributor

yanliang567 commented Oct 28, 2024

> Is it better to load the collection after bulk indexing the data or creating the collection?

You have to load the collection after creating the index, or Milvus returns errors.
Also, please upgrade Milvus and pymilvus to the latest release and retry. The filtering works well on Milvus 2.4.13-hotfix with pymilvus 2.4.8.
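So with the MilvusClient API the order is roughly: create the collection, insert data, create the index, then load (a sketch; the AUTOINDEX/COSINE parameters are just an example, not a prescribed config):

```python
# build the index on the vector field first...
index_params = client.prepare_index_params()
index_params.add_index(field_name="embeddings", index_type="AUTOINDEX", metric_type="COSINE")
client.create_index(collection_name="quick_test_v0", index_params=index_params)
# ...then load the collection so it becomes searchable
client.load_collection(collection_name="quick_test_v0")
```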

@xiaofan-luan
Collaborator

> I have used this only for replicating the bug bcoz I can't share my real scripts

I suspect there are some other bugs in your code, so please check it carefully.
Right now, from all the information I have, this is not a Milvus bug; it is highly likely there is a bug in how the data is written into the system.

@XuanYang-cn
Contributor

Seems it is not a pymilvus issue.
/unassign

@yanliang567
Contributor

@kiranchitturi any chance you have tried the latest pymilvus and Milvus?

@yanliang567 yanliang567 modified the milestones: 2.4.14, 2.4.16 Nov 14, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.16, 2.4.17, 2.4.18 Nov 21, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.18, 2.4.19, 2.4.20 Dec 24, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.20, 2.4.21 Jan 6, 2025