[Bug]: [new_indexes] The searched results become less than limit * group_size after creating the new HNSW indexes after groupby search with group size #37601

Closed
binbinlv opened this issue Nov 12, 2024 · 3 comments
Labels: 2.5-features, ci/bug, kind/bug, triage/accepted

binbinlv commented Nov 12, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master latest
- Deployment mode(standalone or cluster): both
- MQ type(rocksmq, pulsar or kafka):    all
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus latest
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

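After the new HNSW indexes are created, a group-by search with group_size set and group_strict_size=True returns fewer entities than limit * group_size. In the failing run below, one query returned 63 entities instead of 50 * 5 = 250.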


[pytest : test] self = <test_mix_scenes.TestGroupSearchNewHNSWIndex object at 0x7fd6a7bfc760>
[pytest : test] group_by_field = 'VARCHAR'
[pytest : test] 
[pytest : test]     @pytest.mark.tags(CaseLabel.L0)
[pytest : test]     @pytest.mark.parametrize("group_by_field", [DataType.VARCHAR.name, "varchar_inverted"])
[pytest : test]     def test_search_group_size_new_hnsw_index(self, group_by_field):
[pytest : test]         """
[pytest : test]         target:
[pytest : test]             1. search on 4 different float vector fields with group by varchar field with group size
[pytest : test]         verify results entity = limit * group_size  and group size is full if group_strict_size is True
[pytest : test]         verify results group counts = limit if group_strict_size is False
[pytest : test]         """
[pytest : test]         nq = 2
[pytest : test]         limit = 50
[pytest : test]         group_size = 5
[pytest : test]         for j in range(len(self.vector_fields)):
[pytest : test]             search_vectors = cf.gen_vectors(nq, dim=self.dims[j], vector_data_type=self.vector_fields[j])
[pytest : test]             search_params = {"params": cf.get_search_params_params(self.index_types[j])}
[pytest : test]             # when group_strict_size=true, it shall return results with entities = limit * group_size
[pytest : test]             res1 = self.collection_wrap.search(data=search_vectors, anns_field=self.vector_fields[j],
[pytest : test]                                                param=search_params, limit=limit,
[pytest : test]                                                group_by_field=group_by_field,
[pytest : test]                                                group_size=group_size, group_strict_size=True,
[pytest : test]                                                output_fields=[group_by_field])[0]
[pytest : test]             for i in range(nq):
[pytest : test] >               assert len(res1[i]) == limit * group_size
[pytest : test] E               assert 63 == (50 * 5)

Expected Behavior

With group_strict_size=True, each query should return exactly limit * group_size entities.

Steps To Reproduce

    @pytest.mark.tags(CaseLabel.L0)
    @pytest.mark.parametrize("group_by_field", [DataType.VARCHAR.name, "varchar_inverted"])
    def test_search_group_size_new_hnsw_index(self, group_by_field):
        """
        target:
            1. search on 4 different float vector fields with group by varchar field with group size
        verify results entity = limit * group_size  and group size is full if group_strict_size is True
        verify results group counts = limit if group_strict_size is False
        """
        nq = 2
        limit = 50
        group_size = 5
        for j in range(len(self.vector_fields)):
            search_vectors = cf.gen_vectors(nq, dim=self.dims[j], vector_data_type=self.vector_fields[j])
            search_params = {"params": cf.get_search_params_params(self.index_types[j])}
            # when group_strict_size=true, it shall return results with entities = limit * group_size
            res1 = self.collection_wrap.search(data=search_vectors, anns_field=self.vector_fields[j],
                                               param=search_params, limit=limit,
                                               group_by_field=group_by_field,
                                               group_size=group_size, group_strict_size=True,
                                               output_fields=[group_by_field])[0]
            for i in range(nq):
                assert len(res1[i]) == limit * group_size
                for l in range(limit):
                    group_values = []
                    for k in range(group_size):
                        group_values.append(res1[i][l*group_size+k].fields.get(group_by_field))
                    assert len(set(group_values)) == 1

            # when group_strict_size=false, it shall return results with group counts = limit
            res1 = self.collection_wrap.search(data=search_vectors, anns_field=self.vector_fields[j],
                                               param=search_params, limit=limit,
                                               group_by_field=group_by_field,
                                               group_size=group_size, group_strict_size=False,
                                               output_fields=[group_by_field])[0]
            for i in range(nq):
                group_values = []
                for l in range(len(res1[i])):
                    group_values.append(res1[i][l].fields.get(group_by_field))
                assert len(set(group_values)) == limit
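
The two checks in the test above can also be run against plain lists of group values, which may help when diagnosing a short result such as the 63-entity one without a running Milvus instance. The following is a minimal sketch, not part of the Milvus test suite: `check_grouped_hits` is a hypothetical helper, and it counts hits per group rather than checking that each group's hits are contiguous as the original test does.

```python
from collections import Counter


def check_grouped_hits(group_values, limit, group_size, strict):
    """Return a list of problems found in one query's grouped hits.

    group_values: the group-by field value of every returned hit, in order.
    This is a hypothetical helper that works on plain Python lists.
    """
    problems = []
    counts = Counter(group_values)

    # Both modes expect exactly `limit` distinct groups.
    if len(counts) != limit:
        problems.append(f"expected {limit} groups, got {len(counts)}")

    if strict:
        # Strict mode additionally expects every group to be full.
        expected = limit * group_size
        if len(group_values) != expected:
            problems.append(f"expected {expected} hits, got {len(group_values)}")
        not_full = sum(1 for n in counts.values() if n != group_size)
        if not_full:
            problems.append(f"{not_full} group(s) not of size {group_size}")

    return problems


# Synthetic data: 3 groups requested, 2 hits per group.
full = [f"g{i}" for i in range(3) for _ in range(2)]
partial = ["g0", "g0", "g1", "g2"]

print(check_grouped_hits(full, limit=3, group_size=2, strict=True))
print(check_grouped_hits(partial, limit=3, group_size=2, strict=True))
print(check_grouped_hits(partial, limit=3, group_size=2, strict=False))
```

The `partial` list passes the non-strict check (three distinct groups) but fails the strict one. That is the same kind of result the failing assertion reports: fewer than limit * group_size hits.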

Milvus Log

test log: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20HA%20CI/detail/PR-37136/14/pipeline/

milvus log: artifacts-milvus-standalone-ms-37136-14-py-pr-37136-14-e2e-logs.tar.gz

Anything else?

collection name: TestGroupSearchNewHNSWIndex_GflxWGH7
index on the searched field is: {'index_type': 'FAISS_HNSW_SQ', 'params': {'sq_type': 'SQ8'}, 'metric_type': 'IP'}


binbinlv added the kind/bug, triage/accepted, ci/bug, and 2.5-features labels on Nov 12, 2024
binbinlv added this to the 2.5.0 milestone on Nov 12, 2024

foxspy commented Nov 12, 2024

/assign

yanliang567 commented

This could be caused by the search parameter being renamed from group_strict_size to strict_group_size.

binbinlv commented

Verified and fixed: after changing "group_strict_size" to "strict_group_size" in the search call, the test passes.

milvus: master-20241114-1304b405-amd64
pymilvus: 2.5.0rc119
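
For test code that still passes the old keyword, one option is to translate it before calling search. The sketch below only manipulates a keyword dictionary and does not contact Milvus. `normalize_group_search_kwargs` and `LEGACY_TO_CURRENT` are hypothetical names introduced here for illustration; the only fact taken from this thread is that the flag is now called strict_group_size.

```python
# Hypothetical mapping: the only rename confirmed in this thread.
LEGACY_TO_CURRENT = {"group_strict_size": "strict_group_size"}


def normalize_group_search_kwargs(kwargs):
    """Return a copy of kwargs with the legacy grouping flag renamed."""
    fixed = dict(kwargs)
    for old, new in LEGACY_TO_CURRENT.items():
        if old in fixed:
            value = fixed.pop(old)
            # Refuse to guess when both spellings are given and disagree.
            if new in fixed and fixed[new] != value:
                raise ValueError(
                    f"both {old!r} and {new!r} given with different values"
                )
            fixed[new] = value
    return fixed


search_kwargs = normalize_group_search_kwargs({
    "limit": 50,
    "group_by_field": "VARCHAR",
    "group_size": 5,
    "group_strict_size": True,  # old spelling used by the failing test
    "output_fields": ["VARCHAR"],
})
print(search_kwargs)

# The normalized keywords would then be passed on, for example:
# collection.search(data=..., anns_field=..., param=..., **search_kwargs)
```

Raising on conflicting values rather than silently preferring one spelling is a design choice in this sketch. In the failure reported here, the old spelling raised no error and the search simply returned fewer results than expected.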
