
[Bug]: [benchmark][cluster][INVERTED] queryNode panics when loading INVERTED index file #29654

Closed
wangting0128 opened this issue Jan 3, 2024 · 4 comments
Labels: kind/bug, test/benchmark, triage/accepted
wangting0128 (Contributor) commented Jan 3, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20231231-3f46c6d4
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): 2.4.0rc12
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: fouramf-multi-vector-cgv69

test case: test_inverted_locust_hnsw_diskann_dml_dql_cluster

server:

NAME                                                              READY   STATUS             RESTARTS           AGE     IP              NODE         NOMINATED NODE   READINESS GATES
fouramf-multi-vr-cgv69-9-9756-etcd-0                              1/1     Running            0                  3h28m   10.104.27.53    4am-node31   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-etcd-1                              1/1     Running            0                  3h28m   10.104.28.181   4am-node33   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-etcd-2                              1/1     Running            0                  3h28m   10.104.26.117   4am-node32   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-datacoord-c55598bbc-kqnch    1/1     Running            0                  3h28m   10.104.15.23    4am-node20   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-datanode-664d7d9668-z78rz    1/1     Running            1 (3h24m ago)      3h28m   10.104.21.228   4am-node24   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-indexcoord-59c779bcf4jv8wc   1/1     Running            0                  3h28m   10.104.15.24    4am-node20   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-indexnode-fcd8cdbf9-7j5lv    1/1     Running            0                  3h28m   10.104.32.7     4am-node39   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-indexnode-fcd8cdbf9-dhs8b    1/1     Running            0                  3h28m   10.104.29.150   4am-node35   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-indexnode-fcd8cdbf9-v6r5t    1/1     Running            0                  3h28m   10.104.34.153   4am-node37   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-indexnode-fcd8cdbf9-vq4bv    1/1     Running            0                  3h28m   10.104.30.149   4am-node38   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-proxy-557f464ccb-xwbzf       1/1     Running            1 (3h23m ago)      3h28m   10.104.34.150   4am-node37   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-querycoord-cf47b4bc7-bmqrd   1/1     Running            1 (3h23m ago)      3h28m   10.104.15.22    4am-node20   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-querynode-7b5698dc67-pmk85   0/1     CrashLoopBackOff   31 (2m14s ago)     3h28m   10.104.15.25    4am-node20   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-querynode-7b5698dc67-tz98j   0/1     CrashLoopBackOff   31 (2m44s ago)     3h28m   10.104.33.32    4am-node36   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-milvus-rootcoord-746c4f49ff-kfhz7   1/1     Running            1 (3h26m ago)      3h28m   10.104.15.21    4am-node20   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-minio-0                             1/1     Running            0                  3h28m   10.104.28.177   4am-node33   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-minio-1                             1/1     Running            0                  3h28m   10.104.26.108   4am-node32   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-minio-2                             1/1     Running            0                  3h28m   10.104.25.208   4am-node30   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-minio-3                             1/1     Running            0                  3h28m   10.104.31.169   4am-node34   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-bookie-0                     1/1     Running            0                  3h28m   10.104.16.7     4am-node21   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-bookie-1                     1/1     Running            0                  3h28m   10.104.31.170   4am-node34   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-bookie-2                     1/1     Running            0                  3h28m   10.104.26.118   4am-node32   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-bookie-init-k4zk5            0/1     Completed          0                  3h28m   10.104.9.251    4am-node14   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-broker-0                     1/1     Running            0                  3h28m   10.104.9.252    4am-node14   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-proxy-0                      1/1     Running            0                  3h28m   10.104.1.226    4am-node10   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-pulsar-init-sjghc            0/1     Completed          0                  3h28m   10.104.9.250    4am-node14   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-recovery-0                   1/1     Running            0                  3h28m   10.104.6.178    4am-node13   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-zookeeper-0                  1/1     Running            0                  3h28m   10.104.28.176   4am-node33   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-zookeeper-1                  1/1     Running            0                  3h27m   10.104.27.57    4am-node31   <none>           <none>
fouramf-multi-vr-cgv69-9-9756-pulsar-zookeeper-2                  1/1     Running            0                  3h26m   10.104.25.218   4am-node30   <none>           <none>

fouramf-multi-vr-cgv69-9-9756-milvus-querynode-7b5698dc67-pmk85:

Containers:
  querynode:
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    134
      Started:      Wed, 03 Jan 2024 20:00:38 +0800
      Finished:     Wed, 03 Jan 2024 20:00:55 +0800
    Ready:          False
    Restart Count:  31
    Limits:
      cpu:     8
      memory:  64Gi
    Requests:
      cpu:        5
      memory:     33Gi
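Exit code 134 follows the usual container convention of 128 + signal number, which here means signal 6 (SIGABRT on Linux). That points to a native abort inside the queryNode process rather than an out-of-memory kill, which would typically show reason `OOMKilled` with exit code 137. The small helper below is a hypothetical sketch (not part of Milvus or Kubernetes) that decodes such codes using Python's standard `signal` module; signal numbers are platform-specific, so it assumes Linux.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a readable description.

    Assumes the Linux convention: codes above 128 mean the process was
    terminated by signal (code - 128).
    """
    if code > 128:
        sig = signal.Signals(code - 128)
        return f"killed by signal {sig.value} ({sig.name})"
    return f"exited with status {code}"


print(describe_exit_code(134))
```

On Linux, the crashing queryNode's code 134 maps to `SIGABRT`.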

queryNode panic log:
fouramf-multi-vr-cgv69-9-9756-milvus-querynode-7b5698dc67-tz98j.log

[Screenshot 2024-01-03 20:04:55]

client log:
The client gets stuck while loading the collection.
[Screenshot 2024-01-03 20:06:48]

Collection schema: {'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}}, {'name': 'float_vector_1', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 128}}, {'name': 'float_vector_2', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 200}}, {'name': 'float_vector_3', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 200}}, {'name': 'int8_1', 'description': '', 'type': <DataType.INT8: 2>}, {'name': 'int16_1', 'description': '', 'type': <DataType.INT16: 3>}, {'name': 'int32_1', 'description': '', 'type': <DataType.INT32: 4>}, {'name': 'int64_1', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'double_1', 'description': '', 'type': <DataType.DOUBLE: 11>}, {'name': 'float_1', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar_1', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 256}}, {'name': 'bool_1', 'description': '', 'type': <DataType.BOOL: 1>}, {'name': 'int8_2', 'description': '', 'type': <DataType.INT8: 2>}, {'name': 'int16_2', 'description': '', 'type': <DataType.INT16: 3>}, {'name': 'int32_2', 'description': '', 'type': <DataType.INT32: 4>}, {'name': 'int64_2', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'double_2', 'description': '', 'type': <DataType.DOUBLE: 11>}, {'name': 'float_2', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar_2', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 256}}, {'name': 'bool_2', 'description': '', 'type': <DataType.BOOL: 1>}]}

Expected Behavior

queryNode loads the INVERTED index files normally, without panicking.

Steps To Reproduce

1. Create a collection with the fields:
   'float_vector': 128 dim, 'float_vector_1': 128 dim, 'float_vector_2': 200 dim, 'float_vector_3': 200 dim,
   'int8_1', 'int16_1', 'int32_1', 'int64_1', 'double_1', 'float_1', 'varchar_1', 'bool_1',
   'int8_2', 'int16_2', 'int32_2', 'int64_2', 'double_2', 'float_2', 'varchar_2', 'bool_2'
2. Build indexes:
   HNSW: 'float_vector', 'float_vector_2'
   DISKANN (IP): 'float_vector_1'
   DISKANN (L2): 'float_vector_3'
   scalar default index: 'int8_1', 'int16_1', 'int32_1', 'int64_1', 'double_1', 'float_1', 'varchar_1'
   scalar INVERTED index: 'int8_2', 'int16_2', 'int32_2', 'int64_2', 'double_2', 'float_2', 'varchar_2', 'bool_2'
3. Insert 5M entities
4. Flush the collection
5. Build the indexes again with the same params
6. Load the collection
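The index layout from step 2 can be sketched in Python, since the SDK in this report is pymilvus. This is a minimal sketch, not the benchmark's actual code: `build_index_plan` and the constant names are hypothetical, the HNSW `M`/`efConstruction` values and the DISKANN metrics come from the client config below, and using an empty dict for the default scalar index is an assumption. It only builds the parameter dicts and makes no server calls.

```python
from typing import Dict, List, Tuple

# HNSW build params taken from the client config (M=8, efConstruction=200).
HNSW_PARAMS = {"M": 8, "efConstruction": 200}

# Hypothetical constants mirroring step 2; 'bool_1' is intentionally not indexed.
DEFAULT_SCALAR_FIELDS = ["int8_1", "int16_1", "int32_1", "int64_1",
                         "double_1", "float_1", "varchar_1"]
INVERTED_FIELDS = ["int8_2", "int16_2", "int32_2", "int64_2",
                   "double_2", "float_2", "varchar_2", "bool_2"]


def build_index_plan() -> List[Tuple[str, Dict]]:
    """Return (field_name, index_params) pairs for every indexed field."""
    plan: List[Tuple[str, Dict]] = [
        ("float_vector", {"index_type": "HNSW", "metric_type": "L2", "params": HNSW_PARAMS}),
        ("float_vector_2", {"index_type": "HNSW", "metric_type": "L2", "params": HNSW_PARAMS}),
        ("float_vector_1", {"index_type": "DISKANN", "metric_type": "IP", "params": {}}),
        ("float_vector_3", {"index_type": "DISKANN", "metric_type": "L2", "params": {}}),
    ]
    # Assumption: an empty dict lets the server choose its default scalar index.
    plan += [(name, {}) for name in DEFAULT_SCALAR_FIELDS]
    # The fields whose INVERTED index files the queryNode failed to load.
    plan += [(name, {"index_type": "INVERTED"}) for name in INVERTED_FIELDS]
    return plan


if __name__ == "__main__":
    for field, params in build_index_plan():
        print(field, params)
```

Against a live cluster, each `(field, params)` pair would be passed to pymilvus' `Collection.create_index(field, params)`, once in step 2 and again in step 5, before calling `Collection.load()`.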

Milvus Log

No response

Anything else?

client config:

{
     "dataset_params": {
          "metric_type": "L2",
          "dim": 128,
          "scalars_index": {
               "int8_1": {},
               "int16_1": {},
               "int32_1": {},
               "int64_1": {},
               "double_1": {},
               "float_1": {},
               "varchar_1": {},
               "int8_2": {
                    "index_type": "INVERTED"
               },
               "int16_2": {
                    "index_type": "INVERTED"
               },
               "int32_2": {
                    "index_type": "INVERTED"
               },
               "int64_2": {
                    "index_type": "INVERTED"
               },
               "double_2": {
                    "index_type": "INVERTED"
               },
               "float_2": {
                    "index_type": "INVERTED"
               },
               "varchar_2": {
                    "index_type": "INVERTED"
               },
               "bool_2": {
                    "index_type": "INVERTED"
               }
          },
          "vectors_index": {
               "float_vector_1": {
                    "index_type": "DISKANN",
                    "index_param": {},
                    "metric_type": "IP"
               },
               "float_vector_2": {
                    "index_type": "HNSW",
                    "index_param": {
                         "M": 8,
                         "efConstruction": 200
                    },
                    "metric_type": "L2"
               },
               "float_vector_3": {
                    "index_type": "DISKANN",
                    "index_param": {},
                    "metric_type": "L2"
               }
          },
          "scalars_params": {
               "float_vector_1": {
                    "params": {
                         "dim": 128
                    },
                    "other_params": {
                         "dataset": "sift",
                         "dim": 128
                    }
               },
               "float_vector_2": {
                    "params": {
                         "dim": 200
                    },
                    "other_params": {
                         "dataset": "text2img",
                         "dim": 200
                    }
               },
               "float_vector_3": {
                    "params": {
                         "dim": 200
                    },
                    "other_params": {
                         "dataset": "text2img",
                         "dim": 200
                    }
               }
          },
          "dataset_name": "sift",
          "dataset_size": 5000000,
          "ni_per": 5000
     },
     "collection_params": {
          "other_fields": [
               "float_vector_1",
               "float_vector_2",
               "float_vector_3",
               "int8_1",
               "int16_1",
               "int32_1",
               "int64_1",
               "double_1",
               "float_1",
               "varchar_1",
               "bool_1",
               "int8_2",
               "int16_2",
               "int32_2",
               "int64_2",
               "double_2",
               "float_2",
               "varchar_2",
               "bool_2"
          ],
          "shards_num": 2
     },
     "resource_groups_params": {
          "reset": false
     },
     "database_user_params": {
          "reset_rbac": false,
          "reset_db": false
     },
     "index_params": {
          "index_type": "HNSW",
          "index_param": {
               "M": 8,
               "efConstruction": 200
          }
     },
     "concurrent_params": {
          "concurrent_number": [
               20
          ],
          "during_time": "1h",
          "interval": 20
     },
     "concurrent_tasks": [
          {
               "type": "insert",
               "weight": 1,
               "params": {
                    "nb": 10,
                    "timeout": 30,
                    "random_id": true,
                    "random_vector": true,
                    "varchar_filled": false,
                    "start_id": 0
               }
          },
          {
               "type": "delete",
               "weight": 1,
               "params": {
                    "expr": "",
                    "delete_length": 9,
                    "timeout": 30
               }
          },
          {
               "type": "flush",
               "weight": 1,
               "params": {
                    "timeout": 30
               }
          },
          {
               "type": "load",
               "weight": 1,
               "params": {
                    "replica_number": 1,
                    "timeout": 30
               }
          },
          {
               "type": "query",
               "weight": 1,
               "params": {
                    "ids": null,
                    "expr": "int64_1 > -1 &&  int64_2 > -1 && ",
                    "output_fields": [
                         "*"
                    ],
                    "offset": null,
                    "limit": null,
                    "ignore_growing": false,
                    "timeout": 60,
                    "random_data": true,
                    "random_count": 20,
                    "random_range": [
                         2500000,
                         5000000
                    ],
                    "field_name": "id",
                    "field_type": "int64"
               }
          }
     ]
}
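The `query` task's `expr` ends with a dangling `&& `, and its `random_data`, `random_count`, `random_range` and `field_name` params suggest that the benchmark tool appends a random clause on `id` at runtime. The tool itself is not shown in this issue, so the sketch below is only an assumption about that behavior: `complete_expr` is a hypothetical helper that samples `random_count` distinct ids from the half-open range `[2500000, 5000000)` and appends an `id in [...]` filter.

```python
import random
from typing import List


def complete_expr(base_expr: str, field_name: str, count: int,
                  low: int, high: int, rng: random.Random) -> str:
    """Append an `in` filter on field_name with `count` distinct ids in [low, high)."""
    ids: List[int] = rng.sample(range(low, high), count)
    return f"{base_expr}{field_name} in {ids}"


expr = complete_expr("int64_1 > -1 &&  int64_2 > -1 && ", "id", 20,
                     2500000, 5000000, random.Random(0))
print(expr)
```

Under that assumption, every query filters on both `int64_1` (default scalar index) and `int64_2` (INVERTED index), so it needs the INVERTED index files to be loaded.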

server config:

{
     "queryNode": {
          "resources": {
               "limits": {
                    "cpu": "8.0",
                    "memory": "64Gi"
               },
               "requests": {
                    "cpu": "5.0",
                    "memory": "33Gi"
               }
          },
          "replicas": 2
     },
     "indexNode": {
          "resources": {
               "limits": {
                    "cpu": "8.0",
                    "memory": "16Gi"
               },
               "requests": {
                    "cpu": "5.0",
                    "memory": "9Gi"
               }
          },
          "replicas": 4
     },
     "dataNode": {
          "resources": {
               "limits": {
                    "cpu": "8.0",
                    "memory": "16Gi"
               },
               "requests": {
                    "cpu": "5.0",
                    "memory": "9Gi"
               }
          }
     },
     "cluster": {
          "enabled": true
     },
     "pulsar": {},
     "kafka": {},
     "minio": {
          "metrics": {
               "podMonitor": {
                    "enabled": true
               }
          }
     },
     "etcd": {
          "metrics": {
               "enabled": true,
               "podMonitor": {
                    "enabled": true
               }
          }
     },
     "metrics": {
          "serviceMonitor": {
               "enabled": true
          }
     },
     "log": {
          "level": "debug"
     },
     "image": {
          "all": {
               "repository": "harbor.milvus.io/milvus/milvus",
               "tag": "master-20231231-3f46c6d4"
          }
     }
}
wangting0128 added the kind/bug and needs-triage labels on Jan 3, 2024
wangting0128 added this to the 2.4.0 milestone on Jan 3, 2024
wangting0128 added the test/benchmark label on Jan 3, 2024
yanliang567 removed their assignment on Jan 4, 2024
yanliang567 added the triage/accepted label and removed the needs-triage label on Jan 4, 2024
xiaofan-luan (Collaborator):

/assign @longjiquan

longjiquan (Contributor):

Working on this.

sre-ci-robot pushed a commit that referenced this issue Jan 7, 2024
issue: #29654

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
longjiquan (Contributor):

Fixed, please verify.
/assign @wangting0128

wangting0128 (Contributor, Author):

Verification passed.

milvus image: master-20240109-2f702ad3
argo task: inverted-corn-mbgtf
