[Bug]: [benchmark][streamingNode] queryNode OOM in concurrent DQL & DML scenario with shard_num=16 #36760
Closed
Description
Is there an existing issue for this?
- I have searched the existing issues
Environment
- Milvus version:master-20241009-c3d91075-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.5rc7
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
argo task: fouramf-997f4
test case name: test_bitmap_locust_shard16_dql_cluster
server:
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
wt-streaming-node-shard16-etcd-0 1/1 Running 0 4m41s 10.104.23.87 4am-node27 <none> <none>
wt-streaming-node-shard16-etcd-1 1/1 Running 0 4m41s 10.104.17.55 4am-node23 <none> <none>
wt-streaming-node-shard16-etcd-2 1/1 Running 0 4m41s 10.104.18.208 4am-node25 <none> <none>
wt-streaming-node-shard16-milvus-datanode-8454b67476-brprz 1/1 Running 1 (4m15s ago) 4m41s 10.104.4.36 4am-node11 <none> <none>
wt-streaming-node-shard16-milvus-indexnode-65b4459795-72lxv 1/1 Running 1 (4m14s ago) 4m41s 10.104.5.12 4am-node12 <none> <none>
wt-streaming-node-shard16-milvus-indexnode-65b4459795-8g8xx 1/1 Running 1 (4m14s ago) 4m41s 10.104.9.153 4am-node14 <none> <none>
wt-streaming-node-shard16-milvus-indexnode-65b4459795-dhhdv 1/1 Running 1 (4m13s ago) 4m41s 10.104.6.200 4am-node13 <none> <none>
wt-streaming-node-shard16-milvus-indexnode-65b4459795-pmgxb 1/1 Running 1 (4m15s ago) 4m41s 10.104.20.135 4am-node22 <none> <none>
wt-streaming-node-shard16-milvus-mixcoord-7f59df8b-svbsz 1/1 Running 1 (4m13s ago) 4m41s 10.104.14.107 4am-node18 <none> <none>
wt-streaming-node-shard16-milvus-proxy-f5db84688-tzkv7 1/1 Running 1 (4m13s ago) 4m41s 10.104.6.199 4am-node13 <none> <none>
wt-streaming-node-shard16-milvus-querynode-85cc984c7c-8mc28 1/1 Running 0 4m41s 10.104.33.234 4am-node36 <none> <none>
wt-streaming-node-shard16-milvus-querynode-85cc984c7c-k2nc5 1/1 Running 1 (4m13s ago) 4m41s 10.104.14.109 4am-node18 <none> <none>
wt-streaming-node-shard16-milvus-streamingnode-6c5b5fc984-m2d7g 1/1 Running 1 (4m14s ago) 4m41s 10.104.5.9 4am-node12 <none> <none>
wt-streaming-node-shard16-minio-0 1/1 Running 0 4m41s 10.104.17.54 4am-node23 <none> <none>
wt-streaming-node-shard16-minio-1 1/1 Running 0 4m41s 10.104.34.154 4am-node37 <none> <none>
wt-streaming-node-shard16-minio-2 1/1 Running 0 4m40s 10.104.23.91 4am-node27 <none> <none>
wt-streaming-node-shard16-minio-3 1/1 Running 0 4m40s 10.104.19.23 4am-node28 <none> <none>
wt-streaming-node-shard16-pulsar-bookie-0 1/1 Running 0 4m41s 10.104.18.207 4am-node25 <none> <none>
wt-streaming-node-shard16-pulsar-bookie-1 1/1 Running 0 4m41s 10.104.25.25 4am-node30 <none> <none>
wt-streaming-node-shard16-pulsar-bookie-2 1/1 Running 0 4m40s 10.104.23.92 4am-node27 <none> <none>
wt-streaming-node-shard16-pulsar-bookie-init-4g78r 0/1 Completed 0 4m41s 10.104.1.74 4am-node10 <none> <none>
wt-streaming-node-shard16-pulsar-broker-0 1/1 Running 0 4m41s 10.104.13.166 4am-node16 <none> <none>
wt-streaming-node-shard16-pulsar-proxy-0 1/1 Running 0 4m41s 10.104.13.165 4am-node16 <none> <none>
wt-streaming-node-shard16-pulsar-pulsar-init-ldf2s 0/1 Completed 0 4m41s 10.104.34.152 4am-node37 <none> <none>
wt-streaming-node-shard16-pulsar-recovery-0 1/1 Running 0 4m41s 10.104.1.73 4am-node10 <none> <none>
wt-streaming-node-shard16-pulsar-zookeeper-0 1/1 Running 0 4m41s 10.104.19.18 4am-node28 <none> <none>
wt-streaming-node-shard16-pulsar-zookeeper-1 1/1 Running 0 4m 10.104.30.109 4am-node38 <none> <none>
wt-streaming-node-shard16-pulsar-zookeeper-2 1/1 Running 0 3m25s 10.104.34.156 4am-node37 <none> <none>
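To make the restart pattern in the listing above easier to scan, here is a minimal sketch that extracts pods with a non-zero RESTARTS count. It assumes the single-space layout pasted here; the `restarted_pods` helper and its regex are illustrative, not part of the test framework, and only a few rows are copied in as sample input.

```python
import re

# A few rows copied verbatim from the pod listing above.
POD_LINES = """\
wt-streaming-node-shard16-etcd-0 1/1 Running 0 4m41s 10.104.23.87 4am-node27 <none> <none>
wt-streaming-node-shard16-milvus-querynode-85cc984c7c-8mc28 1/1 Running 0 4m41s 10.104.33.234 4am-node36 <none> <none>
wt-streaming-node-shard16-milvus-querynode-85cc984c7c-k2nc5 1/1 Running 1 (4m13s ago) 4m41s 10.104.14.109 4am-node18 <none> <none>
wt-streaming-node-shard16-milvus-streamingnode-6c5b5fc984-m2d7g 1/1 Running 1 (4m14s ago) 4m41s 10.104.5.9 4am-node12 <none> <none>
""".splitlines()

# NAME READY STATUS RESTARTS [(<last restart> ago)] AGE ...
ROW = re.compile(
    r"^(?P<name>\S+)\s+(?P<ready>\d+/\d+)\s+(?P<status>\S+)\s+"
    r"(?P<restarts>\d+)(?:\s+\((?P<last>[^)]*) ago\))?\s+(?P<age>\S+)"
)


def restarted_pods(lines):
    """Map pod name -> (restart count, time since last restart) for restarted pods."""
    found = {}
    for line in lines:
        match = ROW.match(line)
        if match and int(match["restarts"]) > 0:
            found[match["name"]] = (int(match["restarts"]), match["last"])
    return found


for name, (count, last) in restarted_pods(POD_LINES).items():
    print(f"{name}: {count} restart(s), last {last} ago")
```

Note that this snapshot was taken about 4m41s after deployment, so the restarts it shows happened during startup; it does not by itself show the later OOM.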
Comparison and verification: with streamingNode disabled, queryNode does not OOM.
Expected Behavior
No response
Steps To Reproduce
concurrent test that measures RT and QPS
:purpose: `primary key: INT64`, shard_num=16, DQL without expr
1. build `BITMAP` indexes on all 12 supported scalar fields, hybrid index on the INT64 primary key field
2. build `INVERTED`, `Trie`, or `STL_SORT` indexes on the other 22 scalar fields
3. 2 fields of different vector types
4. search for different expressions on BITMAP index fields
:test steps:
1. create collection with fields:
'float_vector': 128dim
'sparse_float_vector': sparse_range=[1, 100] <- the range of non-zero values of a sparse vector
'id': primary key type is INT64
all scalar fields: varchar max_length=100, array max_capacity=11
2. build indexes:
IVF_SQ8: 'float_vector'
SPARSE_WAND: 'sparse_float_vector'
default scalar index: 'id'
BITMAP: '*_1' all supported field names
INVERTED: 'array_float_1', 'array_double_1', 'float_2', 'double_2', 'bool_2', 'array_int8_2',
'array_int16_2', 'array_int32_2', 'array_int64_2', 'array_varchar_2', 'array_bool_2',
'array_float_2', 'array_double_2'
Trie: 'varchar_2'
STL_SORT: 'float_1', 'double_1', 'int8_2', 'int16_2', 'int32_2', 'int64_2'
3. insert 5 million rows
4. flush collection
5. build indexes again using the same params
6. load collection
7. concurrent request:
- search
- query
- hybrid_search
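The scalar index assignment in step 2 can be written out as a plain mapping, which makes it easy to check the per-type counts. This is a sketch only: `build_scalar_index_plan` is a hypothetical helper, not the benchmark's own code, and the field names are taken from the list in step 2.

```python
from collections import Counter

# Element types that get a BITMAP index on their "*_1" field in step 2.
BITMAP_TYPES = ["int8", "int16", "int32", "int64", "varchar", "bool"]

INVERTED_FIELDS = [
    "array_float_1", "array_double_1", "float_2", "double_2", "bool_2",
    "array_int8_2", "array_int16_2", "array_int32_2", "array_int64_2",
    "array_varchar_2", "array_bool_2", "array_float_2", "array_double_2",
]
STL_SORT_FIELDS = ["float_1", "double_1", "int8_2", "int16_2", "int32_2", "int64_2"]


def build_scalar_index_plan():
    """Return {field_name: index_type} for the scalar fields named in step 2."""
    plan = {}
    for elem_type in BITMAP_TYPES:
        # Both the plain field and its array variant use BITMAP.
        plan[f"{elem_type}_1"] = "BITMAP"
        plan[f"array_{elem_type}_1"] = "BITMAP"
    for name in INVERTED_FIELDS:
        plan[name] = "INVERTED"
    plan["varchar_2"] = "Trie"
    for name in STL_SORT_FIELDS:
        plan[name] = "STL_SORT"
    return plan


plan = build_scalar_index_plan()
print(Counter(plan.values()))
print(len(plan), "indexed scalar fields")
```

With pymilvus, each entry would presumably be applied through `Collection.create_index(field_name, {"index_type": ...})`; whether the benchmark does exactly that is an assumption.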
Milvus Log
No response
Anything else?
test result:
[2024-10-09 10:24:51,502 - INFO - fouram]: Type Name # reqs # fails | Avg Min Max Med | req/s failures/s (stats.py:789)
[2024-10-09 10:24:51,502 - INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-09 10:24:51,502 - INFO - fouram]: grpc hybrid_search 986 1(0.10%) | 5905 0 54056 3300 | 0.55 0.00 (stats.py:789)
[2024-10-09 10:24:51,502 - INFO - fouram]: grpc query 968 3(0.31%) | 37217 0 110610 30000 | 0.54 0.00 (stats.py:789)
[2024-10-09 10:24:51,502 - INFO - fouram]: grpc search 990 23(2.32%) | 9085 0 31028 6100 | 0.55 0.01 (stats.py:789)
[2024-10-09 10:24:51,502 - INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-09 10:24:51,502 - INFO - fouram]: Aggregated 2944 27(0.92%) | 17270 0 110610 9300 | 1.63 0.01 (stats.py:789)
[2024-10-09 10:24:51,502 - INFO - fouram]: (stats.py:790)
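As a sanity check on the table above, the failure percentages and the aggregated row can be recomputed from the per-task request and failure counts. The `failure_pct` helper is illustrative; the numbers are copied from the log lines.

```python
# (requests, failures) per task, copied from the Locust table above.
STATS = {
    "hybrid_search": (986, 1),
    "query": (968, 3),
    "search": (990, 23),
}


def failure_pct(reqs, fails):
    """Failure rate as a percentage, rounded to two decimals like the table."""
    return round(100.0 * fails / reqs, 2)


total_reqs = sum(reqs for reqs, _ in STATS.values())
total_fails = sum(fails for _, fails in STATS.values())

for task, (reqs, fails) in STATS.items():
    print(f"{task}: {failure_pct(reqs, fails)}%")
print(f"Aggregated: {total_reqs} reqs, {total_fails} fails, "
      f"{failure_pct(total_reqs, total_fails)}%")
```

The recomputed values agree with the logged 0.10%, 0.31%, 2.32%, and 0.92% aggregated.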
[2024-10-09 10:24:51,507 - INFO - fouram]: [PerfTemplate] Report data:
{'server': {'deploy_tool': 'helm',
'deploy_mode': 'cluster',
'config_name': 'cluster_8c16m',
'config': {'queryNode': {'resources': {'limits': {'cpu': '32.0', 'memory': '16Gi'}, 'requests': {'cpu': '17.0', 'memory': '9Gi'}}, 'replicas': 2},
'indexNode': {'resources': {'limits': {'cpu': '4.0', 'memory': '8Gi'}, 'requests': {'cpu': '3.0', 'memory': '5Gi'}}, 'replicas': 4},
'dataNode': {'resources': {'limits': {'cpu': '8.0', 'memory': '16Gi'}, 'requests': {'cpu': '5.0', 'memory': '9Gi'}}},
'cluster': {'enabled': True},
'pulsar': {'enabled': True},
'kafka': {},
'minio': {'metrics': {'podMonitor': {'enabled': True}}},
'etcd': {'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
'metrics': {'serviceMonitor': {'enabled': True}},
'log': {'level': 'debug'},
'standalone': {'messageQueue': 'pulsar'},
'streaming': {'enabled': True},
'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': 'master-20241009-c3d91075-amd64'}}},
'host': 'wt-streaming-node-shard16-milvus.qa-milvus.svc.cluster.local',
'port': '19530',
'uri': ''},
'client': {'test_case_type': 'ConcurrentClientBase',
'test_case_name': 'test_bitmap_locust_shard16_dql_cluster',
'test_case_params': {'dataset_params': {'metric_type': 'L2',
'dim': 128,
'max_length': 100,
'scalars_index': {'int8_1': {'index_type': 'BITMAP'},
'int16_1': {'index_type': 'BITMAP'},
'int32_1': {'index_type': 'BITMAP'},
'int64_1': {'index_type': 'BITMAP'},
'varchar_1': {'index_type': 'BITMAP'},
'bool_1': {'index_type': 'BITMAP'},
'array_int8_1': {'index_type': 'BITMAP'},
'array_int16_1': {'index_type': 'BITMAP'},
'array_int32_1': {'index_type': 'BITMAP'},
'array_int64_1': {'index_type': 'BITMAP'},
'array_varchar_1': {'index_type': 'BITMAP'},
'array_bool_1': {'index_type': 'BITMAP'},
'array_float_1': {'index_type': 'INVERTED'},
'array_double_1': {'index_type': 'INVERTED'},
'float_2': {'index_type': 'INVERTED'},
'double_2': {'index_type': 'INVERTED'},
'bool_2': {'index_type': 'INVERTED'},
'array_int8_2': {'index_type': 'INVERTED'},
'array_int16_2': {'index_type': 'INVERTED'},
'array_int32_2': {'index_type': 'INVERTED'},
'array_int64_2': {'index_type': 'INVERTED'},
'array_varchar_2': {'index_type': 'INVERTED'},
'array_bool_2': {'index_type': 'INVERTED'},
'array_float_2': {'index_type': 'INVERTED'},
'array_double_2': {'index_type': 'INVERTED'},
'varchar_2': {'index_type': 'Trie'},
'float_1': {'index_type': 'STL_SORT'},
'double_1': {'index_type': 'STL_SORT'},
'int8_2': {'index_type': 'STL_SORT'},
'int16_2': {'index_type': 'STL_SORT'},
'int32_2': {'index_type': 'STL_SORT'},
'int64_2': {'index_type': 'STL_SORT'}},
'vectors_index': {'sparse_float_vector': {'index_type': 'SPARSE_INVERTED_INDEX',
'index_param': {'drop_ratio_build': 0.2},
'metric_type': 'IP'}},
'scalars_params': {'array_int8_1': {'params': {'max_capacity': 11},
'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'array_int16_1': {'params': {'max_capacity': 11},
'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'array_int32_1': {'params': {'max_capacity': 11},
'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'array_int64_1': {'params': {'max_capacity': 11},
'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'array_double_1': {'params': {'max_capacity': 11}},
'array_float_1': {'params': {'max_capacity': 11}},
'array_varchar_1': {'params': {'max_capacity': 11},
'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'array_bool_1': {'params': {'max_capacity': 11},
'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'array_int8_2': {'params': {'max_capacity': 11}},
'array_int16_2': {'params': {'max_capacity': 11}},
'array_int32_2': {'params': {'max_capacity': 11}},
'array_int64_2': {'params': {'max_capacity': 11}},
'array_double_2': {'params': {'max_capacity': 11}},
'array_float_2': {'params': {'max_capacity': 11}},
'array_varchar_2': {'params': {'max_capacity': 11}},
'array_bool_2': {'params': {'max_capacity': 11}},
'int8_1': {'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'int16_1': {'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'int32_1': {'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'int64_1': {'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'varchar_1': {'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}},
'bool_1': {'other_params': {'dataset': 'random_algorithm',
'algorithm_params': {'algorithm_name': 'random_range',
'specify_range': [-2500, 2500],
'max_capacity': 9}}}},
'dataset_name': 'sift',
'dataset_size': 5000000,
'ni_per': 5000},
'collection_params': {'other_fields': ['sparse_float_vector', 'int8_1', 'int16_1', 'int32_1', 'int64_1', 'double_1', 'float_1',
'varchar_1', 'bool_1', 'json_1', 'array_int8_1', 'array_int16_1', 'array_int32_1',
'array_int64_1', 'array_double_1', 'array_float_1', 'array_varchar_1', 'array_bool_1',
'int8_2', 'int16_2', 'int32_2', 'int64_2', 'double_2', 'float_2', 'varchar_2', 'bool_2',
'json_2', 'array_int8_2', 'array_int16_2', 'array_int32_2', 'array_int64_2',
'array_double_2', 'array_float_2', 'array_varchar_2', 'array_bool_2'],
'shards_num': 16},
'flush_params': {'prepare_flush': False},
'resource_groups_params': {'reset': False},
'database_user_params': {'reset_rbac': False, 'reset_db': False},
'index_params': {'index_type': 'IVF_SQ8', 'index_param': {'nlist': 1024}},
'concurrent_params': {'concurrent_number': 30, 'during_time': '30m', 'interval': 20, 'spawn_rate': None},
'concurrent_tasks': [{'type': 'search',
'weight': 1,
'params': {'nq': 1000,
'top_k': 10,
'search_param': {'nprobe': 16},
'expr': 'id >= 100',
'guarantee_timestamp': None,
'partition_names': None,
'output_fields': ['*'],
'ignore_growing': False,
'group_by_field': None,
'timeout': 30,
'random_data': True,
'check_task': 'check_search_output',
'check_items': {'output_fields': ['sparse_float_vector', 'int8_1', 'int16_1', 'int32_1',
'int64_1', 'double_1', 'float_1', 'varchar_1', 'bool_1',
'json_1', 'array_int8_1', 'array_int16_1', 'array_int32_1',
'array_int64_1', 'array_double_1', 'array_float_1',
'array_varchar_1', 'array_bool_1', 'int8_2', 'int16_2',
'int32_2', 'int64_2', 'double_2', 'float_2', 'varchar_2',
'bool_2', 'json_2', 'array_int8_2', 'array_int16_2',
'array_int32_2', 'array_int64_2', 'array_double_2',
'array_float_2', 'array_varchar_2', 'array_bool_2', 'id',
'float_vector'],
'nq': 1000}}},
{'type': 'query',
'weight': 1,
'params': {'ids': None,
'expr': 'id > -1 && ',
'output_fields': ['id', 'float_vector', 'int64_1'],
'offset': None,
'limit': None,
'ignore_growing': False,
'partition_names': None,
'timeout': 30,
'consistency_level': None,
'random_data': True,
'random_count': 10,
'random_range': [0, 5000000],
'field_name': 'id',
'field_type': 'int64',
'check_task': 'check_query_output',
'check_items': None}},
{'type': 'hybrid_search',
'weight': 1,
'params': {'nq': 10,
'top_k': 10,
'reqs': [{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
{'search_param': {'drop_ratio_search': 0.1}, 'anns_field': 'sparse_float_vector'},
{'search_param': {'drop_ratio_search': 0.1}, 'anns_field': 'sparse_float_vector'}],
'rerank': {'RRFRanker': []},
'output_fields': ['*'],
'ignore_growing': False,
'guarantee_timestamp': None,
'partition_names': None,
'timeout': 1800,
'random_data': True,
'check_task': 'check_search_output',
'check_items': {'output_fields': ['sparse_float_vector', 'int8_1', 'int16_1', 'int32_1',
'int64_1', 'double_1', 'float_1', 'varchar_1', 'bool_1',
'json_1', 'array_int8_1', 'array_int16_1', 'array_int32_1',
'array_int64_1', 'array_double_1', 'array_float_1',
'array_varchar_1', 'array_bool_1', 'int8_2', 'int16_2',
'int32_2', 'int64_2', 'double_2', 'float_2', 'varchar_2',
'bool_2', 'json_2', 'array_int8_2', 'array_int16_2',
'array_int32_2', 'array_int64_2', 'array_double_2',
'array_float_2', 'array_varchar_2', 'array_bool_2', 'id',
'float_vector'],
'nq': 10}}}]},
'run_id': 2024100961955372,
'datetime': '2024-10-09 09:29:55.997726',
'client_version': '2.2'},
'result': {'test_result': {'index': {'RT': 0.5157,
'sparse_float_vector': {'RT': 0.5148},
'int8_1': {'RT': 0.5152},
'int16_1': {'RT': 0.5167},
'int32_1': {'RT': 0.5166},
'int64_1': {'RT': 0.5169},
'varchar_1': {'RT': 0.5146},
'bool_1': {'RT': 0.5152},
'array_int8_1': {'RT': 0.5238},
'array_int16_1': {'RT': 0.514},
'array_int32_1': {'RT': 0.5134},
'array_int64_1': {'RT': 0.5158},
'array_varchar_1': {'RT': 0.5152},
'array_bool_1': {'RT': 0.5139},
'array_float_1': {'RT': 0.5147},
'array_double_1': {'RT': 0.5146},
'float_2': {'RT': 0.5157},
'double_2': {'RT': 0.5149},
'bool_2': {'RT': 0.6093},
'array_int8_2': {'RT': 0.5146},
'array_int16_2': {'RT': 0.5329},
'array_int32_2': {'RT': 0.5178},
'array_int64_2': {'RT': 0.5143},
'array_varchar_2': {'RT': 0.5147},
'array_bool_2': {'RT': 0.5172},
'array_float_2': {'RT': 0.5156},
'array_double_2': {'RT': 0.5146},
'varchar_2': {'RT': 0.5159},
'float_1': {'RT': 0.5173},
'double_1': {'RT': 0.5152},
'int8_2': {'RT': 0.5141},
'int16_2': {'RT': 0.5143},
'int32_2': {'RT': 0.5158},
'int64_2': {'RT': 0.5156}},
'insert': {'total_time': 882.4758, 'VPS': 5665.8777, 'batch_time': 0.8825, 'batch': 5000},
'load': {'RT': 8.1154},
'Locust': {'Aggregated': {'Requests': 2944,
'Fails': 27,
'RPS': 1.63,
'fail_s': 0.01,
'RT_max': 110610.16,
'RT_avg': 17270.34,
'TP50': 9300.0,
'TP99': 86000.0},
'hybrid_search': {'Requests': 986,
'Fails': 1,
'RPS': 0.55,
'fail_s': 0.0,
'RT_max': 54056.68,
'RT_avg': 5905.07,
'TP50': 3300.0,
'TP99': 34000.0},
'query': {'Requests': 968,
'Fails': 3,
'RPS': 0.54,
'fail_s': 0.0,
'RT_max': 110610.16,
'RT_avg': 37217.49,
'TP50': 30000.0,
'TP99': 102000.0},
'search': {'Requests': 990,
'Fails': 23,
'RPS': 0.55,
'fail_s': 0.02,
'RT_max': 31028.89,
'RT_avg': 9085.81,
'TP50': 6100.0,
'TP99': 29000.0}}}}}
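One way to read the Locust numbers in the report is to compare them against the per-task `timeout` values in `concurrent_tasks` (30 s for search and query, 1800 s for hybrid_search). The sketch below does that; it assumes the RT values are in milliseconds and the timeouts in seconds, and how the framework measures RT relative to the client timeout is not stated in the report, so treat the output as a reading aid rather than a diagnosis.

```python
# Client-side timeouts in seconds, from 'concurrent_tasks' in the report.
TIMEOUT_S = {"search": 30, "query": 30, "hybrid_search": 1800}

# Latency figures from the 'Locust' section of the report (assumed milliseconds).
LATENCY_MS = {
    "search": {"TP50": 6100.0, "TP99": 29000.0, "RT_max": 31028.89},
    "query": {"TP50": 30000.0, "TP99": 102000.0, "RT_max": 110610.16},
    "hybrid_search": {"TP50": 3300.0, "TP99": 34000.0, "RT_max": 54056.68},
}


def stats_at_or_over_timeout(timeouts_s, latency_ms):
    """Return {task: [stat names whose value reaches the task's timeout]}."""
    flagged = {}
    for task, stats in latency_ms.items():
        limit_ms = timeouts_s[task] * 1000
        over = [name for name, value in stats.items() if value >= limit_ms]
        if over:
            flagged[task] = over
    return flagged


for task, names in stats_at_or_over_timeout(TIMEOUT_S, LATENCY_MS).items():
    print(f"{task}: {', '.join(names)} at or above {TIMEOUT_S[task]} s timeout")
```

On these numbers, the query median already sits at the 30 s timeout, while search only reaches it at the maximum and hybrid_search stays far below its 1800 s limit.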