
[Bug]: [perf-nightly] Milvus failed to load 1 million rows of 768d data, raising the error "collection has not been loaded" #26131

Closed
jingkl opened this issue Aug 4, 2023 · 5 comments

jingkl (Contributor) commented Aug 4, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20230803-e87a9147
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Argo:
release_name_prefix: perf-single-1691026200
deploy_config: fouramf-server-standalone-8c16m
case_params: fouramf-client-gist1m-concurrent-hnsw

Test params:
[2023-08-03 01:55:27,891 - INFO - fouram]: [check_params] scene_concurrent_locust required params: {'dataset_params': {'metric_type': 'L2', 'dim': 768, 'dataset_name': 'gist', 'dataset_size': '1m', 'ni_per': 1000}, 'collection_params': {'other_fields': []}, 'load_params': {}, 'query_params': {}, 'search_params': {}, 'index_params': {'index_type': 'HNSW', 'index_param': {'M': 8, 'efConstruction': 200}}, 'concurrent_params': {'concurrent_number': [1, 20, 50, 100], 'during_time': 3600, 'interval': 20}, 'concurrent_tasks': [{'type': 'search', 'weight': 1, 'params': {'nq': 1, 'top_k': 1, 'search_param': {'ef': 16}, 'random_data': True}}]} (params_check.py:31)
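
For reference, a minimal pymilvus sketch of what one concurrent `search` task from the params above would issue (nq=1, top_k=1, HNSW with ef=16, random query vectors). The connection address is an assumption; the collection name is taken from the client log below. This only runs once the load succeeds, which is exactly what fails here.

```python
import numpy as np
from pymilvus import Collection, connections

connections.connect(host="127.0.0.1", port="19530")   # assumed address
collection = Collection("fouram_aaQgTG0U")             # name from the client log below

# random_data: True, dim=768, nq=1
query = np.random.random((1, 768)).tolist()
results = collection.search(
    data=query,
    anns_field="float_vector",
    param={"metric_type": "L2", "params": {"ef": 16}},  # search_param from the test
    limit=1,                                            # top_k
)
```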

client log:

[2023-08-03 02:06:11,156 -  INFO - fouram]: [CommonCases] RT of build index HNSW: 170.6812s (common_cases.py:96)
[2023-08-03 02:06:11,158 -  INFO - fouram]: [Base] Params of index: [{'float_vector': {'index_type': 'HNSW', 'metric_type': 'L2', 'params': {'M': 8, 'efConstruction': 200}}}] (base.py:456)
[2023-08-03 02:06:11,158 -  INFO - fouram]: [CommonCases] Prepare index HNSW done. (common_cases.py:99)
[2023-08-03 02:06:11,158 -  INFO - fouram]: [CommonCases] No scalars need to be indexed. (common_cases.py:107)
[2023-08-03 02:06:11,159 -  INFO - fouram]: [Base] Number of vectors in the collection(fouram_aaQgTG0U): 1000000 (base.py:483)
[2023-08-03 02:06:11,159 -  INFO - fouram]: [Base] Start load collection fouram_aaQgTG0U,replica_number:1,kwargs:{} (base.py:298)
[2023-08-03 02:06:44,352 - WARNING - fouram]: [get_loading_progress] retry:4, cost: 0.27s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.92.113:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-08-03 02:06:44,624 - WARNING - fouram]: [get_loading_progress] retry:5, cost: 0.81s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.92.113:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-08-03 02:06:45,436 - WARNING - fouram]: [get_loading_progress] retry:6, cost: 2.43s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.92.113:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-08-03 02:06:47,869 - WARNING - fouram]: [get_loading_progress] retry:7, cost: 7.29s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.92.113:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-08-03 02:06:55,167 - WARNING - fouram]: [get_loading_progress] retry:8, cost: 21.87s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.92.113:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-08-03 02:07:17,064 - WARNING - fouram]: [get_loading_progress] retry:9, cost: 60.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.92.113:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-08-03 02:08:17,072 - WARNING - fouram]: [get_loading_progress] retry:10, cost: 60.00s, reason: <_MultiThreadedRendezvous: StatusCode.UNAVAILABLE, failed to connect to all addresses; last error: UNKNOWN: ipv4:10.255.92.113:19530: Failed to connect to remote host: Connection refused> (decorators.py:71)
[2023-08-03 02:09:17,135 - WARNING - fouram]: Retry run out of 10 retry times (decorators.py:79)
[2023-08-03 02:09:17,135 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=1, message=Retry run out of 10 retry times, message=collection 443292760107122934 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-08-03 02:06:44.044281', 'RPC error': '2023-08-03 02:09:17.135308'}> (decorators.py:108)
[2023-08-03 02:09:17,136 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=1, message=Retry run out of 10 retry times, message=collection 443292760107122934 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-08-03 02:06:11.166348', 'RPC error': '2023-08-03 02:09:17.136060'}> (decorators.py:108)
[2023-08-03 02:09:17,136 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=1, message=Retry run out of 10 retry times, message=collection 443292760107122934 has not been loaded to memory or load failed)>, <Time:{'RPC start': '2023-08-03 02:06:11.159696', 'RPC error': '2023-08-03 02:09:17.136173'}> (decorators.py:108)
[2023-08-03 02:09:17,137 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=Retry run out of 10 retry times, message=collection 443292760107122934 has not been loaded to memory or load failed)> (api_request.py:53)
[2023-08-03 02:09:17,137 - ERROR - fouram]: [CheckFunc] load request check failed, response:<MilvusException: (code=1, message=Retry run out of 10 retry times, message=collection 443292760107122934 has not been loaded to memory or load failed)>
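
As a hedged sketch, the failing RPCs named in brackets above map roughly onto the public pymilvus API as follows (the connection address is an assumption; the retries come from the progress polling hitting "Connection refused" after the standalone pod restarted):

```python
from pymilvus import Collection, connections, utility

connections.connect(host="127.0.0.1", port="19530")    # assumed address
collection = Collection("fouram_aaQgTG0U")

# [load_collection]: issue the load request; the SDK then waits for completion
collection.load(replica_number=1)

# [get_loading_progress] / [wait_for_loading_collection]: the SDK polls the
# server until loading reaches 100% or the retry budget is exhausted
print(utility.loading_progress("fouram_aaQgTG0U"))
utility.wait_for_loading_complete("fouram_aaQgTG0U")
```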

server:

NAME                                                              READY   STATUS                   RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
perf-single-16926200-2-99-7819-etcd-0                             1/1     Running                  0                 2m34s   10.104.15.73    4am-node20   <none>           <none>
perf-single-16926200-2-99-7819-milvus-standalone-dd55787496vqwm   1/1     Running                  0                 2m34s   10.104.9.220    4am-node14   <none>           <none>
perf-single-16926200-2-99-7819-minio-9975d55f-sc5wm               1/1     Running                  0                 2m34s   10.104.15.71    4am-node20   <none>           <none> (base.py:218)
[2023-08-03 02:09:17,320 -  INFO - fouram]: [Cmd Exe]  kubectl get pods  -n qa-milvus  -o wide | grep -E 'STATUS|perf-single-16926200-2-99-7819-milvus|perf-single-16926200-2-99-7819-minio|perf-single-16926200-2-99-7819-etcd|perf-single-16926200-2-99-7819-pulsar|perf-single-16926200-2-99-7819-kafka'  (util_cmd.py:14)
[2023-08-03 02:09:26,360 -  INFO - fouram]: [CliClient] pod details of release(perf-single-16926200-2-99-7819): 
 I0803 02:09:18.594368     513 request.go:665] Waited for 1.168276716s due to client-side throttling, not priority and fairness, request: GET:https://kubernetes.default.svc.cluster.local/apis/authorization.k8s.io/v1?timeout=32s
NAME                                                              READY   STATUS                   RESTARTS          AGE     IP              NODE         NOMINATED NODE   READINESS GATES
perf-single-16926200-2-99-7819-etcd-0                             1/1     Running                  0                 16m     10.104.15.73    4am-node20   <none>           <none>
perf-single-16926200-2-99-7819-milvus-standalone-dd55787496vqwm   1/1     Running                  1 (2m42s ago)     16m     10.104.9.220    4am-node14   <none>           <none>
perf-single-16926200-2-99-7819-minio-9975d55f-sc5wm               1/1     Running                  0                 16m     10.104.15.71    4am-node20   <none>           <none> (cli_client.py:132)

Milvus log:
[Screenshot attached: 2023-08-04 15:09:54]

Expected Behavior

The collection loads successfully.

Steps To Reproduce

Test steps:

        1. create a collection
        2. build an HNSW index on the vector column
        3. insert 1m 768d vectors
        4. flush the collection
        5. build the index again with the same parameters
        6. optionally build an index on scalar columns
        7. count the total number of rows
        8. load the collection --> raises the error (see the pymilvus sketch below)
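
A minimal pymilvus sketch of these steps, assuming a local standalone instance; the collection name, field names, and address are illustrative, not taken from the test code:

```python
import numpy as np
from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

connections.connect(host="127.0.0.1", port="19530")    # assumed address

# 1. create a collection with a 768-dim float vector field
fields = [
    FieldSchema("id", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("float_vector", DataType.FLOAT_VECTOR, dim=768),
]
collection = Collection("repro_gist1m", CollectionSchema(fields))

# 2. build an HNSW index on the vector column (same params as the test)
index_params = {"index_type": "HNSW", "metric_type": "L2",
                "params": {"M": 8, "efConstruction": 200}}
collection.create_index("float_vector", index_params)

# 3./4. insert 1m 768d vectors in batches of ni_per=1000, then flush
for _ in range(1000):
    collection.insert([np.random.random((1000, 768)).tolist()])
collection.flush()

# 5. build the index again with the same parameters
collection.create_index("float_vector", index_params)

# 7. count the total number of rows
print(collection.num_entities)

# 8. load the collection -- this is where the reported MilvusException is raised
collection.load(replica_number=1)
```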

Milvus Log

No response

Anything else?

No response

jingkl added the kind/bug and needs-triage labels Aug 4, 2023
yanliang567 (Contributor) commented:
It seems that Milvus restarted because of an etcd connection refused error, but this test already ran on the 4am cluster, which is configured with fast SSD volumes.
/assign @jiaoew1991
/unassign

yanliang567 added the triage/accepted label and removed the needs-triage label Aug 4, 2023
yanliang567 added this to the 2.3 milestone Aug 4, 2023
jingkl (Contributor, Author) commented Aug 4, 2023

This problem should not be caused by etcd; it looks like a querynode problem.

jiaoew1991 (Contributor) commented:
/assign @yah01
/unassign

yah01 (Member) commented Aug 7, 2023

/assign @jingkl
Please check with #26135.

jingkl (Contributor, Author) commented Aug 11, 2023

release_name_prefix: perf-single-1691717400
deploy_config: fouramf-server-standalone-8c16m
case_params: fouramf-client-gist1m-concurrent-hnsw
case_name: test_concurrent_locust_custom_parameters

image: master-20230810-0f9aa5fb

Already fixed; closing this issue.

@jingkl jingkl closed this as completed Aug 11, 2023