
[Bug]: Insert timeout when running test_e2e.py after Milvus recovered from datanode pod kill chaos #17537

Closed
zhuwenxing opened this issue Jun 14, 2022 · 13 comments
Labels
kind/bug Issues or changes related to a bug · stale Indicates no updates for 30 days · triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20220613-e9dcda16
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus==2.1.0.dev69
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Insert times out when running test_e2e.py after Milvus recovered from datanode pod kill chaos.
Creating the collection also took a long time: about 23 s.

[2022-06-13 18:55:26 - INFO - ci_test]: assert create collection: 23.063071489334106, init_entities: 0 (test_e2e.py:24)
[2022-06-13 18:55:47 - ERROR - pymilvus.decorators]: grpc RpcError: [bulk_insert], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-06-13 18:55:27.057163', 'gRPC error': '2022-06-13 18:55:47.111122'}> (decorators.py:86)
[2022-06-13 18:55:47 - ERROR - ci_test]: Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 44, in handler
    return func(self, *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 87, in handler
    raise e
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 75, in handler
    return func(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 358, in bulk_insert
    raise err
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line [34](https://github.com/milvus-io/milvus/runs/6867486184?check_suite_focus=true#step:13:35)8, in bulk_insert
    response = rf.result()
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/grpc/_channel.py", line 744, in result
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1655146547.110736288","description":"Error received from peer ipv4:127.0.0.1:19530","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/runner/work/milvus/milvus/tests/python_client/utils/api_request.py", line 22, in inner_wrapper
    res = func(*args, **kwargs)
  File "/home/runner/work/milvus/milvus/tests/python_client/utils/api_request.py", line 56, in api_request
    return func(*arg, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 529, in insert
    res = conn.bulk_insert(self._name, entities, partition_name, ids=None, timeout=timeout, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 55, in handler
    raise MilvusException(Status.UNEXPECTED_ERROR, "rpc timeout")
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=rpc timeout)>
 (api_request.py:35)
[2022-06-13 18:55:47 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=rpc timeout)> (api_request.py:36)
FAILED

Expected Behavior

All test cases pass.

Steps To Reproduce

see https://github.com/milvus-io/milvus/runs/6867486184?check_suite_focus=true
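
For context, a minimal sketch of the kind of flow the failing test exercises (assumed names and parameters; this is not the actual test_e2e.py under tests/python_client): connect to Milvus on 127.0.0.1:19530, create a collection, then insert entities with a client-side timeout. The insert call is the one that hit DEADLINE_EXCEEDED above.

# Hedged repro sketch; the collection name, dim, and timeout are illustrative assumptions.
import random
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="127.0.0.1", port="19530")

fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("e2e_repro_demo", CollectionSchema(fields))  # took ~23 s in the failed run

nb = 3000
entities = [
    list(range(nb)),                                              # primary keys
    [[random.random() for _ in range(128)] for _ in range(nb)],   # vectors
]
collection.insert(entities, timeout=20)  # this RPC returned DEADLINE_EXCEEDED after the datanode pod kill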

Milvus Log

failed job: https://github.com/milvus-io/milvus/runs/6867486184?check_suite_focus=true
log: https://github.com/milvus-io/milvus/suites/6913763113/artifacts/268667204

Anything else?

Some other issues caused by datanode pod kill:

#17335

#17366

@zhuwenxing zhuwenxing added the kind/bug (Issues or changes related to a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one) labels on Jun 14, 2022
@yanliang567
Contributor

similar insert failure in issue #17524.

/assign @bigsheeper
/unassign

@yanliang567 yanliang567 added the triage/accepted (Indicates an issue or PR is ready to be actively worked on) label and removed the needs-triage label on Jun 14, 2022
@yanliang567 yanliang567 added this to the 2.1-RC1 milestone Jun 14, 2022
@zhuwenxing
Contributor Author

@soothing-rain
Contributor

DataCoord crashed and was not able to come back.

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    16m                 default-scheduler  Successfully assigned chaos-testing/test-datanode-pod-kill-milvus-datacoord-68ccd75845-czbxp to chart-testing-control-plane
  Warning  FailedMount  16m (x2 over 16m)   kubelet            MountVolume.SetUp failed for volume "milvus-config" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulling      16m                 kubelet            Pulling image "milvusdb/milvus-dev:master-latest"
  Normal   Pulled       15m                 kubelet            Successfully pulled image "milvusdb/milvus-dev:master-latest" in 46.644608056s
  Normal   Created      15m                 kubelet            Created container datacoord
  Normal   Started      15m                 kubelet            Started container datacoord
  Warning  Unhealthy    11m (x5 over 13m)   kubelet            Liveness probe failed: HTTP probe failed with statuscode: 500
  Normal   Killing      11m                 kubelet            Container datacoord failed liveness probe, will be restarted
  Warning  Unhealthy    10m (x14 over 13m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

Could be the same issue as in #17335.

@bigsheeper
Contributor

DataCoord keeps creating and closing consumers over and over again:

time="2022-06-14T18:51:23Z" level=info msg="[Connected consumer]" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Created consumer]" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
[2022/06/14 18:51:23.505 +00:00] [INFO] [mq_msgstream.go:176] ["Successfully create consumer"] [channel=by-dev-rootcoord-dml_53] [subname=by-dev-dataNode-30-433909232513318913]
time="2022-06-14T18:51:23Z" level=info msg="The consumer[10985] successfully unsubscribed" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="Closing consumer=10985" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Closed consumer]" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
[2022/06/14 18:51:23.512 +00:00] [INFO] [msgstream_util.go:29] ["unsubscribe channel"] [subname=by-dev-dataNode-30-433909232513318913] [channels="[by-dev-rootcoord-dml_53]"]
time="2022-06-14T18:51:23Z" level=info msg="[Connected consumer]" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Created consumer]" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
[2022/06/14 18:51:23.528 +00:00] [INFO] [mq_msgstream.go:176] ["Successfully create consumer"] [channel=by-dev-rootcoord-dml_53] [subname=by-dev-dataNode-30-433909232513318913]
time="2022-06-14T18:51:23Z" level=info msg="The consumer[10986] successfully unsubscribed" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="Closing consumer=10986" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Closed consumer]" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
[2022/06/14 18:51:23.533 +00:00] [INFO] [msgstream_util.go:29] ["unsubscribe channel"] [subname=by-dev-dataNode-30-433909232513318913] [channels="[by-dev-rootcoord-dml_53]"]
time="2022-06-14T18:51:23Z" level=info msg="[Connected consumer]" consumerID=10987 name=teclh subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Created consumer]" consumerID=10987 name=teclh subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"

@zhuwenxing
Contributor Author

XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue Jun 21, 2022
See also: milvus-io#17537

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
@zhuwenxing
Contributor Author

sre-ci-robot pushed a commit that referenced this issue Jun 22, 2022
See also: #17537

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
@bigsheeper
Contributor

bigsheeper commented Jun 22, 2022

DataCoord unsubscribes the same channel over and over again:

[screenshot: DataCoord log showing the same channel being unsubscribed repeatedly]

There appear to be many duplicate channels in NodeChannelInfo's Channels, which is logically incorrect.

[screenshot: duplicated channel entries in NodeChannelInfo's Channels]

My guess is that DataCoord spent a lot of time unsubscribing these many duplicated channels, which caused it to fail to start or to evict the old DataNode.
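
To make the suspected failure mode concrete, here is a toy sketch in Python with hypothetical names (the real DataCoord is Go code and is not shown here): if the per-node channel list accumulates duplicate entries, cleanup repeats the same unsubscribe call once per entry, so deduplicating first keeps the work proportional to the number of distinct channels.

# Toy illustration only; not Milvus/DataCoord code. `unsubscribe` stands in for
# the per-channel cleanup that shows up repeatedly in the logs above.
def cleanup_channels(channels, unsubscribe):
    # With duplicates, e.g. ["by-dev-rootcoord-dml_53"] * 10000, this loop
    # issues the same unsubscribe 10000 times and cleanup crawls.
    for ch in channels:
        unsubscribe(ch)

def cleanup_channels_dedup(channels, unsubscribe):
    # Deduplicating (order-preserving) first bounds the loop by the number
    # of distinct channels instead of the number of recorded entries.
    for ch in dict.fromkeys(channels):
        unsubscribe(ch)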

@zhuwenxing zhuwenxing added the priority/critical-urgent (Highest priority. Must be actively worked on as someone's top priority right now.) and severity/critical (Critical: leads to crashes, data loss, wrong results, or a function that doesn't work at all.) labels on Jun 22, 2022
@xiaofan-luan
Collaborator

@bigsheeper why are there duplicated channels?

@bigsheeper
Contributor

The PR has been merged, could you please verify? @zhuwenxing

@bigsheeper
Contributor

/assign @zhuwenxing

@zhuwenxing
Contributor Author

Not reproduced recently. Removing the critical label.

@zhuwenxing zhuwenxing removed the priority/critical-urgent and severity/critical labels on Jun 27, 2022
@stale

stale bot commented Jul 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale (indicates no updates for 30 days) label on Jul 27, 2022
@stale stale bot closed this as completed Aug 3, 2022