
[Bug]: Insert timeout when running test_e2e.py after Milvus recovered from datanode pod kill chaos #17537

Closed
zhuwenxing opened this issue Jun 14, 2022 · 13 comments
Labels
kind/bug Issues or changes related to a bug · stale Indicates no updates for 30 days · triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20220613-e9dcda16
- Deployment mode(standalone or cluster):cluster
- SDK version(e.g. pymilvus v2.0.0rc2):pymilvus==2.1.0.dev69
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Insert times out when running test_e2e.py after Milvus recovered from datanode pod kill chaos.
Creating the collection also took a long time: about 23 s.

[2022-06-13 18:55:26 - INFO - ci_test]: assert create collection: 23.063071489334106, init_entities: 0 (test_e2e.py:24)
[2022-06-13 18:55:47 - ERROR - pymilvus.decorators]: grpc RpcError: [bulk_insert], <_MultiThreadedRendezvous: StatusCode.DEADLINE_EXCEEDED, Deadline Exceeded>, <Time:{'RPC start': '2022-06-13 18:55:27.057163', 'gRPC error': '2022-06-13 18:55:47.111122'}> (decorators.py:86)
[2022-06-13 18:55:47 - ERROR - ci_test]: Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 44, in handler
    return func(self, *args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 87, in handler
    raise e
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 75, in handler
    return func(*args, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 358, in bulk_insert
    raise err
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line [34](https://github.com/milvus-io/milvus/runs/6867486184?check_suite_focus=true#step:13:35)8, in bulk_insert
    response = rf.result()
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/grpc/_channel.py", line 744, in result
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1655146547.110736288","description":"Error received from peer ipv4:127.0.0.1:19530","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/runner/work/milvus/milvus/tests/python_client/utils/api_request.py", line 22, in inner_wrapper
    res = func(*args, **kwargs)
  File "/home/runner/work/milvus/milvus/tests/python_client/utils/api_request.py", line 56, in api_request
    return func(*arg, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 529, in insert
    res = conn.bulk_insert(self._name, entities, partition_name, ids=None, timeout=timeout, **kwargs)
  File "/opt/hostedtoolcache/Python/3.8.12/x64/lib/python3.8/site-packages/pymilvus/decorators.py", line 55, in handler
    raise MilvusException(Status.UNEXPECTED_ERROR, "rpc timeout")
pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=rpc timeout)>
 (api_request.py:35)
[2022-06-13 18:55:47 - ERROR - ci_test]: (api_response) : <MilvusException: (code=1, message=rpc timeout)> (api_request.py:36)
FAILED

Expected Behavior

All test cases pass.

Steps To Reproduce

see https://github.com/milvus-io/milvus/runs/6867486184?check_suite_focus=true
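
For context, a minimal sketch of the kind of flow the failing test exercises (assumed names and parameters; this is not the actual test_e2e.py under tests/python_client): connect to Milvus on 127.0.0.1:19530, create a collection, then insert entities with a client-side timeout. The insert call is the one that hit DEADLINE_EXCEEDED above.

# Hedged repro sketch; the collection name, dim, and timeout are illustrative assumptions.
import random
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="127.0.0.1", port="19530")

fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("e2e_repro_demo", CollectionSchema(fields))  # took ~23 s in the failed run

nb = 3000
entities = [
    list(range(nb)),                                              # primary keys
    [[random.random() for _ in range(128)] for _ in range(nb)],   # vectors
]
collection.insert(entities, timeout=20)  # this RPC returned DEADLINE_EXCEEDED after the datanode pod kill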

Milvus Log

failed job: https://github.com/milvus-io/milvus/runs/6867486184?check_suite_focus=true
log: https://github.com/milvus-io/milvus/suites/6913763113/artifacts/268667204

Anything else?

Some other issues caused by datanode pod kill:

#17335

#17366

@zhuwenxing zhuwenxing added the kind/bug (Issues or changes related to a bug) and needs-triage (Indicates an issue or PR lacks a `triage/foo` label and requires one) labels on Jun 14, 2022
@yanliang567
Contributor

similar insert failure in issue #17524.

/assign @bigsheeper
/unassign

@yanliang567 yanliang567 added the triage/accepted (Indicates an issue or PR is ready to be actively worked on) label and removed the needs-triage label on Jun 14, 2022
@yanliang567 yanliang567 added this to the 2.1-RC1 milestone Jun 14, 2022
@zhuwenxing
Contributor Author

@soothing-rain
Contributor

DataCoord crashed and was not able to come back.

Events:
  Type     Reason       Age                 From               Message
  ----     ------       ----                ----               -------
  Normal   Scheduled    16m                 default-scheduler  Successfully assigned chaos-testing/test-datanode-pod-kill-milvus-datacoord-68ccd75845-czbxp to chart-testing-control-plane
  Warning  FailedMount  16m (x2 over 16m)   kubelet            MountVolume.SetUp failed for volume "milvus-config" : failed to sync configmap cache: timed out waiting for the condition
  Normal   Pulling      16m                 kubelet            Pulling image "milvusdb/milvus-dev:master-latest"
  Normal   Pulled       15m                 kubelet            Successfully pulled image "milvusdb/milvus-dev:master-latest" in 46.644608056s
  Normal   Created      15m                 kubelet            Created container datacoord
  Normal   Started      15m                 kubelet            Started container datacoord
  Warning  Unhealthy    11m (x5 over 13m)   kubelet            Liveness probe failed: HTTP probe failed with statuscode: 500
  Normal   Killing      11m                 kubelet            Container datacoord failed liveness probe, will be restarted
  Warning  Unhealthy    10m (x14 over 13m)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 500

Could be the same issue as in #17335.

@bigsheeper
Contributor

DataCoord keeps creating and closing consumers over and over again:

time="2022-06-14T18:51:23Z" level=info msg="[Connected consumer]" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Created consumer]" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
[2022/06/14 18:51:23.505 +00:00] [INFO] [mq_msgstream.go:176] ["Successfully create consumer"] [channel=by-dev-rootcoord-dml_53] [subname=by-dev-dataNode-30-433909232513318913]
time="2022-06-14T18:51:23Z" level=info msg="The consumer[10985] successfully unsubscribed" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="Closing consumer=10985" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Closed consumer]" consumerID=10985 name=dgnjw subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
[2022/06/14 18:51:23.512 +00:00] [INFO] [msgstream_util.go:29] ["unsubscribe channel"] [subname=by-dev-dataNode-30-433909232513318913] [channels="[by-dev-rootcoord-dml_53]"]
time="2022-06-14T18:51:23Z" level=info msg="[Connected consumer]" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Created consumer]" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
[2022/06/14 18:51:23.528 +00:00] [INFO] [mq_msgstream.go:176] ["Successfully create consumer"] [channel=by-dev-rootcoord-dml_53] [subname=by-dev-dataNode-30-433909232513318913]
time="2022-06-14T18:51:23Z" level=info msg="The consumer[10986] successfully unsubscribed" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="Closing consumer=10986" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Closed consumer]" consumerID=10986 name=tqqhl subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
[2022/06/14 18:51:23.533 +00:00] [INFO] [msgstream_util.go:29] ["unsubscribe channel"] [subname=by-dev-dataNode-30-433909232513318913] [channels="[by-dev-rootcoord-dml_53]"]
time="2022-06-14T18:51:23Z" level=info msg="[Connected consumer]" consumerID=10987 name=teclh subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"
time="2022-06-14T18:51:23Z" level=info msg="[Created consumer]" consumerID=10987 name=teclh subscription=by-dev-dataNode-30-433909232513318913 topic="persistent://public/default/by-dev-rootcoord-dml_53"

@zhuwenxing
Contributor Author

XuanYang-cn added a commit to XuanYang-cn/milvus that referenced this issue Jun 21, 2022
See also: milvus-io#17537

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
@zhuwenxing
Contributor Author

sre-ci-robot pushed a commit that referenced this issue Jun 22, 2022
See also: #17537

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
@bigsheeper
Contributor

bigsheeper commented Jun 22, 2022

DataCoord unsubscribes the same channel over and over again:

[screenshot: DataCoord log showing the same channel being unsubscribed repeatedly]

There appear to be many duplicate channels in NodeChannelInfo's Channels, which is logically incorrect.

[screenshot: duplicated channel entries in NodeChannelInfo's Channels]

My guess is that DataCoord spent a lot of time unsubscribing these many duplicated channels, which caused it to fail to start or to evict the old DataNode.
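
To make the suspected failure mode concrete, here is a toy sketch in Python with hypothetical names (the real DataCoord is Go code and is not shown here): if the per-node channel list accumulates duplicate entries, cleanup repeats the same unsubscribe call once per entry, so deduplicating first keeps the work proportional to the number of distinct channels.

# Toy illustration only; not Milvus/DataCoord code. `unsubscribe` stands in for
# the per-channel cleanup that shows up repeatedly in the logs above.
def cleanup_channels(channels, unsubscribe):
    # With duplicates, e.g. ["by-dev-rootcoord-dml_53"] * 10000, this loop
    # issues the same unsubscribe 10000 times and cleanup crawls.
    for ch in channels:
        unsubscribe(ch)

def cleanup_channels_dedup(channels, unsubscribe):
    # Deduplicating (order-preserving) first bounds the loop by the number
    # of distinct channels instead of the number of recorded entries.
    for ch in dict.fromkeys(channels):
        unsubscribe(ch)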

@zhuwenxing zhuwenxing added the priority/critical-urgent (Highest priority. Must be actively worked on as someone's top priority right now.) and severity/critical (Critical: leads to crashes, data loss, wrong results, or a function that doesn't work at all.) labels on Jun 22, 2022
@xiaofan-luan
Collaborator

@bigsheeper why are there duplicated channels?

@bigsheeper
Contributor

The PR has been merged, could you please verify? @zhuwenxing

@bigsheeper
Contributor

/assign @zhuwenxing

@zhuwenxing
Contributor Author

Not reproduced recently. Removing the critical label.

@zhuwenxing zhuwenxing removed the priority/critical-urgent and severity/critical labels on Jun 27, 2022
@stale

stale bot commented Jul 27, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale (indicates no updates for 30 days) label on Jul 27, 2022
@stale stale bot closed this as completed Aug 3, 2022