[Bug]: [one pod standalone] when Milvus recovers from pod kill chaos, most of its interfaces are not available #30314

Open
zhuwenxing opened this issue Jan 26, 2024 · 15 comments · Fixed by #32048
Labels: kind/bug, severity/critical, triage/accepted

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20240126-7ced0af1-amd64
- Deployment mode(standalone or cluster):standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. all search/query requests failed: collection not loaded
  2. all flush requests failed:

[2024-01-26T07:50:39.687Z] <name>: Hello_Milvus

[2024-01-26T07:50:39.687Z] <description>: 

[2024-01-26T07:50:39.687Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'n......  (api_request.py:37)

[2024-01-26T07:50:39.687Z] [2024-01-26 07:47:32 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-01-26T07:50:39.687Z] [2024-01-26 07:50:32 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_447284044378545664v0])>, <Time:{'RPC start': '2024-01-26 07:47:32.703287', 'RPC error': '2024-01-26 07:50:32.649750'}> (decorators.py:134)

[2024-01-26T07:50:39.687Z] [2024-01-26 07:50:32 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-26T07:50:39.687Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-26T07:50:39.688Z]     res = func(*args, **_kwargs)

[2024-01-26T07:50:39.688Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-26T07:50:39.688Z]     return func(*arg, **kwargs)

[2024-01-26T07:50:39.688Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 314, in flush

[2024-01-26T07:50:39.688Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-01-26T07:50:39.688Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 135, in handler

[2024-01-26T07:50:39.688Z]     raise e from e

[2024-01-26T07:50:39.688Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 131, in handler

[2024-01-26T07:50:39.688Z]     return func(*args, **kwargs)

[2024-01-26T07:50:39.688Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 170, in handler

[2024-01-26T07:50:39.688Z]     return func(self, *args, **kwargs)

[2024-01-26T07:50:39.688Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 110, in handler

[2024-01-26T07:50:39.688Z]     raise e from e

[2024-01-26T07:50:39.688Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 74, in handler

[2024-01-26T07:50:39.688Z]     return func(*args, **kwargs)

[2024-01-26T07:50:39.688Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1335, in flush

[2024-01-26T07:50:39.688Z]     check_status(response.status)

[2024-01-26T07:50:39.688Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 58, in check_status

[2024-01-26T07:50:39.688Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-26T07:50:39.688Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_447284044378545664v0])>

[2024-01-26T07:50:39.688Z]  (api_request.py:45)

[2024-01-26T07:53:36.970Z] <name>: QueryChecker__q3nvggGS

[2024-01-26T07:53:36.970Z] <description>: 

[2024-01-26T07:53:36.970Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT:......  (api_request.py:37)

[2024-01-26T07:53:36.970Z] [2024-01-26 07:44:54 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-01-26T07:53:36.971Z] [2024-01-26 07:47:32 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=65535, message=failed to flush collection 447284044378546654: etcdserver: mvcc: database space exceeded)>, <Time:{'RPC start': '2024-01-26 07:44:54.643498', 'RPC error': '2024-01-26 07:47:32.568409'}> (decorators.py:134)

[2024-01-26T07:53:36.971Z] [2024-01-26 07:47:32 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-26T07:53:36.971Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-26T07:53:36.971Z]     res = func(*args, **_kwargs)

[2024-01-26T07:53:36.971Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-26T07:53:36.971Z]     return func(*arg, **kwargs)

[2024-01-26T07:53:36.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 314, in flush

[2024-01-26T07:53:36.971Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-01-26T07:53:36.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 135, in handler

[2024-01-26T07:53:36.971Z]     raise e from e

[2024-01-26T07:53:36.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 131, in handler

[2024-01-26T07:53:36.971Z]     return func(*args, **kwargs)

[2024-01-26T07:53:36.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 170, in handler

[2024-01-26T07:53:36.971Z]     return func(self, *args, **kwargs)

[2024-01-26T07:53:36.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 110, in handler

[2024-01-26T07:53:36.971Z]     raise e from e

[2024-01-26T07:53:36.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 74, in handler

[2024-01-26T07:53:36.971Z]     return func(*args, **kwargs)

[2024-01-26T07:53:36.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1335, in flush

[2024-01-26T07:53:36.971Z]     check_status(response.status)

[2024-01-26T07:53:36.971Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 58, in check_status

[2024-01-26T07:53:36.971Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-26T07:53:36.971Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=failed to flush collection 447284044378546654: etcdserver: mvcc: database space exceeded)>

[2024-01-26T07:53:36.971Z]  (api_request.py:45)
  3. creating new collections failed:
[2024-01-26T07:44:41.455Z] [2024-01-26 07:44:41 - DEBUG - ci_test]: (api_request)  : [Collection] args: ['e2e__6Q9S3j7j', {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'varchar', 'description': '', 'type': <DataType.VAR......, kwargs: {'consistency_level': 'Strong'} (api_request.py:62)

[2024-01-26T07:44:41.455Z] [2024-01-26 07:44:41 - ERROR - pymilvus.decorators]: RPC error: [create_collection], <MilvusException: (code=65535, message=etcdserver: mvcc: database space exceeded)>, <Time:{'RPC start': '2024-01-26 07:44:41.083880', 'RPC error': '2024-01-26 07:44:41.086827'}> (decorators.py:134)

[2024-01-26T07:44:41.455Z] [2024-01-26 07:44:41 - ERROR - ci_test]: Traceback (most recent call last):

[2024-01-26T07:44:41.455Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-01-26T07:44:41.455Z]     res = func(*args, **_kwargs)

[2024-01-26T07:44:41.455Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-01-26T07:44:41.455Z]     return func(*arg, **kwargs)

[2024-01-26T07:44:41.455Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 147, in __init__

[2024-01-26T07:44:41.455Z]     conn.create_collection(self._name, schema, **kwargs)

[2024-01-26T07:44:41.455Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 135, in handler

[2024-01-26T07:44:41.455Z]     raise e from e

[2024-01-26T07:44:41.455Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 131, in handler

[2024-01-26T07:44:41.455Z]     return func(*args, **kwargs)

[2024-01-26T07:44:41.455Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 170, in handler

[2024-01-26T07:44:41.455Z]     return func(self, *args, **kwargs)

[2024-01-26T07:44:41.455Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 110, in handler

[2024-01-26T07:44:41.455Z]     raise e from e

[2024-01-26T07:44:41.455Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 74, in handler

[2024-01-26T07:44:41.455Z]     return func(*args, **kwargs)

[2024-01-26T07:44:41.455Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 304, in create_collection

[2024-01-26T07:44:41.455Z]     check_status(status)

[2024-01-26T07:44:41.455Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 58, in check_status

[2024-01-26T07:44:41.455Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-01-26T07:44:41.455Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=65535, message=etcdserver: mvcc: database space exceeded)>

[2024-01-26T07:44:41.455Z]  (api_request.py:45)

[2024-01-26T07:44:41.455Z] [2024-01-26 07:44:41 - ERROR - ci_test]: (api_response) : <MilvusException: (code=65535, message=etcdserver: mvcc: database space exceeded)> (api_request.py:46)
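
For triage, it can help to pull the RPC name, error code, message, and elapsed time out of the `RPC error` lines above. The following is a minimal sketch, assuming the log lines keep the format shown in this issue; `parse_rpc_error` and `SAMPLE_LINE` are hypothetical names, not part of pymilvus or the test framework.

```python
import re
from datetime import datetime

# Matches the pymilvus "RPC error" lines quoted in this issue.
# Assumption: the line layout stays exactly as shown in the logs above.
RPC_ERROR_RE = re.compile(
    r"RPC error: \[(?P<rpc>\w+)\], "
    r"<MilvusException: \(code=(?P<code>\d+), message=(?P<message>.*)\)>, "
    r"<Time:\{'RPC start': '(?P<start>[^']+)', 'RPC error': '(?P<end>[^']+)'\}>"
)
TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

# One of the flush failures from this report, copied verbatim.
SAMPLE_LINE = (
    "[2024-01-26 07:50:32 - ERROR - pymilvus.decorators]: RPC error: [flush], "
    "<MilvusException: (code=500, message=channel not found"
    "[channel=by-dev-rootcoord-dml_2_447284044378545664v0])>, "
    "<Time:{'RPC start': '2024-01-26 07:47:32.703287', "
    "'RPC error': '2024-01-26 07:50:32.649750'}> (decorators.py:134)"
)


def parse_rpc_error(line):
    """Return a dict describing the failed RPC, or None if the line does not match."""
    match = RPC_ERROR_RE.search(line)
    if match is None:
        return None
    start = datetime.strptime(match.group("start"), TIME_FORMAT)
    end = datetime.strptime(match.group("end"), TIME_FORMAT)
    return {
        "rpc": match.group("rpc"),
        "code": int(match.group("code")),
        "message": match.group("message"),
        "elapsed_seconds": (end - start).total_seconds(),
    }


if __name__ == "__main__":
    print(parse_rpc_error(SAMPLE_LINE))
```

On the sample line the elapsed time comes out just under 180 seconds, which matches the `timeout: 180` passed to `Collection.flush`. That suggests the flush waited for almost the full timeout before `channel not found` was returned. The `create_collection` failure, by contrast, returned within a few milliseconds.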

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/10981/pipeline

log:
artifacts-one-pod-standalone-pod-kill-10981-server-logs.tar.gz

Anything else?

No response

@zhuwenxing zhuwenxing added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 26, 2024
@zhuwenxing zhuwenxing added this to the 2.4.0 milestone Jan 26, 2024
@zhuwenxing zhuwenxing added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. labels Jan 26, 2024
@zhuwenxing
Contributor Author

/assign @LoveEachDay

PTAL

@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 26, 2024
@yanliang567 yanliang567 removed their assignment Jan 26, 2024
@xiaofan-luan
Collaborator

"database space exceeded": it seems that etcd is failing.

@zhuwenxing
Contributor Author

zhuwenxing commented Mar 6, 2024

the error message has changed in master-20240305-3c9ffded
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/12157/pipeline
log:
artifacts-one-pod-standalone-pod-kill-12157-server-logs.tar.gz


[2024-03-05T17:16:34.268Z] <name>: Hello_Milvus

[2024-03-05T17:16:34.268Z] <description>: 

[2024-03-05T17:16:34.268Z] <schema>: {'auto_id': False, 'description': '', 'fields': [{'name': 'int64', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float', 'description': '', 'type': <DataType.FLOAT: 10>}, {'n......  (api_request.py:37)

[2024-03-05T17:16:34.268Z] [2024-03-05 17:13:27 - DEBUG - ci_test]: (api_request)  : [Collection.flush] args: [], kwargs: {'timeout': 180} (api_request.py:62)

[2024-03-05T17:16:34.268Z] [2024-03-05 17:16:25 - ERROR - pymilvus.decorators]: RPC error: [flush], <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_448176180282400233v0])>, <Time:{'RPC start': '2024-03-05 17:13:27.030313', 'RPC error': '2024-03-05 17:16:25.634781'}> (decorators.py:134)

[2024-03-05T17:16:34.268Z] [2024-03-05 17:16:25 - ERROR - ci_test]: Traceback (most recent call last):

[2024-03-05T17:16:34.268Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 32, in inner_wrapper

[2024-03-05T17:16:34.268Z]     res = func(*args, **_kwargs)

[2024-03-05T17:16:34.268Z]   File "/home/jenkins/agent/workspace/tests/python_client/utils/api_request.py", line 63, in api_request

[2024-03-05T17:16:34.268Z]     return func(*arg, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/orm/collection.py", line 314, in flush

[2024-03-05T17:16:34.268Z]     conn.flush([self.name], timeout=timeout, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 135, in handler

[2024-03-05T17:16:34.268Z]     raise e from e

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 131, in handler

[2024-03-05T17:16:34.268Z]     return func(*args, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 170, in handler

[2024-03-05T17:16:34.268Z]     return func(self, *args, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 110, in handler

[2024-03-05T17:16:34.268Z]     raise e from e

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/decorators.py", line 74, in handler

[2024-03-05T17:16:34.268Z]     return func(*args, **kwargs)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/grpc_handler.py", line 1396, in flush

[2024-03-05T17:16:34.268Z]     check_status(response.status)

[2024-03-05T17:16:34.268Z]   File "/usr/local/lib/python3.8/dist-packages/pymilvus/client/utils.py", line 60, in check_status

[2024-03-05T17:16:34.268Z]     raise MilvusException(status.code, status.reason, status.error_code)

[2024-03-05T17:16:34.268Z] pymilvus.exceptions.MilvusException: <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_448176180282400233v0])>

[2024-03-05T17:16:34.268Z]  (api_request.py:45)

[2024-03-05T17:16:34.268Z] [2024-03-05 17:16:25 - ERROR - ci_test]: (api_response) : <MilvusException: (code=500, message=channel not found[channel=by-dev-rootcoord-dml_2_448176180282400233v0])> (api_request.py:46)

@LoveEachDay
Contributor

> database space exceeded
> seems that the etcd fails

We'd change the default etcd settings for embedded mode.
@pingliu Please add the following config to embedEtcd.yaml:

quota-backend-bytes: '4294967296'
auto-compaction-mode: 'revision'
auto-compaction-retention: '1000'
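
To make the three values above easier to read, here is a minimal sketch that parses the snippet and converts the quota to GiB. `parse_flat_yaml` is a hypothetical helper that only handles flat `key: 'value'` lines; it is not a general YAML parser and not part of Milvus or etcd.

```python
def parse_flat_yaml(text):
    """Parse flat `key: 'value'` lines into a dict (illustration only)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        # Drop surrounding whitespace and quotes from the value.
        settings[key.strip()] = value.strip().strip("'\"")
    return settings


# The config proposed in this comment, copied verbatim.
EMBED_ETCD_SNIPPET = """
quota-backend-bytes: '4294967296'
auto-compaction-mode: 'revision'
auto-compaction-retention: '1000'
"""

settings = parse_flat_yaml(EMBED_ETCD_SNIPPET)
quota_gib = int(settings["quota-backend-bytes"]) / 1024**3

if __name__ == "__main__":
    print(f"backend quota: {quota_gib} GiB")
    print(f"compaction: {settings['auto-compaction-mode']}, "
          f"retention {settings['auto-compaction-retention']}")
```

The quota works out to 4 GiB. As far as I understand etcd's options, `revision` mode with a retention of `1000` asks etcd to periodically compact history older than the latest 1000 revisions, which is what should keep the backend below the quota.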

@zhuwenxing
Contributor Author

still reproduced
see #30545 (comment)

@LoveEachDay
Contributor

LoveEachDay commented Apr 9, 2024

@LoveEachDay We'd change the auto-compaction config for embedded etcd, which should mitigate the `mvcc: database space exceeded` problem.

sre-ci-robot pushed a commit that referenced this issue Apr 12, 2024
Fix #30314

Signed-off-by: Edward Zeng <jie.zeng@zilliz.com>
@zhuwenxing zhuwenxing reopened this Apr 15, 2024
@zhuwenxing
Contributor Author

The `mvcc: database space exceeded` problem is no longer reproduced after #32048,

but the `channel not found` problem is still reproduced.

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/13356/pipeline
log:
artifacts-one-pod-standalone-pod-kill-13356-server-logs.tar.gz
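
Since this issue now covers more than one symptom, a small classifier can help check which ones still show up in a given build. This is a minimal sketch: the substrings are taken from the error messages quoted in this thread, while `SIGNATURES`, `classify`, and `summarize` are hypothetical names for illustration.

```python
from collections import Counter

# Substrings copied from the error messages reported in this issue.
SIGNATURES = {
    "etcd_quota": "mvcc: database space exceeded",
    "channel_not_found": "channel not found",
    "collection_not_loaded": "collection not loaded",
}


def classify(message):
    """Return the label of the first known signature found in the message."""
    for label, needle in SIGNATURES.items():
        if needle in message:
            return label
    return "other"


def summarize(messages):
    """Count how often each symptom appears in a list of error messages."""
    return Counter(classify(message) for message in messages)


if __name__ == "__main__":
    sample = [
        "channel not found[channel=by-dev-rootcoord-dml_2_448176180282400233v0]",
        "etcdserver: mvcc: database space exceeded",
    ]
    print(summarize(sample))
```

Feeding it the messages from a failed job gives a quick count per symptom, so a run after #32048 would be expected to show `channel_not_found` but no `etcd_quota` entries.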

@xiaofan-luan
Collaborator

/assign @weiliu1031

@XuanYang-cn
Contributor

/assign

@yanliang567 yanliang567 modified the milestones: 2.4.11, 2.4.12 Sep 18, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.12, 2.4.13 Sep 27, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.13, 2.4.14 Oct 15, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.14, 2.4.16 Nov 14, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.16, 2.4.17, 2.4.18 Nov 21, 2024
@XuanYang-cn
Contributor

/unassign @LoveEachDay @XuanYang-cn @weiliu1031
/assign @zhuwenxing
Is this still a problem? If so, please assign it to me later.

@zhuwenxing
Contributor Author

/assign @XuanYang-cn
/unassign

@yanliang567 yanliang567 modified the milestones: 2.4.18, 2.4.19, 2.4.20 Dec 24, 2024
@yanliang567 yanliang567 modified the milestones: 2.4.20, 2.4.21 Jan 6, 2025
@zhuwenxing
Contributor Author

still reproduced in master-20250106-f0cddfd1-amd64
failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/20053/pipeline

log:
artifacts-one-pod-standalone-pod-kill-20053-server-logs.tar.gz

pod info

one-pod-standalone-pod-kill-20053-milvus-standalone-548967nwr4v   1/1     Running                  0                  19m     10.104.23.41    4am-node27   <none>           <none>
