[Bug]: [Nightly] Milvus deployment failed because of pods crash #37404

Closed
1 task done
NicoYuan1986 opened this issue Nov 4, 2024 · 13 comments
Assignees
Labels
kind/bug Issues or changes related a bug triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@NicoYuan1986
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master(0449c74)
- Deployment mode(standalone or cluster):cluster&&standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Milvus deployment failed because of pods crash.
link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/167/pipeline/151

[milvus-install : deploy] Failed
task check-install-status has failed: "step-check-status" exited with code 1 (image: "harbor.milvus.io/milvusdb/bitnami/kubectl@sha256:95276c7786df8cc7aef7ed874b56084b1c7047858f82c2db0bca5f5f4eb5d9df"); for logs run: kubectl -n milvus-tekton logs milvus-pytest-6nh45-check-install-status-pod -c step-check-status
[check-install-status : check-status] ----------------Pod Status --------------------------------------------
[check-install-status : check-status] mdk-master-167-py-n-etcd-0                                  1/1     Running            0               13m    10.104.15.158   4am-node20   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-etcd-1                                  1/1     Running            0               13m    10.104.23.153   4am-node27   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-etcd-2                                  1/1     Running            0               13m    10.104.30.24    4am-node38   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-kafka-0                                 2/2     Running            3 (12m ago)     13m    10.104.33.63    4am-node36   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-kafka-1                                 2/2     Running            2 (12m ago)     13m    10.104.15.161   4am-node20   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-kafka-2                                 2/2     Running            2 (12m ago)     13m    10.104.24.253   4am-node29   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-kafka-exporter-556fc7f678-7ztp4         1/1     Running            5 (12m ago)     13m    10.104.6.11     4am-node13   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-milvus-datanode-59cdd5b54b-9ltz4        0/1     CrashLoopBackOff   7 (73s ago)     13m    10.104.5.13     4am-node12   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-milvus-datanode-59cdd5b54b-jrwjd        0/1     CrashLoopBackOff   7 (73s ago)     13m    10.104.6.9      4am-node13   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-milvus-indexnode-86bfbdb68c-77sgj       0/1     CrashLoopBackOff   7 (67s ago)     13m    10.104.6.10     4am-node13   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-milvus-indexnode-86bfbdb68c-7m8gz       0/1     CrashLoopBackOff   7 (70s ago)     13m    10.104.14.184   4am-node18   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-milvus-mixcoord-84b744c9bd-hqljz        0/1     CrashLoopBackOff   7 (109s ago)    13m    10.104.6.12     4am-node13   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-milvus-proxy-7dcdb9c558-4tgmh           0/1     CrashLoopBackOff   7 (86s ago)     13m    10.104.6.8      4am-node13   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-milvus-proxy-7dcdb9c558-ntzvp           0/1     CrashLoopBackOff   7 (70s ago)     13m    10.104.14.185   4am-node18   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-milvus-querynode-5475446485-vs2h9       0/1     CrashLoopBackOff   7 (63s ago)     13m    10.104.14.186   4am-node18   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-milvus-querynode-5475446485-zzxdc       0/1     CrashLoopBackOff   7 (69s ago)     13m    10.104.6.13     4am-node13   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-minio-6f56744d6d-hzrpj                  1/1     Running            0               13m    10.104.30.21    4am-node38   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-zookeeper-0                             1/1     Running            0               13m    10.104.16.155   4am-node21   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-zookeeper-1                             1/1     Running            0               13m    10.104.23.154   4am-node27   <none>           <none>
[check-install-status : check-status] mdk-master-167-py-n-zookeeper-2                             1/1     Running            0               13m    10.104.15.162   4am-node20   <none>           <none>

panic log:

2024-11-03T02:03:28.495306652+08:00 stdout F [2024/11/02 18:03:28.495 +00:00] [ERROR] [components/query_node.go:60] ["QueryNode starts error"] [error="context deadline exceeded"] [stack="github.com/milvus-io/milvus/cmd/components.(*QueryNode).Run\n\t/workspace/source/cmd/components/query_node.go:60\ngithub.com/milvus-io/milvus/cmd/roles.runComponent[...].func1\n\t/workspace/source/cmd/roles/roles.go:129"]
2024-11-03T02:03:28.49747112+08:00 stderr F panic: context deadline exceeded
2024-11-03T02:03:28.497484485+08:00 stderr F
2024-11-03T02:03:28.497489948+08:00 stderr F goroutine 231 gp=0xc001803500 m=18 mp=0xc001800808 [running]:
2024-11-03T02:03:28.497494659+08:00 stderr F panic({0x63eff60?, 0x9efa280?})
2024-11-03T02:03:28.497499328+08:00 stderr F    /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-amd64/src/runtime/panic.go:779 +0x158 fp=0xc001da3f70 sp=0xc001da3ec0 pc=0x1f82b18
2024-11-03T02:03:28.497502897+08:00 stderr F github.com/milvus-io/milvus/cmd/roles.runComponent[...].func1()
2024-11-03T02:03:28.497507757+08:00 stderr F    /workspace/source/cmd/roles/roles.go:130 +0x128 fp=0xc001da3fe0 sp=0xc001da3f70 pc=0x5cb7ea8
2024-11-03T02:03:28.497512032+08:00 stderr F runtime.goexit({})
2024-11-03T02:03:28.497516923+08:00 stderr F    /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-amd64/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc001da3fe8 sp=0xc001da3fe0 pc=0x1fc1981
2024-11-03T02:03:28.497520089+08:00 stderr F created by github.com/milvus-io/milvus/cmd/roles.runComponent[...] in goroutine 1
2024-11-03T02:03:28.497524327+08:00 stderr F    /workspace/source/cmd/roles/roles.go:118 +0x129
2024-11-03T02:03:28.497528931+08:00 stderr F
2024-11-03T02:03:28.49753247+08:00 stderr F goroutine 1 gp=0xc0000061c0 m=nil [semacquire]:
2024-11-03T02:03:28.497536373+08:00 stderr F runtime.gopark(0x128f9d24f5fbe04?, 0x120?, 0x0?, 0x0?, 0x120?)
2024-11-03T02:03:28.497550587+08:00 stderr F    /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-amd64/src/runtime/proc.go:402 +0xce fp=0xc0021c15b8 sp=0xc0021c1598 pc=0x1f8728e
2024-11-03T02:03:28.497561532+08:00 stderr F runtime.goparkunlock(...)

server log: artifacts-milvus-distributed-kafka-mdk-master-167-py-n-167-e2e-logs.tar.gz

Expected Behavior

Milvus deploys successfully.

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

@NicoYuan1986 NicoYuan1986 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2024
@NicoYuan1986 NicoYuan1986 added this to the 2.5.0 milestone Nov 4, 2024
@NicoYuan1986
Contributor Author

I will try to fix this by updating the Milvus Helm chart version to 4.2.18 (it is currently 4.2.8).

@NicoYuan1986
Contributor Author

#37406

@yanliang567
Contributor

yanliang567 commented Nov 4, 2024

This may be the same as #37402: the address is null.

2024-11-03T02:11:02.374889923+08:00 stdout F [2024/11/02 18:11:02.374 +00:00] [INFO] [rootcoord/root_coord.go:159] ["update rootcoord state"] [state=Abnormal]
2024-11-03T02:11:02.375166299+08:00 stdout F [2024/11/02 18:11:02.375 +00:00] [INFO] [rootcoord/service.go:154] ["RootCoord listen on"] [address="[::]:19530"] [port=19530]
2024-11-03T02:11:02.375205971+08:00 stdout F [2024/11/02 18:11:02.375 +00:00] [INFO] [rootcoord/service.go:196] ["init params done.."]
2024-11-03T02:11:02.375338359+08:00 stdout F [2024/11/02 18:11:02.375 +00:00] [INFO] [etcd/etcd_util.go:52] ["create etcd client"] [useEmbedEtcd=false] [useSSL=false] [endpoints="[localhost:2379]"] [minVersion=1.3]
2024-11-03T02:11:02.375990525+08:00 stdout F [2024/11/02 18:11:02.375 +00:00] [WARN] [datacoord/service.go:95] ["DataCoord fail to create net listener"] [error="listen tcp :19530: bind: address already in use"]
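
The bind failure in the log above can be reproduced in isolation. The following is a minimal sketch of the symptom only, not Milvus code: `listenOn` and `bindTwice` are hypothetical helpers, and the sketch assumes that two components in one process end up with an empty host and the same port. An empty host makes Go listen on all interfaces, which matches the `[::]:19530` address that RootCoord reports.

```go
package main

import (
	"fmt"
	"net"
)

// listenOn is a hypothetical helper that mimics a component opening its
// listener on host:port. An empty host means "all interfaces".
func listenOn(host string, port int) (net.Listener, error) {
	return net.Listen("tcp", fmt.Sprintf("%s:%d", host, port))
}

// bindTwice opens a listener on a free port chosen by the OS, then tries to
// bind the same port again while the first listener is still open. It returns
// the error of whichever bind fails, or nil if both succeed.
func bindTwice() error {
	first, err := listenOn("", 0)
	if err != nil {
		return err
	}
	defer first.Close()

	// Reuse the exact port the first listener received.
	port := first.Addr().(*net.TCPAddr).Port
	second, err := listenOn("", port)
	if err != nil {
		return err
	}
	second.Close()
	return nil
}

func main() {
	fmt.Println("second bind result:", bindTwice())
}
```

The second bind fails regardless of which process holds the port, so the message alone does not tell whether another process or another component in the same process is the owner.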

@yanliang567
Contributor

@congqixia & @LoveEachDay are working on it.

/assign @congqixia
/unassign

@sre-ci-robot sre-ci-robot assigned congqixia and unassigned yanliang567 Nov 4, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Nov 4, 2024
congqixia added a commit to congqixia/milvus that referenced this issue Nov 4, 2024
See also milvus-io#37404 milvus-io#37402

IP address in paramtable need validation and fail fast with reasonable
error message

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@xiaofan-luan
Collaborator

Why does this need a fix?

This is just `listen tcp :19530: bind: address already in use`,

which means port 19530 is already in use by another process.

@xiaofan-luan
Collaborator

I don't think there is an issue here.

@yanliang567
Contributor

The error message is a bit confusing, but the root cause here is that the IP address was parsed as null from an empty config.

@yanliang567
Contributor

I can now deploy Milvus manually with PR #37418, but the nightly run still fails during deployment on the same commit. @congqixia, do we still need your fix PR above?

@congqixia
Contributor

@yanliang567 The root cause should be the misbehavior of the YAML parser. #37418 should handle the null case for YAML configuration. I will check why the nightly failed with this patch.

@congqixia
Contributor

congqixia commented Nov 5, 2024

> Why does this need a fix?
>
> This is just `listen tcp :19530: bind: address already in use`,
>
> which means port 19530 is already in use by another process.

@xiaofan-luan

If Milvus fails to discover a viable IP address and put it in the session, other components can never connect to the coordinators, and vice versa.
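
The "validate and fail fast" idea from the referenced commit can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual paramtable change: `validateIP` is a hypothetical helper, and rejecting the unspecified address (`0.0.0.0` or `::`) is an extra assumption, made because peers cannot dial such an address when they read it from a session.

```go
package main

import (
	"fmt"
	"net"
)

// validateIP is a hypothetical helper, not the Milvus paramtable API.
// It rejects values that other components could never connect to, so the
// process can stop at startup with a readable message.
func validateIP(raw string) (net.IP, error) {
	if raw == "" {
		return nil, fmt.Errorf("ip address is empty; check the network configuration")
	}
	// net.ParseIP returns nil for anything that is not a literal IPv4 or
	// IPv6 address, including the string "null".
	ip := net.ParseIP(raw)
	if ip == nil {
		return nil, fmt.Errorf("ip address %q is not a valid IPv4 or IPv6 address", raw)
	}
	// Assumption: an unspecified address is not usable for registration.
	if ip.IsUnspecified() {
		return nil, fmt.Errorf("ip address %q is unspecified and cannot be dialed by peers", raw)
	}
	return ip, nil
}

func main() {
	for _, raw := range []string{"", "null", "0.0.0.0", "10.104.6.12"} {
		ip, err := validateIP(raw)
		fmt.Printf("%q -> ip=%v err=%v\n", raw, ip, err)
	}
}
```

Checking the address once at startup would turn a later, misleading bind or connection error into an immediate configuration error.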

congqixia added a commit to congqixia/milvus that referenced this issue Nov 5, 2024
See also milvus-io#37404 milvus-io#37402

IP address in paramtable need validation and fail fast with reasonable
error message

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
congqixia added a commit to congqixia/milvus that referenced this issue Nov 5, 2024
Related to milvus-io#37404

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@congqixia
Contributor

The root cause was a behavior change in YAML parsing.
A YAML file with no content other than # comments (so its length is not zero) now makes the parser report an error, which in turn prevents the other YAML files from being loaded.
#37445 skips the EOF error to fix this problem.
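
The failure mode and the fix described above can be sketched with the standard library alone. Everything here is hypothetical: `parseFlatYAML` is a toy stand-in for a real YAML decoder that only understands flat `key: value` lines and is assumed to return `io.EOF` for comment-only input, and `loadAll` is an illustrative loader, not the code changed in #37445.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// parseFlatYAML is a hypothetical stand-in for a YAML decoder. It returns
// io.EOF when the input holds only blank lines and # comments.
func parseFlatYAML(content string) (map[string]string, error) {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		key, value, ok := strings.Cut(line, ":")
		if !ok {
			return nil, fmt.Errorf("malformed line %q", line)
		}
		out[strings.TrimSpace(key)] = strings.TrimSpace(value)
	}
	if err := sc.Err(); err != nil {
		return nil, err
	}
	if len(out) == 0 {
		return nil, io.EOF
	}
	return out, nil
}

// loadAll merges config files in order, later files overriding earlier ones.
// With skipEOF set, a comment-only file is ignored instead of aborting the load.
func loadAll(files []string, skipEOF bool) (map[string]string, error) {
	merged := map[string]string{}
	for _, content := range files {
		kv, err := parseFlatYAML(content)
		if err != nil {
			if skipEOF && errors.Is(err, io.EOF) {
				continue
			}
			return nil, err
		}
		for k, v := range kv {
			merged[k] = v
		}
	}
	return merged, nil
}

func main() {
	files := []string{"# user overrides, intentionally empty\n", "port: 19530\n"}
	_, err := loadAll(files, false)
	fmt.Println("without skipping EOF:", err)
	cfg, err := loadAll(files, true)
	fmt.Println("with skipping EOF:", cfg, err)
}
```

Only `io.EOF` is tolerated in this sketch; any other parse error still aborts the load, so a genuinely malformed file is not silently ignored.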

sre-ci-robot pushed a commit that referenced this issue Nov 5, 2024
Related to #37404

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
@congqixia
Contributor

congqixia commented Nov 6, 2024

It seems that the last nightly run started successfully. Could you please verify?
/unassign
/assign @NicoYuan1986

@NicoYuan1986
Contributor Author

The issue has been fixed. Thanks for all your help ~ 🌈

congqixia added a commit to congqixia/milvus that referenced this issue Nov 6, 2024
See also milvus-io#37404 milvus-io#37402

IP address in paramtable need validation and fail fast with reasonable
error message

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Nov 7, 2024
See also #37404 #37402

IP address in paramtable need validation and fail fast with reasonable
error message

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
congqixia added a commit to congqixia/milvus that referenced this issue Nov 7, 2024
See also milvus-io#37404 milvus-io#37402

IP address in paramtable need validation and fail fast with reasonable
error message

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Nov 11, 2024
Cherry-pick from master
pr: #37416
See also #37404 #37402

IP address in paramtable need validation and fail fast with reasonable
error message

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>