[Bug]: StreamingNode crashed after Milvus recovered from the mixed coordinator pod kill chaos test #39888

Closed
@zhuwenxing

Description

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20250213-dccba87f-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2025/02/14 01:27:06.023 +00:00] [INFO] [flusherimpl/wal_flusher.go:128] ["data coord client ready"] [module=streamingnode] [component=flusher] [pchannel=by-dev-rootcoord-dml_0]
[2025/02/14 01:27:06.023 +00:00] [INFO] [syncmgr/sync_manager.go:70] ["sync manager initialized"] [initPoolSize=256]
[2025/02/14 01:27:06.023 +00:00] [INFO] [flusherimpl/wal_flusher.go:121] ["fetch recovery info done"] [module=streamingnode] [component=flusher] [pchannel=by-dev-rootcoord-dml_7] [recoveryInfoNum=3]
[2025/02/14 01:27:06.023 +00:00] [INFO] [flusherimpl/wal_flusher.go:128] ["data coord client ready"] [module=streamingnode] [component=flusher] [pchannel=by-dev-rootcoord-dml_7]
[2025/02/14 01:27:06.023 +00:00] [INFO] [syncmgr/sync_manager.go:70] ["sync manager initialized"] [initPoolSize=256]
[2025/02/14 01:27:06.023 +00:00] [INFO] [flusherimpl/wal_flusher.go:121] ["fetch recovery info done"] [module=streamingnode] [component=flusher] [pchannel=by-dev-rootcoord-dml_10] [recoveryInfoNum=3]
[2025/02/14 01:27:06.023 +00:00] [INFO] [flusherimpl/wal_flusher.go:128] ["data coord client ready"] [module=streamingnode] [component=flusher] [pchannel=by-dev-rootcoord-dml_10]
[2025/02/14 01:27:06.024 +00:00] [INFO] [syncmgr/sync_manager.go:70] ["sync manager initialized"] [initPoolSize=256]

SIGNAL CATCH BY NON-GO SIGNAL HANDLER
SIGNO: 11; SIGNAME: Segmentation fault; SI_CODE: 1; SI_ADDR: 0x28
BACKTRACE:
I20250214 01:27:08.028440    83 MinioChunkManager.cpp:225] [SERVER][PreCheck][milvus][]start to precheck chunk manager with configuration: [address=mixcoord-pod-kill-20295-minio:9000, bucket_name=milvus-bucket, root_path=file, storage_type=remote, cloud_provider=aws, iam_endpoint=, log_level=fatal, region=, useSSL=false, sslCACert=19, useIAM=false, useVirtualHost=false, requestTimeoutMs=10000, gcp_native_without_auth=false]
I20250214 01:27:08.036345    83 ChunkManager.cpp:112] [SERVER][AwsChunkManager][milvus][]init AwsChunkManager with parameter[endpoint=mixcoord-pod-kill-20295-minio:9000][bucket_name=milvus-bucket][root_path=file][use_secure=false]
github.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl.recoverPChannelCheckpointManager
	/workspace/source/internal/streamingnode/server/flusher/flusherimpl/pchannel_checkpoint.go:34 pc=0x60d4592


[2025/02/14 01:27:10.861 +00:00] [WARN] [etcd/etcd_kv.go:663] ["Slow etcd operation load"] ["time spent"=4.839121788s] [key=by-dev/meta/streamingnode-meta/wal/by-dev-rootcoord-dml_2/consume-checkpoint]

SIGNAL CATCH BY NON-GO SIGNAL HANDLER
SIGNO: 11; SIGNAME: Segmentation fault; SI_CODE: 1; SI_ADDR: 0x28
BACKTRACE:
github.com/milvus-io/milvus/internal/streamingnode/server/flusher/flusherimpl.recoverPChannelCheckpointManager
	/workspace/source/internal/streamingnode/server/flusher/flusherimpl/pchannel_checkpoint.go:34 pc=0x60d4592


[2025/02/14 01:27:10.862 +00:00] [INFO] [flusherimpl/wal_flusher.go:52] ["wal flusher stop"] [module=streamingnode] [component=flusher] [pchannel=by-dev-rootcoord-dml_12]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x60d4592]

Expected Behavior

No response

Steps To Reproduce

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/20295/pipeline
log:

artifacts-mixcoord-pod-kill-20295-server-logs.tar.gz

cluster: 4am
ns: chaos-testing
pod info

[2025-02-14T01:14:47.881Z] + kubectl get pods -o wide
[2025-02-14T01:14:47.883Z] + grep mixcoord-pod-kill-20295
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-etcd-0                                    1/1     Running            0                37m     10.104.26.132   4am-node32   <none>           <none>
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-etcd-1                                    1/1     Running            0                37m     10.104.19.162   4am-node28   <none>           <none>
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-etcd-2                                    1/1     Running            0                37m     10.104.24.46    4am-node29   <none>           <none>
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-milvus-datanode-694587bffd-5cj57          1/1     Running            2 (37m ago)      37m     10.104.23.175   4am-node27   <none>           <none>
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-milvus-datanode-694587bffd-sgcd8          1/1     Running            2 (37m ago)      37m     10.104.16.195   4am-node21   <none>           <none>
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-milvus-indexnode-7c4fc4f765-5vt2j         1/1     Running            2 (37m ago)      37m     10.104.17.48    4am-node23   <none>           <none>
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-milvus-indexnode-7c4fc4f765-62994         1/1     Running            2 (37m ago)      37m     10.104.23.176   4am-node27   <none>           <none>
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-milvus-indexnode-7c4fc4f765-n7wtg         1/1     Running            2 (37m ago)      37m     10.104.30.98    4am-node38   <none>           <none>
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-milvus-mixcoord-7f9979c575-xwqbz          1/1     Running            2 (37m ago)      37m     10.104.30.93    4am-node38   <none>           <none>
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-milvus-proxy-7784c5c47-kmvrs              1/1     Running            2 (37m ago)      37m     10.104.21.172   4am-node24   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-milvus-querynode-bd884bdf5-52kx2          1/1     Running            2 (37m ago)      37m     10.104.17.47    4am-node23   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-milvus-querynode-bd884bdf5-djmnh          1/1     Running            2 (37m ago)      37m     10.104.25.167   4am-node30   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-milvus-querynode-bd884bdf5-wj2m8          1/1     Running            2 (37m ago)      37m     10.104.21.171   4am-node24   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-milvus-streamingnode-7c756d75b5-s8whr     0/1     CrashLoopBackOff   11 (3m18s ago)   37m     10.104.23.174   4am-node27   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-minio-0                                   1/1     Running            0                37m     10.104.26.133   4am-node32   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-minio-1                                   1/1     Running            0                37m     10.104.19.163   4am-node28   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-minio-2                                   1/1     Running            0                37m     10.104.33.91    4am-node36   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-minio-3                                   1/1     Running            0                37m     10.104.15.101   4am-node20   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-bookie-0                         1/1     Running            0                37m     10.104.26.131   4am-node32   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-bookie-1                         1/1     Running            0                37m     10.104.24.43    4am-node29   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-bookie-2                         1/1     Running            0                37m     10.104.15.100   4am-node20   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-bookie-init-h4sgm                0/1     Completed          0                37m     10.104.14.236   4am-node18   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-broker-0                         1/1     Running            0                37m     10.104.14.238   4am-node18   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-broker-1                         1/1     Running            0                37m     10.104.26.124   4am-node32   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-proxy-0                          1/1     Running            0                37m     10.104.26.123   4am-node32   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-proxy-1                          1/1     Running            0                37m     10.104.14.237   4am-node18   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-pulsar-init-ztslb                0/1     Completed          0                37m     10.104.14.232   4am-node18   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-recovery-0                       1/1     Running            0                37m     10.104.14.230   4am-node18   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-zookeeper-0                      1/1     Running            0                37m     10.104.19.160   4am-node28   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-zookeeper-1                      1/1     Running            0                37m     10.104.26.130   4am-node32   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-zookeeper-2                      1/1     Running            0                37m     10.104.24.44    4am-node29   <none>           <none>
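
For triage, the one failing pod can be pulled out of a listing like the one above with a short script. This is only a sketch, not part of the chaos-test tooling: the function name `unhealthy_pods` and the `SAMPLE` excerpt are hypothetical, and it assumes each row starts with the Jenkins timestamp prefix followed by the usual `kubectl get pods -o wide` columns (NAME, READY, STATUS, ...).

```python
# Hypothetical helper: flag pods that are not fully ready in a
# Jenkins-prefixed `kubectl get pods -o wide` listing.
SAMPLE = """\
[2025-02-14T01:14:47.881Z] + kubectl get pods -o wide
[2025-02-14T01:14:47.883Z] + grep mixcoord-pod-kill-20295
[2025-02-14T01:14:47.883Z] mixcoord-pod-kill-20295-milvus-mixcoord-7f9979c575-xwqbz          1/1     Running            2 (37m ago)      37m     10.104.30.93    4am-node38   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-milvus-streamingnode-7c756d75b5-s8whr     0/1     CrashLoopBackOff   11 (3m18s ago)   37m     10.104.23.174   4am-node27   <none>           <none>
[2025-02-14T01:14:47.884Z] mixcoord-pod-kill-20295-pulsarv3-bookie-init-h4sgm                0/1     Completed          0                37m     10.104.14.236   4am-node18   <none>           <none>
"""


def unhealthy_pods(listing):
    """Return (name, ready, status) for pods that are neither fully ready nor Completed."""
    bad = []
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines and the echoed shell commands ("+ kubectl ...", "+ grep ...").
        if len(fields) < 4 or fields[1] == "+":
            continue
        # fields[0] is the Jenkins timestamp; the pod columns follow it.
        name, ready, status = fields[1], fields[2], fields[3]
        if "/" not in ready:
            continue
        have, want = ready.split("/")
        # Finished init jobs legitimately report 0/1 with status Completed.
        if status != "Completed" and have != want:
            bad.append((name, ready, status))
    return bad


if __name__ == "__main__":
    for name, ready, status in unhealthy_pods(SAMPLE):
        print(name, ready, status)
```

Run against the listing in this issue, the only pod it should flag is the streamingnode pod in `CrashLoopBackOff`; the two `Completed` init pods are skipped on purpose.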

Anything else?

No response

Metadata

Labels

- feature/streaming node: streaming node feature
- kind/bug: Issues or changes related a bug
- priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
- severity/critical: Critical, lead to crash, data missing, wrong result, function totally doesn't work.
- test/chaos: chaos test
- triage/accepted: Indicates an issue or PR is ready to be actively worked on.
