[Bug]: The 4c16g query node still experienced OOM issues even after memory protection was set up with a low water level of 0.75 and a high water level of 0.85. #39866

Open
@zhuwenxing

Description

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:master-20250213-dccba87f-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Two of the three query nodes keep running out of memory (OOM) and are stuck in CrashLoopBackOff:

❯ k get pod|grep fts-stable-test-13
fts-stable-test-13-etcd-0                                         1/1     Running            0                163m
fts-stable-test-13-etcd-1                                         1/1     Running            0                163m
fts-stable-test-13-etcd-2                                         1/1     Running            0                163m
fts-stable-test-13-kafka-0                                        2/2     Running            1 (162m ago)     163m
fts-stable-test-13-kafka-1                                        2/2     Running            0                163m
fts-stable-test-13-kafka-2                                        2/2     Running            0                163m
fts-stable-test-13-kafka-exporter-79c8654c6d-xx7gd                1/1     Running            3 (162m ago)     163m
fts-stable-test-13-milvus-datanode-7596f777cd-vchhw               1/1     Running            2 (162m ago)     163m
fts-stable-test-13-milvus-datanode-7596f777cd-vv5nj               1/1     Running            2 (162m ago)     163m
fts-stable-test-13-milvus-indexnode-84c577bd8c-5vn8c              1/1     Running            2 (162m ago)     163m
fts-stable-test-13-milvus-indexnode-84c577bd8c-rqv8m              1/1     Running            2 (162m ago)     163m
fts-stable-test-13-milvus-mixcoord-74f4998554-rjkzg               1/1     Running            2 (162m ago)     163m
fts-stable-test-13-milvus-proxy-7c4dd9bb48-cfsrj                  1/1     Running            2 (162m ago)     163m
fts-stable-test-13-milvus-querynode-7d4ff9657-7v9hd               1/1     Running            2 (162m ago)     163m
fts-stable-test-13-milvus-querynode-7d4ff9657-9fzns               0/1     CrashLoopBackOff   21 (74s ago)     163m
fts-stable-test-13-milvus-querynode-7d4ff9657-flvww               0/1     CrashLoopBackOff   20 (2m43s ago)   163m
fts-stable-test-13-minio-0                                        1/1     Running            0                163m
fts-stable-test-13-minio-1                                        1/1     Running            0                163m
fts-stable-test-13-minio-2                                        1/1     Running            0                163m
fts-stable-test-13-minio-3                                        1/1     Running            0                163m
fts-stable-test-13-zookeeper-0                                    1/1     Running            0                163m
fts-stable-test-13-zookeeper-1                                    1/1     Running            0                163m
fts-stable-test-13-zookeeper-2                                    1/1     Running            0                163m

[2025-02-13T08:32:49.209Z] extraConfigFiles:
[2025-02-13T08:32:49.209Z]   user.yaml: |+
[2025-02-13T08:32:49.209Z]     dataCoord:
[2025-02-13T08:32:49.209Z]       compaction:
[2025-02-13T08:32:49.209Z]         indexBasedCompaction: false
[2025-02-13T08:32:49.209Z]     indexCoord:
[2025-02-13T08:32:49.209Z]       scheduler:
[2025-02-13T08:32:49.209Z]         interval: 100
[2025-02-13T08:32:49.209Z]     queryNode:
[2025-02-13T08:32:49.209Z]         mmap:
[2025-02-13T08:32:49.209Z]           vectorField: true
[2025-02-13T08:32:49.209Z]           vectorIndex: true
[2025-02-13T08:32:49.209Z]           scalarField: true
[2025-02-13T08:32:49.209Z]           scalarIndex: true
[2025-02-13T08:32:49.209Z]     quotaAndLimits:
[2025-02-13T08:32:49.209Z]       limitWriting:
[2025-02-13T08:32:49.209Z]         memProtection:
[2025-02-13T08:32:49.209Z]           dataNodeMemoryLowWaterLevel: 0.75
[2025-02-13T08:32:49.209Z]           dataNodeMemoryHighWaterLevel: 0.85
[2025-02-13T08:32:49.209Z]           queryNodeMemoryLowWaterLevel: 0.75
[2025-02-13T08:32:49.209Z]           queryNodeMemoryHighWaterLevel: 0.85
[2025-02-13T08:32:49.209Z]     trace:
[2025-02-13T08:32:49.209Z]       exporter: jaeger
[2025-02-13T08:32:49.209Z]       sampleFraction: 1
[2025-02-13T08:32:49.209Z]       jaeger:
[2025-02-13T08:32:49.214Z]         url: http://tempo-distributor.tempo:14268/api/traces
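
For reference, the two query node water levels in this config can be turned into absolute thresholds as sketched below. This is a minimal illustration, not Milvus code. It assumes the ratios are applied to the 16 GiB of the 4c16g query node, and the helper names (`water_levels`, `write_rate_factor`) and the linear slow-down between the two levels are hypothetical. Because the keys sit under quotaAndLimits.limitWriting, the sketch only models throttling of writes; it does not model memory used by loading segments or serving queries.

```python
GIB = 1024 ** 3


def water_levels(total_bytes, low=0.75, high=0.85):
    """Return the (low, high) water levels in bytes for a node.

    Assumption: the ratios are applied to the node's total memory.
    """
    return total_bytes * low, total_bytes * high


def write_rate_factor(used_bytes, total_bytes, low=0.75, high=0.85):
    """Hypothetical write throttle driven by memory usage.

    Returns 1.0 (no throttling) at or below the low water level,
    0.0 (writes rejected) at or above the high water level, and a
    linearly decreasing factor in between. The linear shape is an
    assumption made for illustration only.
    """
    if not 0 < low < high <= 1:
        raise ValueError("expected 0 < low < high <= 1")
    usage = used_bytes / total_bytes
    if usage <= low:
        return 1.0
    if usage >= high:
        return 0.0
    return (high - usage) / (high - low)


if __name__ == "__main__":
    total = 16 * GIB  # the "16g" of the 4c16g query node
    low_bytes, high_bytes = water_levels(total)
    print(f"low water level:  {low_bytes / GIB:.1f} GiB")
    print(f"high water level: {high_bytes / GIB:.1f} GiB")
    for used_gib in (8, 12.8, 14):
        factor = write_rate_factor(used_gib * GIB, total)
        print(f"used {used_gib} GiB -> write rate factor {factor:.2f}")
```

Under these assumptions the low and high water levels come to 12 GiB and 13.6 GiB on a 16 GiB node, so write throttling would only begin once usage is already above 12 GiB.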

Expected Behavior

The query nodes should not run out of memory when memory protection is configured with these water levels.

Steps To Reproduce

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/full%20text%20search%20stable%20test/detail/full%20text%20search%20stable%20test/13/pipeline
log:

artifacts-fts-stable-test-13-server-logs.tar.gz

cluster: 4am
ns: chaos-testing
pod info

[2025-02-13T11:18:03.077Z] + kubectl get pods -o wide
[2025-02-13T11:18:03.083Z] + grep fts-stable-test-13
[2025-02-13T11:18:03.342Z] fts-stable-test-13-etcd-0                                         1/1     Running            0                164m    10.104.15.118   4am-node20   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-etcd-1                                         1/1     Running            0                164m    10.104.24.106   4am-node29   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-etcd-2                                         1/1     Running            0                164m    10.104.26.95    4am-node32   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-kafka-0                                        2/2     Running            1 (164m ago)     164m    10.104.26.90    4am-node32   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-kafka-1                                        2/2     Running            0                164m    10.104.24.107   4am-node29   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-kafka-2                                        2/2     Running            0                164m    10.104.16.5     4am-node21   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-kafka-exporter-79c8654c6d-xx7gd                1/1     Running            3 (164m ago)     164m    10.104.23.236   4am-node27   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-milvus-datanode-7596f777cd-vchhw               1/1     Running            2 (164m ago)     164m    10.104.21.9     4am-node24   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-milvus-datanode-7596f777cd-vv5nj               1/1     Running            2 (164m ago)     164m    10.104.30.218   4am-node38   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-milvus-indexnode-84c577bd8c-5vn8c              1/1     Running            2 (164m ago)     164m    10.104.23.238   4am-node27   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-milvus-indexnode-84c577bd8c-rqv8m              1/1     Running            2 (164m ago)     164m    10.104.25.224   4am-node30   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-milvus-mixcoord-74f4998554-rjkzg               1/1     Running            2 (164m ago)     164m    10.104.30.219   4am-node38   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-milvus-proxy-7c4dd9bb48-cfsrj                  1/1     Running            2 (164m ago)     164m    10.104.23.235   4am-node27   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-milvus-querynode-7d4ff9657-7v9hd               1/1     Running            2 (164m ago)     164m    10.104.23.237   4am-node27   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-milvus-querynode-7d4ff9657-9fzns               0/1     CrashLoopBackOff   21 (3m ago)      164m    10.104.30.220   4am-node38   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-milvus-querynode-7d4ff9657-flvww               0/1     CrashLoopBackOff   20 (4m29s ago)   164m    10.104.32.80    4am-node39   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-minio-0                                        1/1     Running            0                164m    10.104.15.119   4am-node20   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-minio-1                                        1/1     Running            0                164m    10.104.24.105   4am-node29   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-minio-2                                        1/1     Running            0                164m    10.104.26.94    4am-node32   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-minio-3                                        1/1     Running            0                164m    10.104.17.29    4am-node23   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-zookeeper-0                                    1/1     Running            0                164m    10.104.15.117   4am-node20   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-zookeeper-1                                    1/1     Running            0                164m    10.104.26.93    4am-node32   <none>           <none>
[2025-02-13T11:18:03.342Z] fts-stable-test-13-zookeeper-2                                    1/1     Running            0                164m    10.104.24.109   4am-node29   <none>           <none>

Anything else?

No response

Labels

- kind/bug: Issues or changes related to a bug.
- priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
- severity/critical: Critical, lead to crash, data missing, wrong result, function totally doesn't work.
- triage/accepted: Indicates an issue or PR is ready to be actively worked on.
