[Bug]: datanode memory usage increased to 150GB when there are 50m vectors to be flushed #26177

Closed
yanliang567 opened this issue Aug 7, 2023 · 18 comments

@yanliang567
Contributor

yanliang567 commented Aug 7, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: master-20230805-241117dd
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

I scaled the datanode down to 0 and inserted 50m_768d vectors. Then I scaled the datanode back up to 1, and the datanode memory usage increased to 150GB within 15 minutes.
(screenshot)

Expected Behavior

The baseline on master-20230802-df26b909: datanode memory usage stays around 1.3-2.3GB for the same volume of vectors.

Steps To Reproduce

1. Create a collection with 20k_768d vectors and build an HNSW index
2. Scale the datanode down to 0
3. Insert 50m_768d vectors
4. Scale the datanode back up to 1
5. Wait and check the tt lag, datanode CPU and memory

Milvus Log

pod names on devops:

yanliang-ttlag-milvus-datanode-cbf79cbdc-bx4h6                  1/1     Running       0               34m     10.102.7.245    devops-node11   <none>           <none>
yanliang-ttlag-milvus-indexnode-6699c566d7-9l49n                1/1     Running       2 (2m16s ago)   6h54m   10.102.7.231    devops-node11   <none>           <none>
yanliang-ttlag-milvus-mixcoord-987654d85-pfzg2                  1/1     Running       0               6h54m   10.102.7.238    devops-node11   <none>           <none>
yanliang-ttlag-milvus-proxy-df7b5955f-5twjd                     1/1     Running       0               6h54m   10.102.7.239    devops-node11   <none>           <none>
yanliang-ttlag-milvus-querynode-76cf9c9b55-rcwx9                1/1     Running       0               6h54m   10.102.7.232    devops-node11   <none>           <none>

Anything else?

The suspected PR: #26144

@yanliang567 yanliang567 added kind/bug Issues or changes related to a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 7, 2023
@yanliang567 yanliang567 self-assigned this Aug 7, 2023
@yanliang567
Contributor Author

/assign @congqixia
/unassign

@sre-ci-robot sre-ci-robot assigned congqixia and unassigned yanliang567 Aug 7, 2023
@yanliang567 yanliang567 added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Aug 7, 2023
@yanliang567 yanliang567 added this to the 2.3 milestone Aug 7, 2023
@congqixia
Contributor

From the pprof, there are lots of msg packs buffered in memory
(screenshot)

There are some channels that are too large, which could cause this problem:

  • MsgStream buffers (mq buffer & receive buffer) 1024*2
  • Flowgraph node buffers (input node -> dd node -> insert buffer node) 1024*2

Under high read pressure, all channels will be full, which will lead to a 102448MB memory cost.

And the flush manager will buffer flush tasks as well, which will multiply this memory cost.
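
To make the buffering math concrete, here is a minimal Go sketch with all names and sizes invented for illustration (this is not Milvus's actual code): two chained buffered channels stand in for the MsgStream and flowgraph buffers, and a stalled flush consumer lets every stage fill to capacity.

package main

import (
	"fmt"
	"time"
)

// pack stands in for a msg pack; the payload is only 1 KB here so the sketch
// is safe to run, whereas real insert msg packs are megabytes each.
type pack struct{ payload []byte }

func main() {
	const depth = 1024 * 2 // buffer depth cited in the analysis above

	// Two chained buffered stages, e.g. MsgStream receive buffer -> flowgraph input.
	mqBuf := make(chan pack, depth)
	fgBuf := make(chan pack, depth)

	// Producer: keeps pulling from the MQ with no backpressure from flush.
	go func() {
		for {
			mqBuf <- pack{payload: make([]byte, 1024)}
		}
	}()
	// Forwarder: moves packs downstream.
	go func() {
		for p := range mqBuf {
			fgBuf <- p
		}
	}()
	// Consumer (flush/sync) is stalled, so both buffers fill to capacity.
	time.Sleep(time.Second)
	fmt.Printf("packs pinned in buffers: %d\n", len(mqBuf)+len(fgBuf))
}

With msg packs of several megabytes instead of the 1 KB placeholder, a few thousand buffered slots per stage already adds up to tens of gigabytes, which matches the growth pattern reported above.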

@congqixia
Contributor

@yanliang567 after #26179 is merged, could you please verify with this parameter enlarged:

dataNode:
  dataSync:
    maxParallelSyncTaskNum: 2 # Maximum number of sync tasks executed in parallel in each flush manager
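
As an illustration only (the value below is hypothetical, not a recommended setting), the enlarged override for the verification run could look like:

dataNode:
  dataSync:
    maxParallelSyncTaskNum: 6 # illustrative value for the test; tune to the node's CPU and memory budget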

@xiaofan-luan
Collaborator

can we simplify the mqstream logic to make it easier to understand?

@yanliang567 yanliang567 modified the milestones: 2.3, 2.3.2 Oct 11, 2023
@yanliang567
Contributor Author

@congqixia any plans for fixing this issue in v2.3.2?

@congqixia
Contributor

congqixia commented Oct 16, 2023

@yanliang567 nope, L0 delta and other datanode refinements will be implemented after 2.3.2

@yanliang567
Contributor Author

moving to 2.3.3

@yanliang567 yanliang567 modified the milestones: 2.3.2, 2.3.3 Oct 16, 2023
@yanliang567 yanliang567 modified the milestones: 2.3.3, 2.3.4 Nov 16, 2023
@yanliang567
Contributor Author

moving to 2.4 for L0 deletion

@yanliang567 yanliang567 modified the milestones: 2.3.4, 2.4.0 Dec 5, 2023
@congqixia
Contributor

@yanliang567 now we shall verify whether this problem persists when the L0 segment is enabled
/assign @yanliang567

@yanliang567
Contributor Author

Will do once the L0 segment is enabled.
/unassign @congqixia

@yanliang567 yanliang567 removed the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Mar 5, 2024
@yanliang567
Contributor Author

Tested on 2.4-20240407-e3b65203-amd64:
datanode memory goes up to 38GB
(screenshot)

and the tt lag catches up from 5.3h to 200ms in about 60 minutes
(screenshot)

@xiaofan-luan
Collaborator

I thought the key might be to increase flush concurrency to make sure flush can catch up with the insertion rate.

@xiaofan-luan
Collaborator

/assign @congqixia

@congqixia
Contributor

@xiaofan-luan the scenario here is to verify the datanode behavior when the datanode is down for a long time.

@yanliang567 the last run did not limit the memory of the datanode. Memory usage went to around 40GB, so maybe it's still an issue here. Let's check what the behavior is when the datanode has a memory limit.

The catch-up time is about one hour for a 5-hour ttlag with insertion. Is this value good enough for our system? @xiaofan-luan @yanliang567 @tedxu @jaime0815

@xiaofan-luan
Collaborator

  1. How long do we stop the cluster?
  2. Is there anything we can improve? what is the bottleneck?

@yanliang567
Contributor Author

We did not stop the cluster; we just scaled the datanode replicas down to 0, inserted for 6 hours (~50M_768d data), and then brought one datanode back up.
@congqixia is working on a PR

congqixia added a commit to congqixia/milvus that referenced this issue Apr 11, 2024
See also milvus-io#27675 milvus-io#26177

Make memory check evict memory buffer until memory water level is safe.
Also make `EvictBuffer` wait until sync task done.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
congqixia added a commit to congqixia/milvus that referenced this issue Apr 11, 2024
See also milvus-io#27675 milvus-io#26177

Make memory check evict memory buffer until memory water level is safe.
Also make `EvictBuffer` wait until sync task done.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Apr 12, 2024
See also #27675 #26177

Make memory check evict memory buffer until memory water level is safe.
Also make `EvictBuffer` wait until sync task done.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
congqixia added a commit to congqixia/milvus that referenced this issue Apr 12, 2024
See also milvus-io#27675 milvus-io#26177

Make memory check evict memory buffer until memory water level is safe.
Also make `EvictBuffer` wait until sync task done.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
sre-ci-robot pushed a commit that referenced this issue Apr 12, 2024
…32172) (#32201)

Cherry-pick from master
pr: #32172
See also #27675 #26177

Make memory check evict memory buffer until memory water level is safe.
Also make `EvictBuffer` wait until sync task done.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
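
The commit message above describes the intended behavior; a minimal Go sketch of that idea, with hypothetical names and thresholds rather than Milvus's real types, could look like this: a memory check keeps syncing the largest segment buffer, waiting for each sync to finish, until total usage falls below the water level.

package main

import "fmt"

// segmentBuffer is a stand-in for a segment write buffer; all names and
// thresholds here are illustrative, not Milvus's actual implementation.
type segmentBuffer struct {
	segmentID int64
	sizeBytes int64
}

// syncAndRelease flushes the buffer and blocks until the sync task is done,
// mirroring "make EvictBuffer wait until sync task done".
func (b *segmentBuffer) syncAndRelease() {
	fmt.Printf("synced segment %d, released %d bytes\n", b.segmentID, b.sizeBytes)
	b.sizeBytes = 0
}

// evictUntilSafe keeps syncing the largest buffer until total usage is at or
// below the water level, mirroring "evict memory buffer until memory water
// level is safe".
func evictUntilSafe(buffers []*segmentBuffer, waterLevel int64) {
	for {
		var total, largest int64
		idx := -1
		for i, b := range buffers {
			total += b.sizeBytes
			if b.sizeBytes > largest {
				largest, idx = b.sizeBytes, i
			}
		}
		if total <= waterLevel || idx < 0 {
			return
		}
		buffers[idx].syncAndRelease()
	}
}

func main() {
	bufs := []*segmentBuffer{{1, 300 << 20}, {2, 700 << 20}, {3, 200 << 20}}
	evictUntilSafe(bufs, 512<<20) // keep buffered writes under an example 512MB water level
}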
@yanliang567 yanliang567 modified the milestones: 2.4.0, 2.4.1 Apr 18, 2024
@yanliang567
Contributor Author

On master-20240426-bed6363f, the tt lag catches up quickly, but the datanode uses memory without any limit; OOM occurred several times in an 8c32g datanode pod.
(screenshots)

@yanliang567 yanliang567 modified the milestones: 2.4.1, 2.4.2 May 7, 2024

stale bot commented Jun 10, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.

@stale stale bot added the stale indicates no updates for 30 days label Jun 10, 2024
@stale stale bot closed this as completed Jul 1, 2024