[BUG] Heap space goes out of memory and the node crashes when taking snapshots #14666
Comments
I did another snapshot test today. For this test I set the resources of the data node pods to the following:

Limits:
cpu: 1500m
memory: 2100Mi
Requests:
cpu: 1200m
memory: 2000Mi

Then I chose an index whose primary store size is 11.9GB. First I triggered a snapshot of that index to an Azure storage account repository. The snapshot process started. It was in an […]

Next, I triggered a snapshot of that same index to an S3 repository. This ran for about 6 minutes and the snapshot succeeded. There were no node crashes. So the problem appears to be with snapshots to Azure storage accounts. Could it be a memory leak in the repository-azure plugin?

Furthermore, from what we understand, circuit breakers should prevent the heap from going out of memory. But in this case, that also did not happen.
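For what it's worth, here is a rough sketch of how heap usage and circuit breaker state can be watched on the data nodes while a snapshot runs; the endpoint is a placeholder for this cluster:

```bash
# Rough sketch: poll JVM heap and circuit breaker stats while the snapshot runs.
# "localhost:9200" is a placeholder for the cluster endpoint.
while true; do
  curl -s "localhost:9200/_nodes/stats/jvm,breaker?pretty" \
    | grep -E '"heap_used_percent"|"estimated_size"|"tripped"'
  sleep 5
done
```

If the parent breaker were doing its job, the tripped counters would be expected to increase before the heap is exhausted.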
In the sample above, can you share more details on the number of snapshot threads that have been configured per container?
@gulgulni, are you looking for this value? I got it from the cluster settings defaults:

"defaults": {
  "snapshot.max_concurrent_operations": "1000"
}

If not, could you let me know where I can get that detail from?
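For reference, a minimal sketch of how that value can be pulled; the endpoint is a placeholder, and the setting only shows up under "defaults" unless it has been overridden:

```bash
# Include default settings (flattened) and pick out the snapshot concurrency setting.
curl -s "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true" \
  | grep snapshot.max_concurrent_operations
```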
I tested this with several OpenSearch versions and these are the results.
@linuxpi please find the histogram of the heap dump below.
@nilushancosta looks like 'Problem suspect 1' is worth diving into. Can you share the stacktrace and the stacktrace with involved local variables?
@linuxpi please find the stack trace below.
The stacktrace with involved local variables is expandable as shown below. Therefore, could you please let me know which ones you need?
@linuxpi, did you get a chance to look into this?
Describe the bug
When I try to take a snapshot to an Azure Storage account (using the repository-azure plugin), the data node carrying out the snapshot process crashes, causing the snapshot to fail. A short while after the snapshot is started, it fails, and the snapshot details API shows node shutdown as the reason for the failure.
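E.g. the snapshot details can be checked with a call of roughly this form (repository and snapshot names here are placeholders, not the ones from the actual cluster):

```bash
# Get snapshot details, including state and per-shard failure reasons.
# "azure-repo" and "snapshot-1" are placeholder names.
curl -s "localhost:9200/_snapshot/azure-repo/snapshot-1?pretty"
```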
The data node that is taking the snapshot runs out of heap memory, causing the pod to crash and restart.
Following is the log printed by the data node before it crashed:
Initially the OpenSearch cluster had the following resources allocated for each data node.
So these data nodes had a heap of 700Mi, as the heap is set to 50% of the requested memory by default.
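As a side note, that 50% default can be overridden by pinning the heap explicitly; a minimal sketch, assuming the heap is controlled through the OPENSEARCH_JAVA_OPTS environment variable (the operator may expose this differently, and the values are illustrative only):

```bash
# Pin the JVM heap instead of relying on the 50%-of-requested-memory default.
# Illustrative values; adjust to the pod's memory request.
export OPENSEARCH_JAVA_OPTS="-Xms1g -Xmx1g"
```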
When I tried to take a snapshot of one index (which had a primary shard of 1.1GB and 2 replica shards), the data node crashed with the above error.
When I increased the resources to the following (which results in a heap size of 1Gi), I was able to take the snapshot with one index.
But when I tried to take a snapshot with more indexes using the above resources, the data node's Java heap went out of memory again. I tried several times, but every attempt resulted in the snapshot failing due to the same heap memory issue.
Related component
Storage:Snapshots
To Reproduce
Set the following resources in the data nodes
Try to take a snapshot of an index which has a shard size of more than 1.1GB (see the sketch below)
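A minimal sketch of the calls involved, assuming an Azure repository registered through the repository-azure plugin; the endpoint, container, base_path, repository, and index names are placeholders:

```bash
# Register an Azure-backed snapshot repository (placeholder container and base_path).
curl -s -X PUT "localhost:9200/_snapshot/azure-repo" \
  -H 'Content-Type: application/json' \
  -d '{ "type": "azure", "settings": { "container": "my-container", "base_path": "snapshots" } }'

# Trigger a snapshot of one large index without waiting for completion.
curl -s -X PUT "localhost:9200/_snapshot/azure-repo/test-snapshot?wait_for_completion=false" \
  -H 'Content-Type: application/json' \
  -d '{ "indices": "my-large-index" }'
```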
Expected behavior
Expected the snapshot to complete
Additional Details
Host/Environment (please complete the following information):
Kubernetes 1.28.5
OpenSearch 2.11.1
OpenSearch Operator 2.6.0
Additional context
The OpenSearch cluster runs with 3 dedicated master nodes and 3 dedicated data nodes. Container logs are collected by Fluent Bit and sent to OpenSearch. They are published to a daily index. There is 1 primary shard and 2 replica shards per index. An Index State Management policy deletes indexes older than 30 days.
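For completeness, the retention policy mentioned above follows the usual ISM shape; a rough sketch (the policy ID and exact fields below are illustrative, not the actual policy in use):

```bash
# Illustrative ISM policy that deletes indexes 30 days after creation.
# "delete-after-30d" is a placeholder policy ID.
curl -s -X PUT "localhost:9200/_plugins/_ism/policies/delete-after-30d" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "description": "Delete indexes older than 30 days",
      "default_state": "hot",
      "states": [
        {
          "name": "hot",
          "actions": [],
          "transitions": [
            { "state_name": "delete", "conditions": { "min_index_age": "30d" } }
          ]
        },
        {
          "name": "delete",
          "actions": [ { "delete": {} } ],
          "transitions": []
        }
      ]
    }
  }'
```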