[BUG] Heap space goes out of memory and the node crashes when taking snapshots #14666

nilushancosta commented Jul 7, 2024

Describe the bug

When I try to take a snapshot to an Azure Storage account (using the repository-azure plugin), the data node carrying out the snapshot process crashes, causing the snapshot to fail.

A short while after the snapshot starts, it fails. The snapshot details API reports node shutdown as the reason for the failure, e.g.:

"failures": [
        {
          "index": "logs-2024-06-20",
          "index_uuid": "logs-2024-06-20",
          "shard_id": 0,
          "reason": "node shutdown",
          "node_id": "7WaZsHPmQ4apTHxhR4qgGQ",
          "status": "INTERNAL_SERVER_ERROR"
        }
]
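(For reference, the failure details above come from a snapshot details call along the following lines; the repository and snapshot names here are placeholders:)

GET _snapshot/azure-repo/snapshot-1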

The heap memory of the data node taking the snapshot goes out of memory, causing the pod to crash and restart.
The following log was printed by the data node before it crashed:

[2024-07-05T06:13:21,807][INFO ][o.o.j.s.JobSweeper       ] [opensearch-data-0] Running full sweep
[2024-07-05T06:13:22,520][INFO ][o.o.a.u.d.DestinationMigrationCoordinator] [opensearch-data-0] Detected cluster change event for destination migration
[2024-07-05T06:13:25,039][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [844187136]
[2024-07-05T06:13:25,059][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [844187136], after [714688000], allocations [186], duration [20]
[2024-07-05T06:13:30,133][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [1004094976]
[2024-07-05T06:13:30,146][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [1004094976], after [903587328], allocations [33], duration [13]
[2024-07-05T06:13:35,159][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [898535424]
[2024-07-05T06:13:35,175][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [898535424], after [662671376], allocations [136], duration [16]
[2024-07-05T06:13:43,346][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [839589376]
[2024-07-05T06:13:43,364][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did bring memory usage down, before [839589376], after [814423552], allocations [191], duration [18]
[2024-07-05T06:13:46,713][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-data-0] [gc][67511] overhead, spent [985ms] collecting in the last [1.2s]
java.lang.OutOfMemoryError: Java heap space
Dumping heap to data/java_pid30.hprof ...
[2024-07-05T06:13:48,764][WARN ][o.o.m.j.JvmGcMonitorService] [opensearch-data-0] [gc][67512] overhead, spent [1.9s] collecting in the last [2s]
[2024-07-05T06:13:48,764][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] attempting to trigger G1GC due to high heap usage [1015699088]
Heap dump file created [1188532983 bytes in 7.458 secs]
[2024-07-05T06:13:56,221][INFO ][o.o.i.b.HierarchyCircuitBreakerService] [opensearch-data-0] GC did not bring memory usage down, before [1015699088], after [1016929696], allocations [1], duration [7457]
[2024-07-05T06:13:57,069][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [opensearch-data-0] fatal error in thread [opensearch[opensearch-data-0][snapshot][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
	at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:323) ~[?:?]
	at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:635) ~[?:?]
	at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:646) ~[?:?]
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215) ~[?:?]
	at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:180) ~[?:?]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:137) ~[?:?]
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:129) ~[?:?]
	at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:378) ~[?:?]
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:169) ~[?:?]
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160) ~[?:?]
	at io.netty.handler.ssl.SslHandler$SslEngineType$3.allocateWrapBuffer(SslHandler.java:335) ~[?:?]
	at io.netty.handler.ssl.SslHandler.allocateOutNetBuf(SslHandler.java:2364) ~[?:?]
	at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:866) ~[?:?]
	at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:821) ~[?:?]
	at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:802) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:925) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893) ~[?:?]
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.flush(CombinedChannelDuplexHandler.java:531) ~[?:?]
	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:125) ~[?:?]
	at io.netty.channel.CombinedChannelDuplexHandler.flush(CombinedChannelDuplexHandler.java:356) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:923) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893) ~[?:?]
	at reactor.netty.channel.MonoSendMany$SendManyInner.run(MonoSendMany.java:325) ~[?:?]
	at reactor.netty.channel.MonoSendMany$SendManyInner.trySchedule(MonoSendMany.java:434) ~[?:?]
	at reactor.netty.channel.MonoSendMany$SendManyInner.onNext(MonoSendMany.java:223) ~[?:?]
	at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[?:?]
	at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122) ~[?:?]
	at reactor.core.publisher.FluxHandle$HandleSubscriber.onNext(FluxHandle.java:128) ~[?:?]
	at reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber.onNext(FluxConcatArray.java:201) ~[?:?]
	at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:335) ~[?:?]
fatal error in thread [opensearch[opensearch-data-0][snapshot][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
	at io.netty.util.internal.PlatformDependent.allocateUninitializedArray(PlatformDependent.java:323)
	at io.netty.buffer.PoolArena$HeapArena.newByteArray(PoolArena.java:635)
	at io.netty.buffer.PoolArena$HeapArena.newChunk(PoolArena.java:646)
	at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:215)
	at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:180)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:137)
	at io.netty.buffer.PoolArena.allocate(PoolArena.java:129)
	at io.netty.buffer.PooledByteBufAllocator.newHeapBuffer(PooledByteBufAllocator.java:378)
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:169)
	at io.netty.buffer.AbstractByteBufAllocator.heapBuffer(AbstractByteBufAllocator.java:160)
	at io.netty.handler.ssl.SslHandler$SslEngineType$3.allocateWrapBuffer(SslHandler.java:335)
	at io.netty.handler.ssl.SslHandler.allocateOutNetBuf(SslHandler.java:2364)
	at io.netty.handler.ssl.SslHandler.wrap(SslHandler.java:866)
	at io.netty.handler.ssl.SslHandler.wrapAndFlush(SslHandler.java:821)
	at io.netty.handler.ssl.SslHandler.flush(SslHandler.java:802)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:925)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893)
	at io.netty.channel.CombinedChannelDuplexHandler$DelegatingChannelHandlerContext.flush(CombinedChannelDuplexHandler.java:531)
	at io.netty.channel.ChannelOutboundHandlerAdapter.flush(ChannelOutboundHandlerAdapter.java:125)
	at io.netty.channel.CombinedChannelDuplexHandler.flush(CombinedChannelDuplexHandler.java:356)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush0(AbstractChannelHandlerContext.java:923)
	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:907)
	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:893)
	at reactor.netty.channel.MonoSendMany$SendManyInner.run(MonoSendMany.java:325)
	at reactor.netty.channel.MonoSendMany$SendManyInner.trySchedule(MonoSendMany.java:434)
	at reactor.netty.channel.MonoSendMany$SendManyInner.onNext(MonoSendMany.java:223)
	at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122)
	at reactor.core.publisher.FluxMap$MapSubscriber.onNext(FluxMap.java:122)
	at reactor.core.publisher.FluxHandle$HandleSubscriber.onNext(FluxHandle.java:128)
	at reactor.core.publisher.FluxConcatArray$ConcatArraySubscriber.onNext(FluxConcatArray.java:201)
	at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:335)

Initially the OpenSearch cluster had the following resources allocated for each data node.

requests:
    cpu: "500m"
    memory: "1400Mi"
limits:
    cpu: "800m"
    memory: "1700Mi"

So these data nodes had a heap of 700Mi, since the heap defaults to 50% of the requested memory.
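(For reference, the heap can also be pinned explicitly instead of relying on the 50% default. A minimal sketch using the OPENSEARCH_JAVA_OPTS environment variable honored by the official OpenSearch image; how this gets wired into the pod spec depends on the operator:)

env:
  - name: OPENSEARCH_JAVA_OPTS
    value: "-Xms700m -Xmx700m"   # pin min/max heap explicitly; 700m mirrors the 50%-of-1400Mi default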
When I tried to take a snapshot of one index (which had a primary shard of 1.1GB and 2 replica shards), the data node crashed with the above error.

When I increased the resources to the following (which results in a heap size of 1Gi), I was able to take the snapshot of that one index.

requests:
    cpu: "500m"
    memory: "2000Mi"
limits:
    cpu: "800m"
    memory: "2100Mi"

But when I tried to take a snapshot with more indexes using the above resources, the data node’s Java heap went out of memory again. I tried several times, but every attempt resulted in the snapshot failing due to the same heap memory issue.

Related component

Storage:Snapshots

To Reproduce

Set the following resources on the data nodes:

requests:
    cpu: "500m"
    memory: "1400Mi"
limits:
    cpu: "800m"
    memory: "1700Mi"

Try to take a snapshot of an index whose shard size is more than 1.1GB.
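(A sketch of the snapshot calls involved, with placeholder repository and container names; assumes the repository-azure plugin is installed and the storage account credentials are already configured in the OpenSearch keystore:)

PUT _snapshot/azure-repo
{
  "type": "azure",
  "settings": {
    "container": "opensearch-snapshots",
    "base_path": "snapshots"
  }
}

PUT _snapshot/azure-repo/snapshot-1?wait_for_completion=false
{
  "indices": "logs-2024-06-20"
}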

Expected behavior

Expected the snapshot to complete

Additional Details

Host/Environment:
Kubernetes 1.28.5
OpenSearch 2.11.1
OpenSearch Operator 2.6.0

Additional context
The OpenSearch cluster runs with 3 dedicated master nodes and 3 dedicated data nodes. Container logs are collected by Fluent Bit and sent to OpenSearch. They are published to a daily index. There is 1 primary shard and 2 replica shards per index. An Index State Management policy deletes indexes older than 30 days.
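(For reference, the retention policy is roughly of the following shape; this is an illustrative ISM sketch with placeholder names, not the exact policy in use:)

PUT _plugins/_ism/policies/delete-after-30d
{
  "policy": {
    "description": "Delete daily log indexes older than 30 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "30d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": ["logs-*"],
      "priority": 1
    }
  }
}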


nilushancosta commented Jul 9, 2024

I did another snapshot test today.

For this test I set the resources of the data node pods to the following.

Limits:
  cpu: 1500m
  memory: 2100Mi
Requests:
  cpu: 1200m
  memory: 2000Mi

Then I chose an index whose primary store size is 11.9GB.

First, I triggered a snapshot of that index to an Azure storage account repository. The snapshot started, stayed in the IN_PROGRESS state for about one minute, and then failed with the same node shutdown error. The data node hit the same Java heap out-of-memory error, leading to a pod restart.

Next, I triggered a snapshot of that same index to an S3 repository. This ran for about 6 minutes and the snapshot succeeded. There were no node crashes.

So the problem appears to be with snapshots to Azure storage accounts. Could it be a memory leak in the repository-azure plugin?

Furthermore, from what we understand, circuit breakers should prevent the heap from going out of memory. But in this case, that did not happen either.
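(For anyone investigating the breaker side, the per-node breaker limits and trip counts can be inspected with the node stats API, e.g.:)

GET _nodes/stats/breaker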

@gulgulni

For the setup above, could you share more details on the number of snapshot threads configured per container?

@nilushancosta

@gulgulni, are you looking for this value? I got it from the cluster settings using the /_cluster/settings API:

"defaults": {
   "snapshot.max_concurrent_operations": "1000"
}

If not, could you let me know where I can get that detail from?
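(If the question is about the per-node snapshot thread pool, I am guessing something like the following cat API query would show it:)

GET _cat/thread_pool/snapshot?v&h=node_name,name,size,active,queue,rejected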


nilushancosta commented Jul 23, 2024

I tested this with several OpenSearch versions and these are the results.
2.15.0 - Issue exists
2.11.1 - Issue exists
2.9.0 - Issue exists
2.8.0 - Issue does not exist; the snapshot works


linuxpi commented Jul 25, 2024

[Storage Triage]

@nilushancosta could you share a heap dump histogram? It would help analyze what's causing this.


nilushancosta commented Jul 31, 2024

@linuxpi please find the histogram of the heap dump below.
This particular data node pod had 2000Mi requests and 2100Mi limits, with -Xmx1000M -Xms1000M.

[Screenshot: class histogram of the heap dump]

Please also find the leak suspect report
[Screenshots: leak suspect report]
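(For completeness, a class histogram can also be captured directly from the running JVM with standard JDK tooling, assuming jmap is available in the container image; the PID here matches the java_pid30.hprof dump above:)

jmap -histo:live 30 | head -n 40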


linuxpi commented Aug 5, 2024

@nilushancosta looks like 'Problem suspect 1' is worth diving into. Can you share the stack trace and the stack trace with involved local variables?

@nilushancosta

@linuxpi please find the stack trace below

opensearch[opensearch-data-0][snapshot][T#1]
  at java.lang.OutOfMemoryError.<init>()V (OutOfMemoryError.java:48)
  at com.azure.storage.common.Utility.lambda$convertStreamToByteBuffer$1([JIJLjava/io/InputStream;Ljava/io/InputStream;Lreactor/core/publisher/SynchronousSink;)Ljava/io/InputStream; (Utility.java:264)
  at com.azure.storage.common.Utility$$Lambda$6369+0x00007f4b211d95a8.apply(Ljava/lang/Object;Ljava/lang/Object;)Ljava/lang/Object; ()
  at reactor.core.publisher.FluxGenerate$GenerateSubscription.slowPath(J)V (FluxGenerate.java:271)
  at reactor.core.publisher.FluxGenerate$GenerateSubscription.request(J)V (FluxGenerate.java:213)
  at reactor.core.publisher.FluxFilter$FilterSubscriber.request(J)V (FluxFilter.java:186)
  at reactor.core.publisher.FluxConcatMapNoPrefetch$FluxConcatMapNoPrefetchSubscriber.request(J)V (FluxConcatMapNoPrefetch.java:336)
  at reactor.core.publisher.FluxSwitchOnFirst$AbstractSwitchOnFirstMain.onSubscribe(Lorg/reactivestreams/Subscription;)V (FluxSwitchOnFirst.java:499)
  at reactor.core.publisher.FluxConcatMapNoPrefetch$FluxConcatMapNoPrefetchSubscriber.onSubscribe(Lorg/reactivestreams/Subscription;)V (FluxConcatMapNoPrefetch.java:164)
  at reactor.core.publisher.FluxFilter$FilterSubscriber.onSubscribe(Lorg/reactivestreams/Subscription;)V (FluxFilter.java:85)
  at reactor.core.publisher.FluxGenerate.subscribe(Lreactor/core/CoreSubscriber;)V (FluxGenerate.java:85)
  at reactor.core.publisher.FluxDefer.subscribe(Lreactor/core/CoreSubscriber;)V (FluxDefer.java:54)
  at reactor.core.publisher.Mono.subscribe(Lorg/reactivestreams/Subscriber;)V (Mono.java:4485)
  at reactor.core.publisher.Mono.block()Ljava/lang/Object; (Mono.java:1709)
  at com.azure.storage.common.implementation.StorageImplUtils.blockWithOptionalTimeout(Lreactor/core/publisher/Mono;Ljava/time/Duration;)Ljava/lang/Object; (StorageImplUtils.java:133)
  at com.azure.storage.blob.BlobClient.uploadWithResponse(Lcom/azure/storage/blob/options/BlobParallelUploadOptions;Ljava/time/Duration;Lcom/azure/core/util/Context;)Lcom/azure/core/http/rest/Response; (BlobClient.java:337)
  at org.opensearch.repositories.azure.AzureBlobStore.lambda$writeBlob$18(Lcom/azure/storage/blob/BlobClient;Ljava/io/InputStream;JLcom/azure/storage/blob/models/BlobRequestConditions;Lorg/opensearch/common/collect/Tuple;Ljava/lang/String;)V (AzureBlobStore.java:365)
  at org.opensearch.repositories.azure.AzureBlobStore$$Lambda$6356+0x00007f4b211b87b8.executeCouldThrow()V ()
  at org.opensearch.repositories.azure.SocketAccess.lambda$doPrivilegedVoidException$0(Lorg/opensearch/repositories/azure/SocketAccess$StorageRunnable;)Ljava/lang/Void; (SocketAccess.java:82)
  at org.opensearch.repositories.azure.SocketAccess$$Lambda$6297+0x00007f4b210ea090.run()Ljava/lang/Object; ()
  at java.security.AccessController.executePrivileged(Ljava/security/PrivilegedExceptionAction;Ljava/security/AccessControlContext;Ljava/lang/Class;)Ljava/lang/Object; (AccessController.java:807)
  at java.security.AccessController.doPrivileged(Ljava/security/PrivilegedExceptionAction;)Ljava/lang/Object; (AccessController.java:569)
  at org.opensearch.repositories.azure.SocketAccess.doPrivilegedVoidException(Lorg/opensearch/repositories/azure/SocketAccess$StorageRunnable;)V (SocketAccess.java:81)
  at org.opensearch.repositories.azure.AzureBlobStore.writeBlob(Ljava/lang/String;Ljava/io/InputStream;JZ)V (AzureBlobStore.java:364)
  at org.opensearch.repositories.azure.AzureBlobContainer.writeBlob(Ljava/lang/String;Ljava/io/InputStream;JZ)V (AzureBlobContainer.java:133)
  at org.opensearch.repositories.blobstore.BlobStoreRepository.snapshotFile(Lorg/opensearch/index/snapshots/blobstore/BlobStoreIndexShardSnapshot$FileInfo;Lorg/opensearch/repositories/IndexId;Lorg/opensearch/core/index/shard/ShardId;Lorg/opensearch/snapshots/SnapshotId;Lorg/opensearch/index/snapshots/IndexShardSnapshotStatus;Lorg/opensearch/index/store/Store;)V (BlobStoreRepository.java:3622)
  at org.opensearch.repositories.blobstore.BlobStoreRepository.lambda$executeOneFileSnapshot$70(Lorg/opensearch/index/store/Store;Lorg/opensearch/index/snapshots/IndexShardSnapshotStatus;Lorg/opensearch/core/index/shard/ShardId;Lorg/opensearch/index/snapshots/blobstore/BlobStoreIndexShardSnapshot$FileInfo;Lorg/opensearch/repositories/IndexId;Lorg/opensearch/snapshots/SnapshotId;Ljava/util/concurrent/BlockingQueue;Ljava/util/concurrent/Executor;Lorg/opensearch/core/action/ActionListener;)V (BlobStoreRepository.java:3051)
  at org.opensearch.repositories.blobstore.BlobStoreRepository$$Lambda$6534+0x00007f4b21174918.accept(Ljava/lang/Object;)V ()
  at org.opensearch.action.ActionRunnable$2.doRun()V (ActionRunnable.java:89)
  at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun()V (ThreadContext.java:908)
  at org.opensearch.common.util.concurrent.AbstractRunnable.run()V (AbstractRunnable.java:52)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (ThreadPoolExecutor.java:1136)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run()V (ThreadPoolExecutor.java:635)
  at java.lang.Thread.run()V (Thread.java:833)

The stack trace with involved local variables is expandable, as shown below. Could you let me know which ones you need?
[Screenshots: expanded stack trace with local variables]

@nilushancosta

@linuxpi, did you get a chance to look into this?
